Editorial
Biomedical Data Mining
N. Peek1; C. Combi2; A. Tucker3
1Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands; 2Department of Computer Science, University of Verona, Verona, Italy; 3School of Information Systems, Computing and Mathematics, Brunel University, London, UK
Keywords Data mining, machine learning
Summary Objective: To introduce the special topic of Methods of Information in Medicine on data mining in biomedicine, with selected papers from two workshops on Intelligent Data Analysis in bioMedicine (IDAMAP) held in Verona (2006) and Amsterdam (2007). Methods: Defining the field of biomedical data mining. Characterizing current developments and challenges for researchers in the field. Reporting on current and future activities of IMIA’s working group on Intelligent Data Analysis and Data Mining. Describing the content of the selected papers in this special topic. Results and Conclusions: In the biomedical field, data mining methods are used to develop clinical diagnostic and prognostic systems, to interpret biomedical signal and image data, to discover knowledge from biological and clinical databases, and in biosurveillance and anomaly detection applications. The main challenges for the field are i) dealing with very large search spaces in a manner that is both computationally efficient and statistically valid, ii) incorporating and utilizing medical and biological background knowledge in the data analysis process, iii) reasoning with time-oriented data and temporal abstraction, and iv) developing end-user tools for interactive presentation, interpretation, and analysis of large datasets.
Correspondence to: Niels Peek Department of Medical Informatics Academic Medical Center University of Amsterdam P.O. Box 22700 1100 DE Amsterdam The Netherlands E-mail:
[email protected] Methods Inf Med 2009; 48: 225–228
What Is Data Mining? The goal of this special topic is to survey the current state of affairs in biomedical data mining. Data mining is generally described as the (semi-)automatic process of discovering interesting patterns in large amounts of data [1–4]. It is an essential activity to translate the increasing abundance of data in the biomedical field into information that is meaningful and valuable for practitioners. Traditional data analysis methods, such as those originating from statistics, often fail to work when datasets are sizeable, relational in nature, multimedia-based, or object-oriented. This has led to a rapid development of novel data analysis methods that are increasingly receiving attention in the biomedical literature. Data mining is a young and interdisciplinary field, drawing from fields such as database systems, data warehousing, machine learning, statistics, signal analysis, data visualization, information retrieval, and high-performance computing. It has been successfully applied in diverse areas such as marketing, finance, engineering, security, games, and science. Rather than comprising a clear-cut set of methods, the term “data mining” refers to an eclectic approach to data analysis where choices are guided by pragmatic considerations concerning the problem at hand. Broadly speaking, the goals of data mining can be classified into two categories: description and prediction [2–4]. Descriptive data mining attempts to discover implicit and previously unknown knowledge, which can be used by humans in making decisions. In this case, data mining is part of a larger knowledge discovery process that includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and presentation of discovered knowledge to end-users. To arrive at usable results, it is essential that the discovered patterns are comprehensible to humans. Typical descriptive data mining tasks are unsupervised machine learning problems such as mining frequent
patterns, finding interesting associations and correlations in data, cluster analysis, outlier analysis, and evolution analysis. Predictive data mining seeks to find a model or function that predicts some crucial but as yet unknown property of a given object, given a set of currently known properties. In prognostic data mining, for instance, one seeks to predict the occurrence of future medical events before they actually occur, based on patients’ conditions, medical histories, and treatments [5]. Predictive data mining tasks are typically supervised machine learning problems such as regression and classification. Well-known supervised learning algorithms are decision tree learners, rule-based classifiers, Bayesian classifiers, linear and logistic regression analysis, artificial neural networks, and support vector machines. The models that result from predictive data mining may be embedded in information systems and need not, in that case, always be comprehensible to humans, even though a sound motivation of the provided prediction is often required in the medical field. The distinction between descriptive and predictive data mining is not always clear-cut. Interesting patterns that were found with descriptive data mining techniques can sometimes be used for predictive purposes. Conversely, a comprehensible predictive model (e.g. a decision tree) may highlight interesting patterns and thus have descriptive qualities. It may also be useful to alternate between descriptive and predictive activities within a data mining process. In all cases, the results of descriptive and predictive data mining should be valid on new, unseen data in order to be valuable to, and trusted by, end-users.
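To make the predictive setting concrete, the hedged sketch below (our illustration, not part of the original editorial) trains a logistic regression classifier and checks its performance on held-out data; the synthetic data and all parameter choices are assumptions standing in for a real clinical dataset.

```python
# Minimal sketch of predictive data mining: fit a supervised classifier and
# verify that it remains valid on new, unseen data (a held-out test set).
# The synthetic data below is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Fabricated "patients": 500 cases, 20 candidate predictors, binary outcome.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Keep 30% of the cases aside so performance is estimated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = model.predict_proba(X_test)[:, 1]        # predicted outcome probabilities
print("test-set AUC:", round(roc_auc_score(y_test, risk), 3))
```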
Data Mining in Biomedicine Data mining can be applied in biomedicine for a large variety of purposes, and is thus connected to diverse biomedical subfields. Traditionally, data mining and machine learning in biomedicine focused on clinical applications, such as decision support to medical practitioners and interpretation of signal and image data. More recently, applications in epidemiology, bioinformatics, and biosurveillance have received increasing attention.
Clinical data mining applications are mostly predictive in nature and attempt to derive models that use patient-specific information to predict a patient’s diagnosis, prognosis, or any other outcome of interest and to thereby support clinical decision-making [6]. Historically, diagnostic applications have received most attention [7–9], but in the last decade prognostic models have become more popular [5, 10, 11]. Other tasks that are addressed with clinical data mining are detection of data artifacts [12] and adverse events [13], discovering homogeneous subgroups of patients [14], and extracting meaningful features from signal and image data [15]. Several characteristic features of clinical data may complicate the data mining process, such as the frequent and often meaningful occurrence of missing values, and the fact that data values (e.g. diagnostic categories) may stem from structured and very large medical vocabularies such as the ICD [16]. Furthermore, when the data were collected in routine care settings, it may be misleading to draw conclusions from the data with respect to causal effects of therapies. Data from randomized controlled studies enable researchers to compute unbiased estimates of causal effects, as these studies ensure exchangeability of patient groups [17]. In observational data, however, the analysis is biased due to the lack of this exchangeability. In recent years, biomedical data mining has received a strong impetus from research in molecular biology. In this field, datasets fall into three classes: i) sequence data, often represented by a collection of single nucleotide polymorphisms (SNPs) [18]; ii) gene expression data, which can be measured with DNA microarrays to obtain a snapshot of the activity of all genes in one tissue at a given time [19], and iii) protein expression data, which can include a complete set of protein profiles obtained with mass spectrometry technologies, or a few protein markers [20]. Initially, most genomic and proteomic research focused upon working with individual data sources and achieved considerable success. However, a number of key barriers have been encountered, concerning for example the variability in microarray data [21] and the enormous search spaces involved with identifying protein-protein interactions and folding, which require substantial data samples. An alternative approach to dealing directly with
genomic and proteomic data is to perform literature mining, which aims to discover related genes and proteins through analysis of biomedical publications [22]. Recent developments have explored methods to combine data sources such as meta-analysis and consensus algorithms for homogeneous data [23] and Bayesian priors for heterogeneous data [24]. Another major recent development that aims to combine data and knowledge is systems biology. This is an emerging field that attempts to model an entire organism (or a major system within an organism) as a whole [25]. It is starting to show genuine promise when combined with data mining [26], particularly in certain biological subsystems such as the cardiovascular and immune systems. Although many data mining concepts are today well-established and toolsets are available to readily apply data mining algorithms [27, 28], various challenges remain for researchers in the biomedical data mining field. First and foremost, biomedical datasets continue to grow in terms of the number of variables (measurements) per patient. This results in exponentially growing search spaces of hypotheses that are explored by data mining algorithms. An important challenge is to deal with these very large search spaces in a manner that is both computationally efficient and statistically valid. Second, knowledge discovery activities are only meaningful when they take advantage of existing background knowledge in the application area at hand [29]. Biomedical knowledge typically resides in structured medical vocabularies and ontologies, clinical practice guidelines and protocols, and results from scientific studies. Few existing data mining methods are capable of utilizing any of these forms of background knowledge. Third, a most characteristic feature of medical data is its temporal dimension. All clinical observations and interventions must occur at some point in time or during a time period, and medical jargon abounds with references to time and temporality [30]. Although the attention to temporal reasoning and data analysis has increased over the last decade [31–33], there is still a lack of established data mining methods that deal with temporality. The fourth and final challenge is the development of software tools for end users (such as biologists and medical professionals). With the growing amounts of data available, there is an
increasing need for interactive tools that support users in the presentation, interpretation, and analysis of datasets.
IMIA’s Working Group on Intelligent Data Analysis and Data Mining In 2000, a Working Group on Intelligent Data Analysis and Data Mining was established as part of the International Medical Informatics Association (IMIA) [34]. The objectives of the working group are i) to increase the awareness and acceptance of intelligent data analysis and data mining methods in the biomedical community, ii) to foster scientific discussion and disseminate new knowledge on AI-based methods for data analysis and data mining techniques applied to medicine, iii) to promote the development of standardized platforms and solutions for biomedical data analysis, and iv) to provide a forum for presentation of successful intelligent data analysis and data mining implementations in medicine. The working group’s main activity is the organization of a yearly workshop called Intelligent Data Analysis in bioMedicine and Pharmacology (IDAMAP) [35]. IDAMAP workshops are devoted to computational methods for data analysis in medicine, biology and pharmacology that present the results of analysis in a form communicable to domain experts and that exploit knowledge of the problem domain. Typical methods include data visualization, data exploration, machine learning, and data mining. Gathering in an informal setting, workshop participants have the opportunity to meet and discuss selected technical topics in an atmosphere that fosters the active exchange of ideas among researchers and practitioners. IDAMAP workshops have been organized since 1996. The most recent workshops were held in Aberdeen (2005), Verona (2006), Amsterdam (2007), and Washington (2008). Other activities of the working group include the organization of tutorials and panel discussions at international conferences on the topics of intelligent data analysis and data mining in biomedicine. In all its activities, there is a close collaboration with the Working Group on Knowledge Discovery and Data Mining of AMIA [36].
Selected Papers From a total of 35 papers presented at the IDAMAP-2006 and IDAMAP-2007 workshops, the ten best papers were selected based on the workshop review reports, and the authors were invited to submit an extended version of their paper for the special topic. Eight authors responded positively, and five of these papers were finally accepted after blinded peer review. In our opinion, these papers form a representative sample of the current developments in biomedical data mining. The paper by Curk et al. [37] considers the problem of knowledge discovery from gene expression data, by searching for patterns of gene regulation in microarray datasets. Knowledge of gene regulation mechanisms is essential for understanding gene function and interactions between genes. Curk et al. present a novel descriptive data mining algorithm, called rule-based clustering, that finds groups of genes sharing combinations of promoter elements (regions of DNA that facilitate gene transcription). The main methodological challenge is the vast number of candidate combinations of genes and promoter regions, which is handled by the algorithm by employing a heuristic search method. Interesting features of this algorithm are that it yields a symbolic cluster representation, and, in contrast to traditional clustering techniques, allows for overlapping clusters. Bielza et al. [38] also address the analysis of microarray gene expression data, but focus on predictive data mining using logistic regression analysis. As discussed in the previous section, microarray datasets have created new methodological challenges for existing data analysis algorithms. In particular, the number of data attributes (genes) is typically much larger than the number of observations (patients), potentially resulting in unreliable statistical inferences due to a severe ‘multiple testing’ problem. One popular solution in biostatistics is regularization of the model parameters by setting a penalty on the total size of the estimated regression coefficients. However, estimation of regularized model parameters is a complex numeric optimization problem. The paper by Bielza et al. presents an evolutionary algorithm to solve the problem. The third paper, by Andreassen et al. [39], is clinically oriented and uses a Bayesian
learning method to solve a well-known problem in pharmacoepidemiology: discovering which bacterial pathogenic organisms can be treated with particular antibiotic drugs. Again, the large number of possible combinations that need to be considered poses problems for traditional data analysis methods. More specifically, many pathogen-drug combinations will not occur in the data at all, or only in such small numbers that reliable direct inferences are not possible. Andreassen et al. propose to solve this problem by borrowing statistical strength from observations on similar pathogens using hierarchical Dirichlet learning. Castellani et al. [40] consider the identification of tumor areas in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI), a technique that has recently expanded the range and application of imaging assessment in clinical research. DCE-MRI data consists of serial sets of images obtained before and after the injection of a paramagnetic contrast agent. Rapid acquisition of images allows an analysis of the variation of the MR signal intensity over time for each image voxel, which is indicative of the type of tissue represented by the voxel. Castellani et al. use support vector machines to classify the signal intensity time curves associated with image voxels. The fifth and final paper of the special topic, written by Klimov et al. [41], deals with the visual exploration of temporal clinical data. They present a new workbench, called VISITORS (VISualizatIon of Time-Oriented RecordS), which integrates knowledge-based temporal reasoning mechanisms with information visualization methods. The underlying concept is the temporal association chart, a list of raw or abstracted observations. The VISITORS system allows users to interactively visualize temporal data from a set of patient records.
References 1. Fayyad U, Piatetsky-Shapiro G, Smyth P. The KDD process for extracting useful knowledge from volumes of data. Commun ACM 1996; 39 (11): 27–34. 2. Hand DJ, Mannila H, Smyth P. Principles of Data Mining. Cambridge, Massachusetts: MIT Press; 2001. 3. Giudici P. Applied Data Mining: Statistical Methods for Business and Industry. London: John Wiley & Sons; 2003.
4. Han J, Kamber M. Data Mining. Concepts and Techniques. San Francisco, California: Morgan Kaufmann Publishers; 2006. 5. Abu-Hanna A, Lucas PJ. Prognostic models in medicine: AI and statistical approaches. Methods Inf Med 2001; 40 (1): 1–5. 6. Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 2008; 77 (2): 81–97. 7. Lavrac N, Kononenko I, Keravnou E, Kukar M, Zupan B. Intelligent data analysis for medical diagnosis: using machine learning and temporal abstraction. AI Commun 1998; 11: 191–218. 8. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med 2001; 23: 89–109. 9. Sakai S, Kobayashi K, Nakamura J, Toyabe S, Akazawa K. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods Inf Med 2007; 46 (6): 723–726. 10. Pfaff M , Weller K, Woetzel D, Guthke R, Schroeder K, Stein G, Pohlmeier R, Vienken J. Prediction of cardiovascular risk in hemodialysis patients by data mining. Methods Inf Med 2004; 43 (1): 106–113. 11. Tjortjis C, Saraee M, Theodoulidis B, Keane JA. Using T3, an improved decision tree classifier, for mining stroke-related medical data. Methods Inf Med 2007; 46 (5): 523–529. 12. Verduijn M, Peek N, de Keizer NF, van Lieshout EJ, de Pont AC, Schultz MJ, de Jonge E, de Mol BA. Individual and joint expert judgments as reference standards in artifact detection. J Am Med Inform Assoc 2008; 15 (2): 227–234. 13. Jakkula V, Cook DJ. Anomaly detection using temporal data mining in a smart home environment. Methods Inf Med 2008; 47 (1): 70–75. 14. Nannings B, Bosman RJ, Abu-Hanna A. A subgroup discovery approach for scrutinizing blood glucose management guidelines by the identification of hyperglycemia determinants in ICU patients. Methods Inf Med 2008; 47 (6): 480–488. 15. Lessmann B, Nattkemper TW, Hans VH, Degenhard A. A method for linking computed image fea-
tures to histological semantics in neuropathology. J Biomed Inform 2007; 40 (6): 631–641. 16. www.who.int/whosis/icd10. Last accessed Mar 3, 2009. 17. Hernán MA. A definition of causal effect for epidemiological research. J Epidemiol Community Health 2004; 58: 265–271. 18. Barker G, Batley J, O’Sullivan H, Edwards KJ, Edwards D. Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics 2003; 19 (3): 421–422. 19. Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. J Comput Biol 2000; 7 (3–4): 601–620. 20. Lobley A, Swindells MB, Orengo CA, Jones DT. Inferring function using patterns of native disorder in proteins. PLoS Comput Biol 2007; 3 (8): e162. 21. Choi JK, Yu U, Kim S, Yoo OJ. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 2003; 19 (Suppl 1): i84–i90. 22. Jelier R, Schuemie MJ, Roes PJ, van Mulligen EM, Kors JA. Literature-based concept profiles for gene annotation: the issue of weighting. Int J Med Inform 2008; 77 (5): 354–362. 23. Steele E, Tucker A. Consensus and meta-analysis regulatory networks for combining multiple microarray gene expression datasets. J Biomed Inform 2008; 41 (6): 914–926. 24. Husmeier D, Werhli AV. Bayesian integration of biological prior knowledge into the reconstruction of gene regulatory networks with Bayesian networks. In: Markstein P, Xu Y, editors. Computational Systems Bioinformatics, Volume 6: Proceedings of the CSB 2007 Conference. World Scientific; 2007. pp 85–95. 25. Kitano H. Computational systems biology. Nature 2002; 420: 206–210. 26. Zhang M. Interactive analysis of systems biology molecular expression data. BMC Systems Biol 2008; 2: 2–23. 27. Witten IH, Frank E. Data Mining. Practical Machine Learning Tools and Techniques. San Francisco, California: Morgan Kaufmann Publishers; 2005.
28. http://www.ailab.si/orange. Last accessed Mar 3, 2009. 29. Zupan B, Holmes JH, Bellazzi R. Knowledge-based data analysis and interpretation. Artif Intell Med 2006; 37 (3): 163–165. 30. Shahar Y. Dimensions of time in illness: an objective view. Ann Intern Med 2000; 132 (1): 45–53. 31. Combi C, Shahar Y. Temporal reasoning and temporal data maintenance in medicine: issues and challenges. Comput Biol Med 1997; 27 (5): 353–368. 32. Adlassnig KP, Combi C, Das AK, Keravnou ET, Pozzi G. Temporal representation and reasoning in medicine: Research directions and challenges. Artif Intell Med 2006; 38 (2): 101–113. 33. Stacey M, McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif Intell Med 2007; 39 (1): 1–24. 34. http://magix.fri.uni-lj.si/idadm. Last accessed Mar 3, 2009. 35. http://www.idamap.org. Last accessed Mar 3, 2009. 36. http://www.amia.org/mbrcenter/wg/kddm. Last accessed Mar 3, 2009. 37. Curk T, Petrovic U, Shaulsky G, Zupan B. Rule-based clustering for gene promoter structure discovery. Methods Inf Med 2009; 48: 229–235. 38. Bielza C, Robles V, Larrañaga P. Estimation of distribution algorithms as logistic regression regularizers of microarray classifiers. Methods Inf Med 2009; 48: 236–241. 39. Andreassen S, Zalounina A, Leibovici L, Paul M. Learning susceptibility of a pathogen to antibiotics using data from similar pathogens. Methods Inf Med 2009; 48: 242–247. 40. Castellani U, Cristani M, Daducci A, Farace P, Marzola P, Murino V, Sbarbati V. DCE-MRI data analysis for cancer area classification. Methods Inf Med 2009; 48: 248–253. 41. Klimov D, Shahar Y, Taieb-Maimon M. Intelligent interactive visual exploration of temporal associations among multiple time-oriented patient records. Methods Inf Med 2009; 48: 254–262.
Original Articles
Rule-based Clustering for Gene Promoter Structure Discovery
T. Curk1; U. Petrovic2; G. Shaulsky3; B. Zupan1, 3
1University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia; 2J. Stefan Institute, Department of Molecular and Biomedical Sciences, Ljubljana, Slovenia; 3Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, Texas, USA
Keywords Promoter analysis, gene expression analysis, machine learning, rule-based clustering
Summary Background: The genetic cellular response to internal and external changes is determined by the sequence and structure of gene-regulatory promoter regions. Objectives: Using data on gene-regulatory elements (i.e., either putative or known transcription factor binding sites) and data on gene expression profiles we can discover structural elements in promoter regions and infer the underlying programs of gene regulation. Such hypotheses obtained in silico can greatly assist us in experiment planning. The principal obstacle for such approaches is the combinatorial explosion in different combinations of promoter elements to be examined. Methods: Stemming from several state-of-the-art machine learning approaches we here propose a heuristic, rule-based clustering method that uses gene expression similarity to guide the search for informative structures in promoters, thus exploring only the most promising parts of the vast and expressively rich rule-space. Results: We present the utility of the method in the analysis of gene expression data on budding yeast S. cerevisiae where cells were induced to proliferate peroxisomes. Conclusions: We demonstrate that the proposed approach is able to infer informative relations uncovering relatively complex structures in gene promoter regions that regulate gene expression.
Correspondence to: Tomaz Curk, University of Ljubljana, Faculty of Comp. and Inf. Science, Trzaska c. 25, 1000 Ljubljana, Slovenia. E-mail: [email protected]
Methods Inf Med 2009; 48: 229–235; doi: 10.3414/ME9225; prepublished: April 20, 2009
1. Introduction Regulation of gene expression is a complex mechanism in the biology of eukaryotic cells. Cells carry out their function and respond to the environment through an orchestration of transcription factors and other signaling molecules that influence gene expression. The resulting products regulate the expression of other genes, thus forming diverse sets of regulatory pathways. To better understand gene function and gene interactions we need to uncover and analyze the programs of gene regulation. Computational analysis [1] of gene-regulatory regions, which can use information from known gene sequences, putative binding sites and sets of gene expression studies, can greatly speed up and automate the tedious discovery process performed by classical genetics. The regulatory region of a gene is defined as a stretch of DNA, which is normally located upstream of the gene’s coding region. Transcription factors are special proteins that bind to specific sequences (binding sites) in the regulatory regions, thus inhibiting or activating the expression of target genes. Regulation by binding of transcription factors is just one of many regulatory mechanisms. Expression is also determined by chromatin structure [2], epigenetic effects, post-transcriptional, translational, post-translational and other forms of regulation [3]. Because there is a
general lack of these kinds of data, most current computational studies focus on inference of relations between gene-regulatory content and gene expression measured using DNA microarrays [4]. Determination of the regulatory region and putative binding sites are the first crucial steps in such analyses. Regulatory and coding regions differ in nucleotide and codon frequency. This fact is successfully exploited by many prediction algorithms [5], and promoter (regulatory) sequences are readily available in public data bases for most model organisms. The next crucial, well-studied, and notoriously difficult step is to determine the transcription factors’ putative binding sites in promoter regions. These are 4- to 20-nucleotide-long DNA sequences [3] which are highly conserved in the promoter regions of regulated genes. A matrix representation of the frequencies of the four nucleotides (A, T, C, G) at each position in the binding site is normally used in computational analysis. The TRANSFAC data base [6] is a good source of experimentally confirmed and computationally inferred binding sites. Candidate binding sites for genes with unknown regulation can be found using local sequence alignment programs such as MEME [7]. A detailed description and evaluation of such tools is presented in the paper by Tompa et al. [8]. Most contemporary methods that try to relate gene structure and expression start with gene expression clustering and then determine cluster-specific binding sites [4, 9]. The success of such approaches strongly relies on the number and composition of gene clusters. Slight parameter changes in clustering procedures can lead to significantly different clustering [10, 11], and consequently to inference of different cluster-specific binding sites. Most often these methods search for non-overlapping clusters and may miss interesting relations, as it is known that genes can respond in many different ways and perform various functions [12].
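The short sketch below (our illustration; the motif length, counts, and log-odds scoring convention are assumptions, not the specific procedure used by MEME or TRANSFAC) shows how such a nucleotide frequency matrix can be turned into a position weight matrix and used to scan a promoter sequence for putative binding sites.

```python
# Hypothetical example of scoring a promoter sequence with a position
# frequency matrix (PFM) converted into a log-odds position weight matrix (PWM).
# The 4-nucleotide motif and its counts are invented; real matrices are longer.
import math

BASES = "ACGT"
pfm = [            # rows = motif positions, columns = counts of A, C, G, T
    [8, 0, 1, 1],  # position 1: mostly A
    [0, 9, 0, 1],  # position 2: mostly C
    [1, 0, 8, 1],  # position 3: mostly G
    [0, 1, 0, 9],  # position 4: mostly T
]
background = 0.25        # uniform background nucleotide frequency
pseudocount = 0.5

def log_odds(row):
    total = sum(row) + 4 * pseudocount
    return [math.log2((c + pseudocount) / total / background) for c in row]

pwm = [log_odds(row) for row in pfm]
motif_len = len(pwm)

def scan(sequence, threshold=4.0):
    """Return (position, score) for every window scoring above the threshold."""
    hits = []
    for i in range(len(sequence) - motif_len + 1):
        window = sequence[i:i + motif_len]
        score = sum(pwm[j][BASES.index(b)] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

promoter = "TTACGTGGACGTAA"
print(scan(promoter))    # putative binding-site matches in this toy promoter
```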
An alternative to clustering-first approaches is offered by methods that start with information on binding sites and search for descriptions shared by similarly expressed genes. For example, in an approach by Chiang et al. [13] the group’s pairwise gene expression intra-correlation is computed for each set of genes containing a specific binding site in their promoter regions. Their method reports on binding sites where this correlation is statistically significant, but fails to investigate combinations of two or more putative binding sites, even though it is known that regulation of gene expression can be highly combinatorial and requires the coordinated presence of many transcription factors. There are other approaches where combinations of binding sites are investigated, but they are often limited to the presence of two sites due to the combinatorial explosion of the search [4, 14]. For example, the number of all possible combinations of three binding sites, from a base of a thousand binding sites available for modeling, quickly grows into hundreds of millions. Transcription is also affected by absolute or relative orientation and distance between binding sites and other landmarks in the promoter region (e.g., the translation start ATG), further complicating the language that should be used to model promoter structure and subsequently increasing the search space.
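As a rough illustration of this growth (our arithmetic, not a figure quoted from the paper), the number of unordered triples that can be drawn from a pool of 1000 candidate binding sites is already

\[
\binom{1000}{3} = \frac{1000 \cdot 999 \cdot 998}{6} \approx 1.7 \times 10^{8},
\]

and every additional descriptor, such as orientation or a discretized distance, multiplies this count further.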
To overcome the limitations described above, we have devised a new algorithm that can infer potentially complex promoter sequence patterns and relate them to gene expression. In the approach, which we call rule-based clustering (RBC), we essentially borrowed from several approaches developed within machine learning that use heuristic search to cope with potentially huge search spaces. The uniqueness of the presented algorithm is its ability to discover groups of genes that share any combination of promoter elements, whose placement and orientation can be specified relative to the start of the gene or to another promoter element. Below, we first define the language we use to describe the constitution of the promoter region, then describe the RBC algorithm and finally illustrate its application on the analysis of peroxisome proliferation data on S. cerevisiae.
2. Rule-based Clustering Method The inputs to the proposed rule-based clustering (RBC) method are gene expression profiles and data on their promoter-regulatory elements. The algorithm does not include any preprocessing of expression data (e.g., normalization, scaling) and considers the data as provided. For each gene, the data on regulatory elements is given as a set of sequence motifs with their position relative to the start of the gene and their orientation. The motifs are represented either by a position weight matrix [7] or a single-line consensus; the former was used in all our experiments. The RBC algorithm aims to find clusters of similarly expressed genes with structurally similar promoter regions. The output of the algorithm is a set of rules of the form “IF structure THEN expression profile”, where structure is an assertion over the regulatory elements in the gene promoter sequence and expression profile is the set of expression profiles of the matching genes.
Fig. 1 Example of a rule search trace. Rule refinements that result in a significant increase in gene expression coherence (check mark) are explored further. Search along unpromising branches is stopped (cross).
2.1 Descriptive Language for Assertions on Promoter Structure RBC discovers rules that contain assertions/conditions on the structure of the promoter region that include the presence of binding sites, the distance of the binding sites from the transcription and translation start site (ATG), the distance between binding sites, and the orientation of binding sites. We have devised a simple language to represent these assertions. For instance, the expression “S1” says that site S1 (in whichever orientation) must be present in the promoter, and the expression “S1-@-d1(ref:S2)” asserts that both sites S1 and S2 should be present in the promoter region such that S1, in the nonsense direction, appears d1 nucleotides upstream of S2. The proposed description language is not unambiguous: the same promoter structure may often be described in several different ways. For example, any of the following rules may describe the same structure: “S1+@-d1(ref:ATG) and S2-@-d2(ref:S1),” “S2-@-d3(ref:ATG) and S1+@-d2(ref:S2),” and “S1+@-d1(ref:ATG) and S2-@-d3(ref:ATG)”. All three descriptions require sites S1 and S2 to be oriented in the sense and nonsense directions, respectively. The first rule requires site S1 to be positioned at distance d1 from the reference ATG (translation start site) and the position of S2 to be specified relative to S1. According to the second rule, the position of S1 is relative to the absolutely positioned S2 at distance d3 from ATG. The third rule defines the position of both sites relative to ATG. In such cases, the RBC algorithm will return only one of the semantically equivalent descriptions, depending on the order in which they were found in the heuristic search.
Fig. 2 Outline of the RBC algorithm
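As an illustration of how such assertions could be represented and matched in practice, here is a minimal sketch; the data structures and helper names are our own assumptions and do not reproduce the authors’ implementation.

```python
# Illustrative (not the authors') encoding of a promoter-structure condition.
# A gene's promoter is summarized as a list of motif hits; each term of a rule
# condition constrains a site's identity, an optional orientation, and an
# optional distance window relative to ATG or to another site.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Hit:
    site: str          # motif identifier, e.g. "S1"
    position: int      # position relative to ATG (negative = upstream)
    strand: str        # "+" (sense) or "-" (nonsense)

@dataclass
class Term:
    site: str
    strand: Optional[str] = None              # None = either orientation
    ref: Optional[str] = None                 # "ATG" or another site identifier
    window: Optional[Tuple[int, int]] = None  # allowed offset range w.r.t. ref

def satisfies(term: Term, hits: List[Hit]) -> bool:
    """Check a single term of a rule condition against a promoter's motif hits."""
    for h in hits:
        if h.site != term.site:
            continue
        if term.strand and h.strand != term.strand:
            continue
        if term.ref is None:
            return True                       # presence (and orientation) suffice
        ref_pos = 0 if term.ref == "ATG" else next(
            (r.position for r in hits if r.site == term.ref), None)
        if ref_pos is None:
            continue
        lo, hi = term.window
        if lo <= h.position - ref_pos <= hi:
            return True
    return False

def matches(condition: List[Term], hits: List[Hit]) -> bool:
    """A condition is a conjunction of terms (the IF-part of an RBC-style rule)."""
    return all(satisfies(t, hits) for t in condition)

# Toy promoter: S1 on the sense strand 90 bp upstream of ATG, S2 further upstream.
promoter = [Hit("S1", -90, "+"), Hit("S2", -140, "-")]
# Roughly "S1+@-100..-80(ref:ATG) and S2-@-60..-40(ref:S1)"
condition = [Term("S1", "+", "ATG", (-100, -80)),
             Term("S2", "-", "S1", (-60, -40))]
print(matches(condition, promoter))   # True
```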
2.2 RBC Algorithm The proposed algorithm is outlined in Figure 2. As its input it requires data on gene expression profiles Pall and data on promoter elements in the corresponding gene-regulatory regions. The algorithm returns a list of inferred rules of the form R = (C, P) with condition on the promoter structure C contained in genes with similar gene expression profiles P. RBC uses a beam-search approach (lines 3–12) followed by two post-processing steps
(lines 13 and 14 of the algorithm). Beam is a list of at most L currently inferred rules considered for further refinement that are ordered according to their associated scores (see below). Parameter L is a user-defined parameter (with a default value of 1000) that affects the scope of the search and thus the runtime. At the start of the search Beam is initialized with a rule “IF True THEN Pall” that covers all genes under consideration. In every iteration of the main loop (lines 3 to 12), the search focuses on the best-scored rule R = (C, P) from Beam and considers all possible single-term extensions of its condition C, which are allowed by the given descriptive language. Each such refinement results in a new candidate rule, which is added
into the list of Candidates (line 6). The refinements include adding terms that assert the presence of a site, the presence of a site with a given orientation, or the presence of a site (with or without orientation information) at a relative distance from a specific landmark (another site or the start of the gene). Refined rules are then represented in a simplified form. For instance, adding a single-site presence condition S1 to the initial rule “(True, Pall)” yields a rule “True and S1” which is simplified to its logical equivalent “S1”. Adding a term with the same site but nonsense orientation to the latter yields the rule “S1 and S1-” which is simplified to “S1-”. Similarly, adding a term with the same site but with information on a distance of 100 to 80 nucleo-
tides to the ATG may result in a rule such as “S1@-100..-80(ref:ATG)”. Requirements on other binding sites may be added, either simply by requiring their presence (e.g., rule “S1 and S2”) or by adding them as a reference to the presently included sites in conditions (e.g., “S1@-100..-80(ref:S2)”). Only candidate rules matching at least N genes are retained, where N is a user-defined parameter with a default value of six. Candidate rules are then compared to their (non-refined) parent rule based on the intra-cluster pairwise gene expression profile distances of the covered genes. To identify co-expressed genes, the algorithm uses Pearson correlation as a default distance measure, which, when computing the distance between two genes, ignores experiments where the expression of either gene is missing. The user can replace it with any other type of distance function that suits the particular type of expression profiles or the biological question addressed. For a set of candidate rules, only those with a significant reduction of this distance are retained in the list of Candidates (line 7). This decrease of variance in the intra-cluster pairwise distances is tested using the F-test statistic:
F = (SS_R / (n_R – 1)) / (SS_Candidate / (n_Candidate – 1)), where SS_R and SS_Candidate are the sums of squared differences from the mean inside the cluster of genes covered by the parent rule R and by a refined Candidate rule, respectively, and the values n_R and n_Candidate are the total numbers of genes in each of the two clusters. A p-value is calculated from the F score and used to determine the significance of change (the threshold, αF, defaults to 0.05). Figure 1 shows an example of explored refinements during rule search that may lead to the identification of profile-coherent gene clusters. The resulting refined rules stored in the Candidates list are added to Beam (line 9), which retains at most L best-scored rules (line 10). Because the goal is to discover the most homogeneous clusters, each rule is scored according to the potential coherence of its corresponding sub-cluster potentially obtained after the refinement of the rule. Potential coherence estimates how promising the cluster is in terms of finding a good subset of genes.
While examining all subgroups of genes in the cluster would be an option, such an estimate is computationally expensive because of the potentially large number of subgroups. Instead, we define the potential coherence of a cluster as the average of the k · N · (k · N – 1)/2 minimal pairwise profile distances. This in a way approximates a choice of a subset with the k · N most similar genes. If the cluster being estimated contains less than k · N genes, its estimated potential equals the average of all pairwise gene distances. Rules for which the above procedure finds no suitable refinements and whose intra-cluster pairwise distance is below a user-defined threshold D are added to Rules, the list that stores the terminal rules discovered by the RBC algorithm (line 12). Note that the process of taking the best-scored rule from the Beam, refining it, and adding newly found rules (if any) with improved intra-cluster profile distances is repeated until Beam is left empty. To further reduce the potentially large number of rules found by the beam search, RBC uses two post-processing steps (lines 13 and 14). RBC may infer rules that describe exactly the same cluster of genes. Each such rule set is considered individually, with the aim of retaining only the most general rules from it. That is, for each pair of rules with conditions C1 and C2, only the first rule from the pair is retained in the rule set if its condition C1 subsumes condition C2, that is, it covers the same genes but is more general in terms of logic. For instance, condition “S1” subsumes condition “S1 and S2”. The remaining list of Rules is further filtered by keeping only the most coherent rules so that on average no more than a limited number of rules describe any gene (parameter M set by the user, default is five). The final set of rules is formed by selecting the rules with the lowest intra-cluster distance first, and adding them to the final set only if their inclusion does not increase the rule-coverage for any gene beyond M. As an alternative to considering all the genes in its input data, RBC can additionally use information on a set of target genes on which the user wants to focus the analysis. Typically, target genes would comprise a subgroup of similarly annotated genes, or a subset of differentially expressed genes. If a target set is given, discovered rules are included in Beam
and in the final set only if they cover at least N target genes. Because the algorithm starts with one rule (line 1), which describes all genes, the discovered rules can cover genes outside the target set. The method is thus able to identify genes that were initially left out of a target set but should have been included based on their regulatory content and gene expression. The proposed rule-based clustering method was inspired by the beam-search procedure successfully used in the well-known supervised machine learning algorithm CN2 [15], and by the unsupervised approach of clustering trees developed by Blockeel et al. [16], but is in its implementation and application substantially different from both. CN2 infers rules that relate attribute-value-based descriptions of objects to their discrete class, while clustering trees identify attribute-value-based descriptions of non-overlapping clusters of similar objects. RBC combines both approaches by using a beam search to infer symbolic descriptions of potentially overlapping clusters of similarly regulated genes. Compared to beam search in CN2, where the size of the beam is relatively small (10–20 best rules are most often considered for further refinements), RBC uses a much wider beam but also generates potentially overlapping rules in a single loop. In contrast, in CN2, only the best-found rule is retained, the objects covered by it are removed from the data, and the procedure is restarted until no objects to be explored remain. Similar to CN2, the essence of our algorithm is rule refinement, for which, in the area of machine learning, the beam search proved to be an appropriate heuristic method.
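The following heavily simplified sketch shows how the beam-search loop of Figure 2 could look in code; the parameter defaults follow the text, but the helper callbacks (refinements, covered), the exact form of the F test, and the scoring details are our assumptions, and post-processing, the target-set option, and the descriptive language itself are omitted.

```python
# Highly simplified sketch of the RBC beam-search skeleton (cf. Fig. 2).
# `refinements(cond)` should yield one-term extensions of a condition and
# `covered(cond)` the set of genes whose promoters match it; both stand in
# for the machinery described in Sections 2.1 and 2.2.
from itertools import combinations
from statistics import mean
from scipy.stats import f as f_dist

L_BEAM, N_MIN, K, ALPHA_F, D_MAX = 1000, 6, 2, 0.05, 0.5   # defaults as we read them

def sum_sq(dists):
    m = mean(dists)
    return sum((d - m) ** 2 for d in dists)

def significant_reduction(parent_d, child_d, n_parent, n_child):
    """Variance-ratio (F) test on intra-cluster pairwise distances; gene counts
    are used as degrees of freedom, following our reading of the text."""
    ss_c = sum_sq(child_d)
    if ss_c == 0:
        return True
    F = (sum_sq(parent_d) / (n_parent - 1)) / (ss_c / (n_child - 1))
    return f_dist.sf(F, n_parent - 1, n_child - 1) < ALPHA_F

def potential_coherence(dists, n_genes):
    """Average of the k*N*(k*N-1)/2 smallest pairwise distances, or of all
    distances for small clusters, as described in Section 2.2."""
    m = K * N_MIN * (K * N_MIN - 1) // 2
    return mean(sorted(dists)[:m]) if n_genes >= K * N_MIN else mean(dists)

def rbc(genes, distance, refinements, covered):
    def dists(gs):
        return [distance(a, b) for a, b in combinations(sorted(gs), 2)]
    def score(rule):
        cond, gs = rule
        return potential_coherence(dists(gs), len(gs))
    beam = [("True", set(genes))]            # initial rule covering all genes
    rules = []
    while beam:
        beam.sort(key=score)
        cond, genes_r = beam.pop(0)          # refine the best-scored rule first
        candidates = []
        for new_cond in refinements(cond):
            genes_c = covered(new_cond) & genes_r
            if len(genes_c) < N_MIN:
                continue
            if significant_reduction(dists(genes_r), dists(genes_c),
                                     len(genes_r), len(genes_c)):
                candidates.append((new_cond, genes_c))
        if not candidates and mean(dists(genes_r)) < D_MAX:
            rules.append((cond, genes_r))    # terminal rule
        beam = sorted(beam + candidates, key=score)[:L_BEAM]
    return rules
```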
3. A Case Study and Experimental Validation We applied the proposed RBC method to data from a microarray transcription profiling study where budding yeast S. cerevisiae cells were induced to proliferate peroxisomes – organelles that compartmentalize several oxidative reactions – due to the cell’s regulated response to the exposure to oleic fatty acid (oleate) and to the absence of glucose, which causes peroxisome repression [17]. The transcriptional profile of each gene consists of six microarray measurements on the oleate induction time course, and two measurements in “oleate vs. glucose” and “glucose vs. glycerol” growth conditions. In total, gene pairwise distance was calculated on gene expression profiles consisting of eight microarray measurements. We defined the pairwise distance function to be 1.0 – r, where r is the Pearson correlation between two gene profiles.
Fig. 3 a) Gene network, where we connect genes from the same rule, for the peroxisome data set (target genes in gray, genes outside the target set in black). It includes 114 target genes and 7 outside genes, which are clustered in six major groups. b) Group graph of the discovered 37 clusters (two groups are connected if sharing a subset of genes). c and d) Inferred promoter structure and gene expression of the two sub-clusters forming the eight-gene cluster, marked “1” in Figure 3a (also shown as clusters “group 37” and “group 34” in Fig. 3b).
For the target group we selected a set of 224 genes identified by the study to have similar expression profiles to those of genes involved in peroxisome biogenesis and peroxisome function. The goal of our analysis was to further divide the target group into smaller subgroups of genes with common promoter structure and possibly identify genes that were
inadvertently left out of the target group but should have been included based on their expression and promoter structure similarity. We analyzed data on 2135 putative binding sites which were identified using the local sequence alignment software tool MEME [7]. We searched for the presence of these binding sites in 1 Kb promoter regions taken up-
stream from the translation start site (ATG) for ~6700 genes. The search identified ~302,000 matches of putative binding sites that were then used to infer rules with RBC. The algorithm was run with the default values of parameters. Distances between binding sites were rounded to increments of 40 bases; the maximum possible range of 2 Kb (for the given promoter length, relative distances can be from –1 Kb to +1 Kb) was thus reduced to
50 different values. This largely reduced the number of possible subintervals that needed to be considered during rule inference. The search returned 41 rules that described and divided 114 target genes (51% of target genes) into 37 subgroups (see Fig. 3b). No rule could be found to describe the remaining 110 target genes. Most of the discovered gene groups are composed of five genes with high pairwise intra-group correlation (above 0.927). Many genes are shared (overlap) between the 37 discovered groups, resulting in six major gene groups visible in Figures 3a and 3b. Seven genes outside the target set were also identified by the method (marked in black in Fig. 3a). For example, the smallest eight-gene group in the top-left corner in Figure 3a includes two outsiders (INP53 and YIL168W – also named SDL1). Gene ontology annotation shows that INP53 is involved together with two target genes (ATP3 and VHS1) in the biological process phosphate metabolism. Gene SDL1 is annotated to function together with the group’s target gene LYS14 in the biological process amino acid metabolism and other similar parent GO terms (results not shown). Details on the promoter structure and gene expression are given in Figures 3c and 3d. These examples confirm the method’s ability to identify functionally related genes that were not initially included in the target set. The majority of the discovered rules in the case study include conditions that are composed of three terms, describing the binding site’s orientation and distance relative to ATG or other binding sites. There is no general binding site that would appear in many rules; only two rules include the same binding site (results not shown). Exhaustive search of even relatively simple rules can quickly grow into a prohibitively hard problem due to combinatorial explosion. Exhaustive search for all possible rules composed of three binding sites with defined orientation (three possible values: positive, negative, no preference) and distance (the distance range is reduced to 50 different values) would, for this case study, require checking a huge number of rules – over 10^15. Our method checked 2.11 × 10^9 of the most promising rules, or less than 0.00004% of the entire three-term rule space. The search took 40 minutes on a Pentium 4, 3.4 GHz workstation. This demonstrates RBC’s ability to efficiently derive potentially complex rules within a reasonable time frame. To evaluate the predictive ability of the approach we used a data set on 1364 S. cerevisiae genes that includes accurate binding site data for 83 transcription factors [18]. We modeled the regulatory region spanning from –800 bp to 0 bp relative to ATG. Pairwise gene distance was calculated as the average pairwise distance across 19 gene expression microarray studies available at SGD’s Expression Connection data base (http://www.yeastgenome.org/). All genes were considered to be target genes. Fivefold cross-validation was used that randomly splits genes into five sets. Clustering and testing of the inferred rules was repeated five times, each time with a different set of genes for validation of a model constructed using the remaining four sets. Each discovered rule was tested on genes in the test set. If a rule matched the promoter region of a test gene, we calculated the prediction error as the distance between the true gene expression of the test gene and its predicted expression. When more than one rule could be applied to predict the expression of a test gene, the average prediction error was returned for that gene. Overall, the method successfully predicted the expression of 286 genes (21% of all genes considered), with an average cross-validation prediction error of 0.75. If we were to use “random” rules, which would randomly cluster genes into groups of the same size as those formed by the inferred rules, we could expect a prediction error of 0.96. We believe that the achieved prediction error is a good indication of the predictive quality of the inferred rules.
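A sketch of the evaluation protocol just described is given below; `induce_rules` and `rule_matches` are placeholders for the RBC machinery, and predicting a test gene’s profile as the mean profile of a rule’s training genes is our assumption about how predictions are formed.

```python
# Sketch of the cross-validated evaluation described above: expression distance
# is 1 - Pearson correlation, and a test gene's prediction error is the distance
# between its true profile and the profile predicted by the rules matching it.
import numpy as np
from sklearn.model_selection import KFold

def expr_distance(p1, p2):
    """1.0 - Pearson correlation between two expression profiles."""
    return 1.0 - np.corrcoef(p1, p2)[0, 1]

def predicted_profile(rule, profiles):
    """Predict with the mean profile of the training genes covered by the rule."""
    return np.mean([profiles[g] for g in rule["genes"]], axis=0)

def cross_validated_error(genes, profiles, induce_rules, rule_matches, folds=5):
    errors = []
    kf = KFold(n_splits=folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(genes):
        rules = induce_rules([genes[i] for i in train_idx])   # RBC on training genes
        for i in test_idx:
            gene = genes[i]
            matched = [r for r in rules if rule_matches(r, gene)]
            if not matched:
                continue                                      # gene left unpredicted
            errs = [expr_distance(profiles[gene], predicted_profile(r, profiles))
                    for r in matched]
            errors.append(np.mean(errs))      # average error over matching rules
    return np.mean(errors) if errors else None
```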
4. Conclusion The proposed rule-based clustering method can efficiently find rules of gene regulation by searching for groups of similarly expressed genes with a similar structure of the regulatory region. Starting from a target set of genes of interest, the method was able to cluster them into subgroups. Concurrently, RBC may expand the target set by identifying other similarly regulated genes that were initially overlooked by the user. Rule search is guided and made efficient by the proposed search heuristics. An important feature of RBC is its ability to discover overlapping groups of genes, potentially indicating common regulation or function. The algorithm uses a number of parameters that essentially determine the size of the search space being examined. The default values provided with the algorithm were set according to particular characteristics of the domain (e.g., about 10,000 genes, a small subset of genes sharing some motif pattern, most known patterns include from one to five motifs [19]). The choice of parameters also affects the run time, and the defaults were chosen to make implementation practical and to infer the rules within one hour of computational time on a standard personal computer. We have experimentally confirmed the ability of the RBC algorithm with default settings to infer rules that describe a complex regulatory structure and which can be used to reliably predict gene expression from regulatory content. In contrast with other contemporary methods that mainly use information on the presence of binding sites, a principal novelty of our approach is the use of a rich descriptive language to model the promoter structure. The language can be easily extended to accommodate other descriptive features, such as chromatin structure, when such kinds of data become available on a genome-wide scale. To summarize and display the findings of the analysis at different levels of abstraction we have applied different visualizations, which proved useful for understanding and biological interpretation. We believe that the main application of RBC is an exploratory search for additional evidence that genes, in theoretically or experimentally defined groups, actually share a common regulatory
mechanism. The biologist can then gain insight by looking at the presented evidence and can better decide which inferred patterns are worth testing in the laboratory.
Acknowledgments This work was supported in part by Program and Project grants from the Slovenian Research Agency (P2-0209, J2-9699, P1-0207) and by a grant from the National Institute of Child Health and Human Development (P01-HD39691).
References 1. Bellazzi R, Zupan B. Intelligent data analysis – special issue. Methods Inf Med 2001; 40 (5): 362–364. 2. Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, et al. A genomic code for nucleosome positioning. Nature 2006; 442 (7104): 772–778.
3. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004; 5 (4): 276–287. 4. Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell 2004; 117 (2): 185–198. 5. Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004; 22 (11): 1467–1473. 6. Wingender E, Dietze P, Karas H, Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996; 24 (1): 238–241. 7. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994; 2: 28–36. 8. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005; 23 (1): 137–144. 9. Down TA, Bergman CM, Su J, Hubbard TJ. Largescale discovery of promoter motifs in Drosophila melanogaster. PLoS Comput Biol 2007; 3 (1): e7. 10. Bolshakova N, Azuaje F. Estimating the number of clusters in DNA microarray data. Methods Inf Med 2006; 45 (2): 153–157. 11. Rahnenfuhrer J. Clustering algorithms and other exploratory methods for microarray data analysis. Methods Inf Med 2005; 44 (3): 444–448.
12. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet 2002; 31 (4): 370–377. 13. Chiang DY, Brown PO, Eisen MB. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics 2001; 17 (Suppl 1): S49–55. 14. Pilpel Y, Sudarsanam P, Church GM. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 2001; 29 (2): 153–159. 15. Clark P, Niblett T. The CN2 induction algorithm. Machine Learning 1989; 3 (4): 261–283. 16. Blockeel H, De Raedt L, Ramon J. Top-down induction of clustering trees. Machine Learning 1998. 17. Smith JJ, Marelli M, Christmas RH, Vizeacoumar FJ, Dilworth DJ, Ideker T, et al. Transcriptome profiling to identify genes involved in peroxisome assembly and function. J Cell Biol 2002; 158 (2): 259–271. 18. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 2006; 7: 113. 19. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, et al. Transcriptional regulatory code of a eukaryotic genome. Nature 2004; 431 (7004): 99–104.
Estimation of Distribution Algorithms as Logistic Regression Regularizers of Microarray Classifiers
C. Bielza1; V. Robles2; P. Larrañaga1
1Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Spain; 2Departamento de Arquitectura y Tecnología de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain
Keywords Logistic regression, regularization, estimation of distribution algorithms, DNA microarrays
Summary Objectives: The “large k (genes), small N (samples)” phenomenon complicates the problem of microarray classification with logistic regression. The indeterminacy of the maximum likelihood solutions, multicollinearity of predictor variables and data over-fitting cause unstable parameter estimates. Moreover, computational problems arise due to the large number of predictor variables (genes). Regularized logistic regression excels as a solution. However, the difficulties found here involve an objective function that is hard to optimize from a mathematical viewpoint and regularization parameters that require careful tuning. Methods: Those difficulties are tackled by introducing a new way of regularizing the logistic regression. Estimation of distribution algorithms (EDAs), a kind of evolutionary algorithm, emerge as natural regularizers. Obtaining the regularized estimates of the logistic classifier amounts to maximizing the likelihood function via our EDA, without it having to be penalized. Likelihood penalties add a number of difficulties to the resulting optimization problems, which vanish in our case. Simulation of new estimates during the evolutionary process of EDAs is performed in such a way that guarantees their shrinkage while maintaining the probabilistic dependence relationships learnt. The EDA process is embedded in an adapted recursive feature elimination procedure, thereby providing the genes that are the best markers for the classification. Results: The consistency with the literature and excellent classification performance achieved with our algorithm are illustrated on four microarray data sets: Breast, Colon, Leukemia and Prostate. Details on the last two data sets are available as supplementary material. Conclusions: We have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during the EDA evolution process while optimizing the usual likelihood function. The approach is combined with a gene subset selection procedure and automatically tunes the required parameters. Empirical results on microarray data sets provide sparse models with confirmed genes that perform better in classification than other competing regularized methods.
Correspondence to: Concha Bielza, Facultad de Informática, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain. E-mail: [email protected]
Methods Inf Med 2009; 48: 236–241; doi: 10.3414/ME9223; prepublished: March 31, 2009
1. Introduction

The development of DNA microarray technology allows screening of gene expression levels from different tissue samples (e.g. cancerous and normal). The resulting gene expression data help explore gene interactions, discover gene functions and classify individual cancerous/normal samples, using different supervised learning techniques [1, 2]. Among these techniques, logistic regression [3] is widely used because it provides explicit probabilities of class membership and interpretable regression coefficients for the predictor variables, and it avoids assumptions of Gaussianity or of a particular correlation structure. Microarray classification is a challenging task since these data typically involve extremely high dimensionality (thousands of genes) and small sample sizes (fewer than one hundred cases). This is the so-called "large k (variables), small N (samples)" problem, or the "curse of dimensionality". It may cause a number of statistical problems when estimating parameters. First, a large number of parameters have to be estimated using a very small number of samples; therefore, an infinite number of solutions is possible, as the problem is underdetermined. Second, multicollinearity is pervasive: the likelihood of some gene profiles being linear combinations of other gene profiles grows as more and more variables are introduced into the model, thereby supplying no new information. Third, over-fitting may occur, i.e. the model may fit the training data well but perform badly on new samples. These problems yield unstable parameter estimates. Furthermore, there are also computational problems due to the large number of predictor variables. Traditional numerical algorithms for finding the estimates, like Newton-Raphson's method [4], require prohibitive computations
to invert a huge, sometimes singular, matrix at each iteration. To alleviate this situation within the context of logistic regression, many authors use techniques of dimensionality reduction and feature (or variable) selection [5]. Feature selection methods yield parsimonious models which reduce information costs, are easier to explain and understand, and increase model applicability and robustness. The goodness of a proposed gene subset may be assessed via an initial screening process where genes are selected in terms of some univariate or multivariate scoring metric (filter approach [6]). By contrast, wrapper approaches search for good gene subsets using the classifier itself as part of their function evaluation [7]: a performance estimate of the classifier trained with each subset assesses the merit of that subset. Imposing a penalty on the size of the logistic regression coefficients is a different solution. Finding a maximum likelihood estimate subject to spherical restrictions on the logistic regression parameters leads to ridge or quadratic (penalized) logistic regression [8]. Therefore, the ridge estimator is a restricted maximum likelihood estimator (MLE). Shrinking the coefficients towards zero and allowing a little bias provide more stable estimates with smaller variance. Apart from ridge penalization, there are other penalties within the more general framework of regularization methods. All of them aim at balancing the fit to the data and the stability of the estimates. These methods are much more efficient computationally than wrapper methods, with similar performance. Furthermore, regularization methods are more continuous than the usual discrete processes of retaining or discarding features, and therefore do not suffer as much from high variability. Here we introduce estimation of distribution algorithms (EDAs) as natural regularizers within the logistic regression context. EDAs are a recent optimization heuristic included in the class of stochastic population-based search methods [9]. EDAs work by constructing an explicit probability model from a set of selected solutions, which is then conveniently used to generate new promising solutions in the next iteration of the evolutionary process. An optimization heuristic is an appropriate tool since shaping the logistic
classifier means estimating its parameters, which in turn entails solving a maximization problem. Unlike traditional numerical methods, EDAs do not require derivative information or matrix inversions. Moreover, with penalized likelihoods used as fitness functions, EDAs could similarly tackle the k >> N problem. This would just reveal the potential of a heuristic (EDA) against a numerical (Newton-Raphson) method. In this paper we will show that the EDA framework is so general that, under certain parameterizations, it obtains the regularized estimates in a natural way, without penalizing the original likelihood. EDAs receive the unrestricted likelihood equations as inputs and they generate the restricted MLEs as outputs.
2. Methods

2.1 Logistic Regression for Microarray Data

Assume we have a (training) data set DN of N independent samples from microarray experiments, DN = {(cj, xj1, ..., xjk), j = 1, ..., N}, where xj = (xj1, ..., xjk)t ∈ Rk is the gene expression profile of the j-th sample, xji indicates the i-th gene expression level of the j-th sample and cj is the known class label of the j-th sample, 0 or 1, for the different states. We assume the expression profile x to be preprocessed, log-transformed and standardized to zero mean and unit variance across genes. Let πj, j = 1, ..., N, denote P(C = 1 | xj), i.e. the conditional probability of belonging to class state 1 given gene expression profile xj. Then the logistic regression model is defined as

πj = exp(ηj)/(1 + exp(ηj)), with ηj = β0 + β1 xj1 + ... + βk xjk,   (1)

where β = (β0, β1, ..., βk)t denotes the vector of regression coefficients including the intercept β0. From DN, the log-likelihood function is built as

l(β) = Σj=1..N [cj log πj + (1 – cj) log(1 – πj)],   (2)

where πj is given by (1). MLEs are obtained by maximizing l with respect to β. Let the MLE be the maximizer of l. Newton-Raphson's algorithm is traditionally used to solve the resulting nonlinear equations. Other methods [10] are gradient ascent, coordinate ascent, conjugate gradient ascent, fixed-Hessian Newton, quasi-Newton algorithms (DFP and BFGS), iterative scaling, Nelder-Mead and random integration.

2.2 Regularized Approaches to Logistic Regression

Ridge logistic regression seeks MLEs subject to spherical restrictions on the parameters. Therefore, the function to be maximized is the penalized log-likelihood given by

l*(β) = l(β) – λ Σi=1..k βi²,   (3)

where λ > 0 is the penalty parameter and controls the amount of shrinkage. λ is usually chosen by cross-validation; the cross-validation deviance, error, BIC or AIC are used as the criteria to be optimized. Let the ridge estimator be the maximizer of Equation 3. This estimator always exists and is unique. In the field of microarray classification, Newton-Raphson's algorithm may be employed, but it requires a matrix of dimension k + 1 to be inverted. Inverting huge matrices may be avoided to some extent with algorithms like the dual algorithm based on sequential minimal optimization [11] or SVD [12]. Combined with SVD, [13, 14] use a feature selection method called recursive feature elimination (RFE) [15] that iteratively removes the genes with the smallest absolute values of the estimated coefficients. Within a broader context, the log-likelihood can be penalized as

l(β) – λ Σi=1..k ψ(βi),

where ψ is a generic penalty function. The L1 penalty ψ(βi) = |βi| results in lasso, introduced by [16] in the context of logistic regression. In a Bayesian setting, the prior corresponding to this case is an independent Laplace (or double exponential) distribution for each βi. Cawley and Talbot [17] even model the penalty parameter λ with a Jeffreys' prior to eliminate this parameter by integrating it out analytically. Although the objective function is still concave in lasso (as in ridge regression), an added
computational problem is that this function is not differentiable. Generic methods for nondifferentiable concave problems, such as the ellipsoid method or subgradient methods, are usually very slow in practice. Faster methods have recently been investigated [18, 19]. Interest in lasso is growing because the L1 penalty encourages the estimators to be either significantly large or exactly zero, which has the effect of automatically performing feature selection and hence yields concise models.
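As a concrete point of reference for the two penalized formulations discussed above, the following minimal sketch (not the authors' implementation) fits ridge- and lasso-penalized logistic regression with scikit-learn on synthetic data shaped like a microarray problem; the data, settings and the parameter C (which plays the role of 1/λ) are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "large k, small N" data: 60 samples, 2000 genes (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2000))
c = rng.integers(0, 2, size=60)

# Ridge (L2) logistic regression: maximizes l(beta) - lambda * sum(beta_i^2), with C = 1/lambda.
ridge = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=2000).fit(X, c)

# Lasso (L1) logistic regression: the |beta_i| penalty drives many coefficients to exactly zero.
lasso = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, c)

print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
print("non-zero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
```

The sparsity of the lasso solution, visible in the count of non-zero coefficients, is exactly the automatic feature-selection effect mentioned above.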
2.3 EDAs for Regularizing Logistic Regression-based Microarray Classifiers

Among the stochastic population-based search methods, EDAs have recently emerged as a general framework that overcomes some weaknesses of other well-known methods like genetic algorithms [9]. Unlike genetic algorithms, EDAs avoid the ad hoc design of crossover and mutation operators, as well as the tuning of a large number of parameters, while they explicitly capture the relationships among the problem variables by means of a joint probability distribution (jpd). The main scheme underlying the EDA approach, which will be denoted Proc-EDA, is:
1. D0 ← Generate M points of the search space randomly
2. h = 1
3. do {
4.   DSeh–1 ← Select M′ < M points of the search space from Dh–1
5.   ph(z) = p(z | DSeh–1) ← Estimate the jpd from the selected points of the search space
6.   Dh ← Sample M points of the search space (the new population) from ph(z)
7. } until a stopping criterion is met
The M points of the search space that constitute the initial population are generated at random. All of them are evaluated by means of a fitness function (step 1). Then, M′ (M′ < M) points are selected according to a selection method, taking the fitness function into account (step 4). Next, a multidimensional probabilistic model that reflects the interdependencies between the encoded variables in these M′ selected points is induced (step 5). The estimation of this underlying jpd represents the EDA
bottleneck, as different degrees of complexity in the dependencies can be considered. In the next step, M new points of the search space – the new population – are obtained by sampling from the multidimensional probabilistic model learnt in the previous step (step 6). Steps 4 to 6 are repeated until some pre-defined stopping condition is met (step 7). Like other numerical methods mentioned above, such as Nelder-Mead's, EDAs work by simply evaluating the objective function at some points. However, Nelder-Mead's algorithm is deterministic and evaluates the vertices of a simplex, while EDAs are stochastic and require a population and the learning and simulation of models. If we confine ourselves to logistic regression classifiers, EDAs have been used for estimating the parameters from a multiobjective viewpoint [20]. EDAs could be successfully used to optimize any kind of penalized likelihood because, unlike traditional numerical methods, they do not require derivative information or matrix inversions. However, we investigate here a more interesting approach that shows that EDAs can act as an intrinsic regularizer if we choose a suitable representation. Thus, let us take l(β) (see Eq. 2) as the fitness function that assesses each possible solution β to the (unrestricted) maximum likelihood problem. β is a (k + 1)-dimensional continuous random variable. EDAs would start by randomly generating the initial population D0 of M points of the search space. After selecting M′ points (e.g. the top M′), the core of the EDA paradigm is step 5 above, i.e. estimating the jpd from these selected M′ points. Without losing generality, we start from a univariate marginal distribution algorithm (UMDAcG) [21] in our continuous β-domain. UMDAcG assumes that at each generation h all variables are independent and normally distributed, i.e.
ph(β) = ∏i=0..k ph(βi) = ∏i=0..k fN(βi; μih, σih),   (4)

where fN(·; μih, σih) denotes a normal density with mean μih and standard deviation σih.
See [22] for the theoretical support of UMDAcG. We now modify UMDAcG to tackle regularized logistic regression by shrinking the βi parameters during the EDA simulation step. Specifically, we introduce a new algorithm, UMDAcG*, that learns a UMDAcG model given by (4) at step 5 and iteration h. This involves
estimating the new μih and σih with the MLEs computed on the selected set of M′ points of the search space from the previous generation. However, sampling at step 6 now generates points from (4) with the normal distributions ph(βi) constrained to lie in an interval [–bh, bh]. This is readily achieved by generating values from a Gaussian with parameters μih and σih for each variable βi and constraining its outputs, by a standard rejection method, to fall within [–bh, bh]. The idea is that, as the algorithm progresses, forcing the βi parameters to be in a bounded interval around 0 constrains and stabilizes their values, just like regularization does. At step 5, we learn, for the random variable β, the multivariate Gaussian distribution with a diagonal covariance matrix that best fits, in terms of likelihood, the M′ β-points that are top ranked in the objective function l(β). We then generate, at step 6, M new points from the previous distribution truncated at each coordinate at –bh (bottom) and at bh (top). New data are ranked with respect to their l(β) values, the best M′ are chosen, and so on. In spite of optimizing the function l(β) rather than a penalized log-likelihood function such as ridge regression's l*(β), the evolutionary process guarantees that the βi values belong to intervals of the desired size. Therefore, our estimates of βi are regularized estimates. In fact, we have empirically verified that the standard errors of our estimators are smaller than those of regularized approaches like ridge logistic regression and that they exhibit fewer outliers than lasso's. Moreover, since we use the original objective function l(β) of the logistic regression, we do not need to specify the λ parameter of other penalized approaches like (3). Note that plenty of probability models are possible in (4), without necessarily assuming all variables to be Gaussian and independent. Different univariate, bivariate or multivariate dependencies may be designed, with the benefit of having an explicit model of (possibly) complex probabilistic relationships among the different parameters. Traditional numerical methods are unable to provide this kind of information. Thus, the estimation of Gaussian network algorithm (EGNA) [21] models multivariate dependencies among the βi by learning at each generation a non-restricted normal density that maximizes the Bayesian information criterion (BIC) score.
Fig. 1 Number of genes in set S vs. accuracy (%) and vs. bhop for Breast and Colon data sets
In EGNA, ph(β) factorizes as a Gaussian network [23]. The rationale for this assumption is in part justified by the fact that the MLEs asymptotically follow a multivariate normal distribution. However, in our case the number of observations N is small and, as mentioned above, we do not have MLEs either, since our estimators are restricted MLEs. Finally, the last step, say at iteration h = T, would contain the final population DT, from which the point maximizing l(β) would be chosen as the final regularized estimate of β.
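A compact sketch of the UMDAcG* idea described in this section is given below, using the notation above (fitness l(β), population size M, M′ selected points, truncation interval [–b, b]); function and variable names are ours, not the authors', and the rejection step simply resamples coordinates that fall outside the interval.

```python
import numpy as np

def log_likelihood(beta, X, c):
    """Logistic log-likelihood l(beta); X already contains a leading column of ones."""
    eta = X @ beta
    return float(np.sum(c * eta - np.logaddexp(0.0, eta)))

def umda_c_star(X, c, b, M=100, M_sel=50, n_gen=50, seed=0):
    """Sketch of UMDAcG*: a univariate Gaussian EDA whose sampling step is truncated
    to [-b, b] by rejection, which implicitly shrinks the beta coefficients."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    pop = rng.uniform(-b, b, size=(M, dim))                 # initial population D_0
    for _ in range(n_gen):
        fitness = np.array([log_likelihood(beta, X, c) for beta in pop])
        selected = pop[np.argsort(fitness)[-M_sel:]]        # top M' points
        mu, sigma = selected.mean(axis=0), selected.std(axis=0) + 1e-9
        new_pop = rng.normal(mu, sigma, size=(M, dim))      # sample from the learnt model
        out = np.abs(new_pop) > b                           # rejection: resample out-of-range values
        while out.any():
            new_pop[out] = rng.normal(np.broadcast_to(mu, new_pop.shape)[out],
                                      np.broadcast_to(sigma, new_pop.shape)[out])
            out = np.abs(new_pop) > b
        pop = new_pop
    fitness = np.array([log_likelihood(beta, X, c) for beta in pop])
    return pop[np.argmax(fitness)]                          # regularized estimate of beta
```

With X the standardized expression matrix augmented with a column of ones and c the 0/1 class labels, a call such as umda_c_star(X, c, b=0.5) would return a coefficient vector whose entries are confined to [–0.5, 0.5].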
2.4 Gene Selection

Our EDA-based regularization is now embedded in a gene selection procedure. The procedure takes into account the strength of each gene i, given by its regression coefficient βi, and automatically searches for an optimal bh according to the classification accuracy of the associated regularized model. The general procedure, denoted Proc-gene, is:
1. For a subset of genes S, search for the bh of the EDA approach using the classification accuracy as the criterion. Let bhop be the optimal value.
2. With bhop fixed, eliminate a percentage of the genes with the smallest βi² values. Let S be the new (smaller) set of genes.
3. Repeat steps 1 and 2 until there is only one gene left. An optimal subset of genes is finally derived.
Some remarks follow. In step 1, the subset S used to initialize the process may be chosen in different
ways. Basically, we can start with all the genes or we can use a filter approach to reduce the size of this subset. Since it is not clear which filter criterion to use, and different filter criteria may lead to different conclusions, we propose here a kind of consensus among different filter criteria. Thus, for four filters f1, f2, f3 and f4, if gene i is ranked first by f1, second by f2, third by f3 and fourth by f4, then its rank aggregation would be 1 + 2 + 3 + 4 = 10. The top-ranked genes according to this new agreement would be chosen. In our experiments we have used the following four filter criteria: 1) the BSS/WSS criterion (as in [24]), 2) the Pearson correlation coefficient with the class variable (as in [5, 25]), 3) a p-metric (as in [26]), and 4) a t-score. The search for the optimal bh for the EDA in step 1 amounts to running the EDA (Proc-EDA) several times (for different bh values) and measuring which of the fitted logistic regression models is the best. This is assessed by estimating the classifier's accuracy (percentage of correctly classified microarrays) as the generalization performance of the model.
Braga-Neto and Dougherty [27] proved the .632 bootstrap estimator to be a good overall estimator in small-sample microarray classification, and it was therefore the method chosen in this paper. In step 2 of Proc-gene, the EDA has already provided a fitted model (with the best bh value) and then a gene selection method inspired by RFE is carried out. As in [13, 14], we remove more than one feature at a time for computational reasons (the original RFE removes only one), based on the smallest βi² values, which indicate a lower relative importance within the gene subset.
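The interplay between step 1 (tuning bh) and step 2 (the adapted RFE) can be sketched as follows; fit_fn and accuracy_fn stand for the EDA fit and an accuracy estimate such as the .632 bootstrap, and both callables, like the function names, are assumptions rather than the authors' code.

```python
import numpy as np

def proc_gene(X, c, fit_fn, accuracy_fn, b_grid, drop_frac=0.10):
    """Sketch of Proc-gene: an RFE-like wrapper around the EDA-regularized classifier.
    fit_fn(X_sub, c, b) -> beta (intercept first); accuracy_fn(X_sub, c, b) -> estimated accuracy."""
    genes = list(range(X.shape[1]))
    best = (-np.inf, None, None)
    while genes:
        # Step 1: choose b_h on the current gene subset by estimated classification accuracy.
        acc, b_opt = max((accuracy_fn(X[:, genes], c, b), b) for b in b_grid)
        beta = fit_fn(X[:, genes], c, b_opt)
        if acc > best[0]:
            best = (acc, list(genes), beta)
        if len(genes) == 1:
            break
        # Step 2: discard the drop_frac of genes with the smallest beta_i^2 (intercept excluded).
        n_drop = max(1, int(drop_frac * len(genes)))
        keep = np.argsort(beta[1:] ** 2)[n_drop:]
        genes = [genes[i] for i in sorted(keep)]
    return best  # (estimated accuracy, selected gene indices, regularized coefficients)
```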
3. Results and Discussion

We illustrate how our approach really acts as a regularizer on some publicly available benchmark microarray data sets. First, the Breast data set [25], with 7129 genes and 49 tumor samples, 25 of them representing estrogen receptor-positive (ER+) and the other 24
being estrogen receptor-negative (ER–). Second, the Colon data set [28] contains 2000 genes for 62 tissue samples: 40 cancer tissues and 22 normal tissues. Other public data sets have also been studied: the Leukemia data set [29] and the Prostate cancer data set [30]; see the supplementary material on the web page (http://laurel.datsi.fi.upm.es/~vrobles/eda_lr_reg). We have developed our own implementation in C++ for the EDA-based regularized logistic regression (Proc-EDA) and in R for the gene selection method (Proc-gene) that calls the former. We tried two different EDA approaches: UMDAcG and EGNA. To run the EDAs we found that an initial population of at least M = 100 points and at least M′ = 50 selected points for learning guarantee robust β estimates. The relative change in the mean fitness value between successive generations was the criterion chosen for assessing the convergence of the Proc-EDA algorithm. As regards Proc-gene, we considered it reasonable to initialize it with 500 genes for the size of subset S. These were selected according to the aggregation of the four filter criteria as described above. Based on our experience, a good choice in the experiments for the number of bootstrap samples used for training was 100. The percentage of genes to be removed in step 2 was fixed at 10%. Figure 1a and Table 1 show the experimental results on the Breast data set.

Table 1 Selected top 7 genes with their β estimate for Breast
GenBank ID [ref.]         Description                                β
X87212_at [25]            H. sapiens mRNA for cathepsin C            –6.988
L26336_at [32]            Heat shock 70kDa protein 2                 6.980
L17131_ma1_at [16, 33]    Human high mobility group protein          –5.402
J03827_at                 Y box binding protein-1 mRNA               –3.549
S62539_at [34]            Insulin receptor substrate 1               3.419
HG4716-HT5158_at [35]     Guanosine 5’-monophosphate synthase        –2.685
U30827_s_at [25, 36]      Splicing factor, arginine/serine-rich 5    2.480
Table 2 Selected top 9 genes with their β estimate for Colon
GenBank ID [ref.]   Description                                                                β
T94579 [38]         Human chitotriosidase precursor mRNA, complete cds                         –0.500
D26129 [40]         Ribonuclease pancreatic precursor (human)                                  –0.500
T40578 [39]         Caldesmon 1                                                                –0.499
R80427 [38]         C4-dicarboxylate transport sensor protein dctb (Rhizobium leguminosarum)   –0.497
Z50753 [38]         H.sapiens mRNA for GCAP-II/uroguanylin precursor                           0.496
M76378 [38]         Human cysteine-rich protein (CRP) gene, exons 5 and 6                      0.494
H06061 [38]         Voltage-dependent anion-selective channel protein 1 (Homo sapiens)         0.485
H08393 [38]         Collagen alpha 2(XI) chain (Homo sapiens)                                  0.482
T62947 [38]         60S ribosomal protein L24 (Arabidopsis thaliana)                           –0.480

Since perfect classification (100%) is
achieved with many different gene subsets, we choose the subset with the fewest genes, i.e. the 7-gene model. Note how bhop, obtained at step 1 of procedure Proc-gene, varies as the number of selected genes changes due to the adapted RFE. Its minimum value is 0.5. Running times on an Intel Xeon 2GHz under Linux are quite acceptable: almost 3 minutes for 500 genes, 39 s for 250, between 2.5 and 5 s for 75–125 genes, and less than 2 s for 70 genes or fewer. The seven genes found to separate ER+ from ER– samples achieve a higher classification accuracy than other up-to-date regularized methods. Shevade and Keerthi [16] report an accuracy of 81.9% using logistic regression with an L1 penalty solved by the Gauss-Seidel method. They propose a different gene selection procedure and retain six genes, two of them also found by us (see below). Fort and Lambert-Lacroix [31] use a combination of PLS and ridge logistic regression to achieve an accuracy of about 87.5%. They perform a gene selection based on the BSS/WSS criterion, choosing some fixed number of genes (100, 500 or 1000), although they do not indicate which they are. Finally, a slightly different approach, followed in the original paper by West et al. [25], where a probit (binary) regression model is combined with stochastic regularization and SVDs, yields an 89.4% accuracy using 100 genes selected according to their Pearson correlation coefficient with the class variable. When our results are compared to the most popular regularization methods, lasso and ridge logistic
regressions only achieve 98.23% and 98.46% accuracies, respectively, using in both cases the same 500 selected genes provided by the aggregation of the four filter criteria. All of our seven selected genes have been linked with breast cancer, confirming the consistency of our results with the literature (see Table 1). Figure 1b and Table 2 show the results on the Colon data set. Classes are less well separated, yielding at most a 99.65% accuracy, for the 9-gene model. Running times are longer than before: almost 10 minutes for 500 genes, 1.5 minutes for 250, between 2 and 7 s for 60–125 genes, and less than 2 s for 55 genes or fewer. An analysis of the selected genes and the accuracy reported by other directly related methods is as follows. Shevade and Keerthi [16] achieve an accuracy of 82.3% with eight genes, three of them – Z50753, T62947 and H08393 – included in our list. Liu et al. [37] use logistic regression with an Lp penalty, where p = 0.1, and retain 12 genes. Genes Z50753, M76378 and H08393 of their list are also in ours. They do not compute the accuracy but the AUC (0.988), which in our case for the 9-gene model is better (0.9996). Using a ridge logistic regression approach, Shen and Tan [14] keep 16 genes with an RFE similar to ours and report a 99.3% accuracy, without mentioning the specific genes selected. When our results are compared to lasso and ridge logistic regressions, these only achieve 89.74% and 90.51% accuracies, respectively, both lower than our 99.65% accuracy. Our 9-gene list includes genes identified as relevant for colon cancer in the literature (see Table 2). See the supplementary material for details on the EGNA factorizations.
4. Conclusions

The high interest of combining regularization with a dimension-reduction step to enhance classifier efficiency has been pointed out elsewhere [31]. Combined with a gene subset selection procedure that adapts the RFE and automatically tunes the required parameters, we have introduced a novel EDA-based logistic regression regularizer. It shrinks the coefficients implicitly during the EDA evolution process while optimizing the usual likelihood function. The
empirical results on several microarray data sets have provided models with a low number of relevant genes, most of them confirmed by the literature, that perform better in classification than other competing regularized methods. Unlike the traditional procedures for finding the maximum likelihood βi parameters, the EDA approach is able to use any optimization objective, regardless of its complexity or the non-existence of an explicit formula for its expression. In this respect, our framework could find parameters that maximize the AUC objective (a difficult problem [41]), or it could be fitted to the search for the parameters of any regularized logistic regression. The inclusion of interaction terms among (possibly coregulated) genes in ηj of expression (1) would also be a feasible future direction to explore.
Acknowledgments The authors are grateful to the referees for their constructive comments. Work partially supported by the Spanish Ministry of Education and Science, projects TIN2007-62626, TIN2007-67148 and TIN2008-06815-C02 and Consolider Ingenio 2010-CSD200700018 and by the National Institutes of Health (USA), project 1 R01 LM009520-01.
References 1. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. Machine learning in bioinformatics. Briefings in Bioinformatics 2006; 17 (1): 86–112. 2. Dugas M, Weninger F, Merk S, Kohlmann A, Haferlach T. A generic concept for large-scale microarray analysis dedicated to medical diagnostics. Methods Inf Med 2006; 45 (2): 146–152. 3. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd edn. New York: J. Wiley and Sons; 2000. 4. Thisted RA. Elements of Statistical Computing. New York: Chapman and Hall; 1988. 5. Markowetz F, Spang R. Molecular diagnosis classification, model selection and performance evaluation. Methods Inf Med 2005; 44 (3): 438–443. 6. Weber G, Vinterbo S, Ohno-Machado L. Multivariate selection of genetic markers in diagnostic classification. Artif Intell Med 2004; 31: 155–167. 7. Heckerling PS, Gerber BS, Tape TG, Wigton R. Selection of predictor variables for pneumonia using neural networks and genetic algorithms. Methods Inf Med 2005; 44 (1): 89–97. 8. Lee A, Silvapulle M. Ridge estimation in logistic regression. Comm Statist Simulation Comput 1988; 17: 1231–1257.
9. Lozano JA, Larrañaga P, Inza I, Bengoetxea E (eds). Towards a New Evolutionary Computation. Advances in Estimation of Distribution Algorithms. New York: Springer; 2006. 10. Minka T. A comparison of numerical optimizers for logistic regression. Tech Rep 758, Carnegie Mellon University; 2003. 11. Keerthi SS, Duan KB, Shevade SK, Poo AN. A fast dual algorithm for kernel logistic regression. Mach Learning 2005; 61: 151–165. 12. Eilers P, Boer J, van Ommen G, van Houwelingen H. Classification of microarray data with penalized logistic regression. In: Proc of SPIE. Progress in Biomedical Optics and Images, 2001. Volume 4266 (2): 187–198. 13. Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics 2004; 5: 427–443. 14. Shen L, Tan EC. Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE Trans Comput Biol Bioinformatics 2005; 2: 166–175. 15. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learning 2002; 46: 389–422. 16. Shevade SK, Keerthi SS. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003; 19: 2246–2253. 17. Cawley GC, Talbot N. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics 2006; 22: 2348–2355. 18. Koh K, Kim SY, Boyd S. An interior-point method for large-scale L1-regularized logistic regression. J Mach Learn Res 2007; 8: 1519–1555. 19. Krishnapuram B, Carin L, Figueiredo M, Hartemink A. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Trans Pattern Anal Mach Intell 2005; 27: 957–968. 20. Robles V, Bielza C, Larrañaga P, González S, OhnoMachado L. Optimizing logistic regression coefficients for discrimination and calibration using estimation of distribution algorithms. TOP 2008; 16: 345–366. 21. Larrañaga P, Etxeberria R, Lozano JA, Peña JM. Optimization in continuous domains by learning and simulation of Gaussian networks. In: Workshop in Optimization by Building and Using Probabilistic Models. Genetic and Evolutionary Computation Conference, GECCO 2000. pp 201–204. 22. González C, Lozano JA, Larrañaga P. Mathematical modelling of UMDAc algorithm with tournament selection. Behaviour on linear and quadratic functions. Internat J Approx Reason 2002; 31: 313–340. 23. Shachter R, Kenley C. Gaussian influence diagrams. Manag Sci 1989; 35: 527–550. 24. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002; 97: 77–87. 25. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001; 98 (20): 11462–11467. 26. Inza I, Larrañaga P, Blanco R, Cerrolaza A. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 2004; 31: 91–103.
27. Braga-Neto UM, Dougherty ER. Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004; 20: 374–380. 28. Alon U et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide microarrays. Proc Natl Acad Sci USA 1999; 96: 6745–6750. 29. Golub TR et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1996; 286: 531–537. 30. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1: 203–209. 31. Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics 2005; 21: 1104–1111. 32. Rohde M, Daugaard M, Jensen MH, Helin K, Nylandsted J, Marja Jaattela M. Members of the heat-shock protein 70 family promote cancer cell growth by distinct mechanisms. Genes Dev 2005; 19: 570–582. 33. Chiappetta G, Botti G, Monaco M, Pasquinelli R, Pentimalli F, Di Bonito M, D’Aiuto G, Fedele M, Iuliano R, Palmieri EA, Pierantoni GM, Giancotti V, Fusco A. HMGA1 protein overexpression in human breast carcinomas: Correlation with ErbB2 expression. Clin Cancer Res 2004; 10: 7637–7644. 34. Sisci D, Morelli C, Garofalo C, Romeo F, Morabito L, Casaburi F, Middea E, Cascio S, Brunelli E, Ando S, Surmacz E. Expression of nuclear insulin receptor substrate 1 in breast cancer. J Clin Pathol 2007; 60: 633–641. 35. Turner GA, Ellis RD, Guthrie D, Latner AL, Monaghan JM, Ross WM, Skillen AW, Wilson RG. Urine cyclic nucleotide concentrations in cancer and other conditions; cyclic GMP: A potential marker for cancer treatment. J Clin Pathol 2004; 35 (8): 800–806. 36. Abba MC, Drake JA, Hawkins KA, Hu Y, Sun H, Notcovich C, Gaddis S, Sahin A, Baggerly K, Aldaz CM. Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression. Breast Cancer Res 2004; 6: 499–513. 37. Liu Z, Jiang F, Tian G, Wang S, Sato F, Meltzer SJ, Tan M. Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology 2007; 6: Article 6. 38. Furlanello C, Serafini M, Merler S, Jurman G. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinform 2003; 4: 54. 39. Gardina PJ. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 2006; 7: 325. 40. Lin YM, Furukawa Y, Tsunoda T, Yue CT, Yang KC, Nakamura Y. Molecular diagnosis of colorectal tumors by expression profiles of 50 genes expressed differentially in adenomas and carcinomas. Oncogene 2002; 21: 4120–4128. 41. Ma S, Huang J. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 2005; 21: 4356–4362.
© Schattauer 2009
Original Articles
Learning Susceptibility of a Pathogen to Antibiotics Using Data from Similar Pathogens
S. Andreassen1; A. Zalounina1; L. Leibovici2; M. Paul2
1Center for Model-based Medical Decision Support, Aalborg University, Aalborg, Denmark; 2Department of Medicine E, Rabin Medical Center, Beilinson Hospital, Petah-Tiqva, Israel
Keywords Antimicrobial susceptibility, Dirichlet estimator, Brier score, cross-validation
Summary
Objectives: Selection of empirical antibiotic therapy relies on knowledge of the in vitro susceptibilities of potential pathogens to antibiotics. In this paper the limitations of this knowledge are outlined and a method that can reduce some of the problems is developed.
Methods: We propose hierarchical Dirichlet learning for estimation of pathogen susceptibilities to antibiotics, using data from a group of similar pathogens in a bacteremia database.
Results: A threefold cross-validation showed that maximum likelihood (ML) estimates of susceptibilities based on individual pathogens gave a distance between estimates obtained from the training set and observed frequencies in the validation set of 16.3%. Estimates based on the initial grouping of pathogens gave a distance of 16.7%. Dirichlet learning gave a distance of 15.6%. Inspection of the pathogen groups led to subdivision of three groups, Citrobacter, Other Gram Negatives and Acinetobacter, out of 26 groups. Estimates based on the subdivided groups gave a distance of 15.4% and Dirichlet learning further reduced this to 15.0%. The optimal size of the imaginary sample inherited from the group was 3.
Conclusion: Dirichlet learning improved estimates of susceptibilities relative to ML estimators based on individual pathogens and to classical grouped estimators. The initial pathogen grouping was well founded and improvement by subdivision of the groups was only obtained in three groups. Dirichlet learning was robust to these revisions of the grouping, giving improved estimates in both cases, while the group-based estimates only gave improved estimates after the revision of the groups.
Correspondence to: Alina Zalounina, Center for Model-based Medical Decision Support, Aalborg University, Fredrik Bajers Vej 7, 9220 Aalborg, Denmark, E-mail: [email protected]
Methods Inf Med 2009; 48: 242–247
doi: 10.3414/ME9226
prepublished: April 20, 2009

1. Introduction

Antibiotic treatment of bacteremia relies on knowledge of the susceptibility of the infecting pathogen to antibiotics. At the onset of infection this information is typically not available and it must be assessed from prior knowledge of the probability that a given pathogen is susceptible to a given antibiotic. In practice there are limits to how well these probabilities can be known.
Susceptibilities of bacteria to antibiotics differ between hospitals and estimation of susceptibilities from databases of in vitro susceptibilities must therefore be based on local data. Even for a department of microbiology serving a large hospital or several smaller hospitals, the number of positive blood cultures, i.e. the number of times bacteria can be isolated from the blood, is unlikely to be much greater than about 1000 per year. This effectively limits the size of local databases because
susceptibilities change over time. If we for the purpose of this discussion assume that data older than three years should be used with caution, then the effective upper limit on the size of the database is about 3000 bacterial isolates, distributed over about 100 pathogens. This is further aggravated because the susceptibilities for community-acquired and hospital-acquired infections are substantially different and therefore must be estimated separately. It is difficult to set a threshold for how large the sample should be to make the classical maximum likelihood (ML) estimate useful. If we consider a pathogen that has an estimated susceptibility of 70% to an antibiotic, then the standard deviation (SD) of that estimate, calculated from the binomial distribution, is 9% for a sample size N = 25 and 5% for N = 100. So it is probably safe to conclude that the lower limit for useful estimates is somewhere between N = 25 and N = 100. This obviously leaves a large fraction of the pathogens without useful ML estimates. The simplest solution is to group the pathogens, assuming that all pathogens within a group have identical susceptibilities. This is a fairly strong assumption, and this paper will explore whether it is possible to find estimates of susceptibility that represent a middle ground between the two extremes mentioned above, i.e. either using estimates based on a single pathogen or using estimates based on a whole group of pathogens. Technically, the method will be based on hierarchical Dirichlet learning [1–4], which allows a systematic approach to strengthening sparse data with educated guesses. For example, for Proteus spp., which is one of seven members of the "Proteus group" of pathogens (see Table 2), an educated guess, in the absence of enough data, would be to assume that it resembles other members of the Proteus group in terms of susceptibility. The Dirichlet learning then provides a mechanism which allows the susceptibility estimates for Proteus spp. to deviate
from the susceptibilities of other bacteria belonging to the Proteus group, if and when data on the actual susceptibility of Proteus spp. to this antibiotic becomes available. The potential benefit of this idea will be evaluated by applying the proposed method to a bacteremia database and it will be assessed whether our method improves the estimate, relative to the ML estimate for single pathogens and the grouped estimate.
2. Materials and Methods
Table 1 A fragment of the bacteremia database showing 4 out of the 1556 isolates from hospital-acquired infections. Amongst other information, the database contains attributes (columns) specifying the name of the pathogen and the in vitro susceptibility (S = susceptible, R = resistant) to a total of 36 antibiotics, out of which only 3 are shown here.
Pathogen           1. tobramycin   2. piperacillin   3. gentamycin
…                  …               …                 …
Proteus spp.       R               S                 S
Proteus spp.       S               S                 S
Proteus spp.       S               S                 S
Proteus vulgaris   S               S                 S
…                  …               …                 …
2.1 The Bacteremia Database and ML Estimates

Prior probabilities used in the model were based on a bacteremia database collected at Rabin Medical Center, Beilinson Campus, in Israel during 2002–2004. The bacteremia database included 3350 patient- and episode-unique clinically significant isolates from blood cultures. We shall restrict our attention to the 1556 isolates from adults with hospital-acquired infections. These isolates were obtained from 76 different pathogens and each isolate was on average tested in vitro for susceptibility to 21 antibiotics (range 1–31) out of a total of 36 antibiotics. A fragment from the bacteremia database is shown in Table 1. The bacteremia database provides the counts of susceptibilities (Mij) and the number of isolates tested (Nij) belonging to each pathogen for a range of antibiotics. The index i identifies the antibiotic and j identifies the pathogen. For example, Table 2 shows the counts of susceptibility (M1j) and the number of isolates tested (N1j) for the antibiotic tobramycin (i = 1) and seven pathogens belonging to the Proteus group (j = 1, …, 7). Using these counts, ML estimates of susceptibility (MLij) were calculated. For example, the ML estimate for the susceptibility of Proteus spp. (j = 1) to tobramycin and its SD were obtained as ML11 = M11/N11 = 2/3 = 0.67 and SD = √(ML11(1 – ML11)/N11) = 0.27.
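The ML estimate and its binomial standard deviation follow directly from the counts; the short sketch below reproduces the Proteus spp./tobramycin example above (M = 2 susceptible out of N = 3 tested), with function names of our own choosing.

```python
import math

def ml_estimate(M, N):
    """Maximum likelihood estimate of susceptibility and its binomial standard deviation."""
    ml = M / N
    sd = math.sqrt(ml * (1 - ml) / N)
    return ml, sd

ml, sd = ml_estimate(2, 3)               # Proteus spp. vs. tobramycin from the database fragment
print(f"ML = {ml:.2f}, SD = {sd:.2f}")   # ML = 0.67, SD = 0.27
```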
2.2 Hierarchical Dirichlet Learning over Groups of Pathogens

Dirichlet learning is a Bayesian approach for estimation of the parameters in binomial (or
multinomial) distributions. In this paper it will be assumed that a priori estimates of the parameters of the binomial distribution for susceptibility can be guessed from the susceptibilities averaged over pathogens that are assumed to be similar. It is assumed that the a priori distribution of the parameter follows the conjugate prior of the binomial distribution, which is the Beta distribution (or the Dirichlet distribution for the multinomial distribution). In the TREAT project a decision support system for advice on antibiotic treatment has been constructed [5]. As part of this construction, 40 such groups of pathogens with similar susceptibility properties have been identified by clinicians based on clinical knowledge. In Table 3 the 76 different pathogens from the bacteremia database have been allocated to 26 of these groups. Assume that a group of n similar pathogens has been identified, the pathogens being indexed by j ∈ {1, …, n}. On a number of occasions the susceptibility of these pathogens to a certain antibiotic (indexed by i) has been tested, Ni1, …, Nin times respectively, with the counts
of susceptibility being Mi1, …, Min, respectively. The average susceptibility Pi of this group is

Pi = (Mi1 + … + Min)/Ni, where Ni = Ni1 + … + Nin.   (1)
The ML estimator of susceptibility of a pathogen, MLij = Mij/Nij, is now replaced by the Dirichlet estimator

Pij = (βi + Mij)/(αi + Nij),   (2)

where βi and αi are imaginary counts, βi = αi × Pi representing positive outcomes in the binomial distribution and αi representing the imaginary sample size, inherited from the pathogen group. Thus, αi indicates how strong the confidence is in the a priori distribution of the parameters, and βi/αi can be used as the a priori estimate of the parameter of the binomial distribution, i.e. as an estimate of the susceptibility averaged over the pathogen group. We let all αi assume the value A, except that we impose an upper limit on each αi:

αi = min (A, Ni),   (3)

since it is not reasonable to let the imaginary sample size αi exceed the number of counts Ni actually available for the group. If A = 0, then the Dirichlet estimate becomes equal to the ML estimate. If A → ∞, then the Dirichlet estimate becomes equal to the grouped estimate Pi. In the next section it will be shown that a "suitable" value for A can be determined empirically.
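A direct transcription of Equations 1–3 is sketched below; it reproduces the Proteus spp./tobramycin example worked out in Section 3.2 (group counts 72/99, A = 4), with function and argument names of our own choosing.

```python
def dirichlet_estimate(M_ij, N_ij, M_group, N_group, A):
    """Dirichlet estimator P_ij = (beta_i + M_ij) / (alpha_i + N_ij), Eqs. 1-3."""
    P_i = M_group / N_group        # average susceptibility of the group, Eq. (1)
    alpha_i = min(A, N_group)      # imaginary sample size, capped as in Eq. (3)
    beta_i = alpha_i * P_i         # imaginary count of susceptible isolates
    return (beta_i + M_ij) / (alpha_i + N_ij)

# Proteus spp. vs. tobramycin: M = 2, N = 3; Proteus group totals: 72/99; A = 4.
print(round(dirichlet_estimate(2, 3, 72, 99, 4), 2))   # 0.70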
Table 2 The counts of susceptibility, the number of isolates tested, the ML estimates and the Dirichlet estimators of susceptibility to tobramycin (i = 1) for seven pathogens belonging to the Proteus group
Pathogen               j   M1j   N1j   ML1j   P1j
Proteus spp.           1   2     3     0.67   0.7
Proteus mirabilis      2   39    49    0.80   0.79
Proteus vulgaris       3   1     1     1      0.78
Proteus penneri        4   2     2     1      0.82
Morganella morganii    5   19    20    0.95   0.91
Providencia spp.       6   4     10    0.4    0.49
Providencia stuartii   7   5     14    0.36   0.44
Sum of Proteus group       72    99    0.73   0.73
Table 3 Allocation of pathogens to the groups. Subgroups of Acinetobacter, Citrobacter and Other Gram-negative pathogen groups are placed in boxes.
1. Acinetobacter: Acinetobacter baumanii, Acinetobacter spp., Acinetobacter johnsoni, Acinetobacter junii, Acinetobacter lwoffi
2. Campylobacter: Campylobacter spp.
3. Candida: Candida tropicalis
4. Citrobacter: Citrobacter diversus, Citrobacter koserii, Citrobacter spp., Citrobacter freundii
5. Enterobacter: Enterobacter aerogenes, Enterobacter cloacae, Enterobacter gergoviae, Enterobacter sakazakii, Enterobacter spp.
6. Enterococcus: Enterococcus avium, Enterococcus faecalis, Enterococcus spp., Enterococcus durans, Enterococcus faecium
7. Eschericia coli: Eschericia coli
8. Gram Negative Anaerobe pathogen: Fusobacterium
9. Gram Positive Anaerobe pathogen: Peptostreptococcus
10. Gram Positive Rod pathogen: Bacillus spp., Corynebacterium aquaticum
11. Klebsiella: Klebsiella oxytoca, Klebsiella pneumoniae, Klebsiella spp.
12. Listeria: Listeria monocytogenes
13. Moraxella: Moraxella, Moraxella lacunata
14. Other Gram Negative pathogen: Alcaligenes xylosoxidans, Methylobacterium mesophilicum, Stenotrophomonas maltophila, Brevundimonas vesicularis, Chryseobacter meningosept., Sphingomonas paucimobilis, Serratia fanticola, Serratia marcescens, Serratia spp.
15. Other Gram Positive: Gemella spp.
16. Pneumococcus: Streptococcus pneumoniae
17. Proteus: Proteus spp., Proteus mirabilis, Proteus vulgaris, Proteus penneri, Morganella morganii, Providencia spp., Providencia stuartii
18. Pseudomonas: Pseudomonas aeruginosa, Pseudomonas alcaligenes, Pseudomonas cepacia, Pseudomonas fluorescens, Pseudomonas mendocida, Pseudomonas putida, Pseudomonas stutzerii, Pseudomonas spp.
19. Salmonella non-typhi: Salmonella enteritidis, Salmonella Group C
20. Staphylococcus negative: Staphylococcus coagulase-negative, Staphylococcus epidermidis
21. Staphylococcus positive: Staphylococcus coagulase-positive
22. Streptococcus Group A: Streptococcus Group A
23. Streptococcus Group B: Streptococcus Group B
24. Streptococcus Group D: Streptococcus Bovis, Streptococcus Bovis I, Streptococcus Bovis II
25. Streptococcus viridans: Streptococcus mitis, Streptococcus oralis, Streptococcus salivarius, Streptococcus viridans, Streptococcus acidominimus
26. Streptococcus: Streptococcus constellatus, Streptococcus Group F, Streptococcus Group G

2.3 Evaluation of the Quality of the Estimates

To evaluate the quality of the estimates a threefold cross-validation procedure is applied. The three years of data are divided into three periods, each containing data from one year. In turn, one of the three periods is designated as the validation set and the other two periods are designated as the training set and used for calculation of the estimators. We wish to evaluate how well the Dirichlet estimator Pij, calculated from the training set, predicts Fij, the observed frequency of susceptibility, calculated from the validation set. Fij is calculated as Fij = Mij/Nij. For this purpose we define the distance measure:

Dist = Σi Σj (Nij/N) (Pij – Fij)²,   (4)

where N = Σi Σj Nij.
This distance measure calculates the squared distance between Pij and Fij, weighted by the relative frequency of the pathogen. It can be interpreted as the average distance between the estimate derived from the learning set and the observed frequency in the validation set. It is a modified version of the Brier score [6], and algebraically it is easy to prove that any set of estimated Pij values that minimizes Dist also minimizes the Brier score. The procedure followed in the threefold cross-validation described above is graphically illustrated in Figure 1. Dist measures the averaged distance between the Dirichlet estimator from the training set and the observed frequency in the validation set. Since Pij is a function of A (see Eqs. 2–4), Dist is also a function of A. The value of A which minimizes Dist is the optimal size of the imaginary sample to be inherited from a pathogen group to individual pathogens.
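The distance measure can be transcribed as follows; the quadratic form mirrors the description above (a squared difference weighted by Nij/N), and the toy numbers are assumptions used only to show the calling convention.

```python
import numpy as np

def dist(P, F, N_counts):
    """Weighted squared distance between estimates P_ij and observed frequencies F_ij (Eq. 4):
    Dist = sum_ij (N_ij / N) * (P_ij - F_ij)**2, with N = sum_ij N_ij."""
    P, F, N_counts = map(np.asarray, (P, F, N_counts))
    return float(np.sum((N_counts / N_counts.sum()) * (P - F) ** 2))

# Toy example: 2 antibiotics x 3 pathogens, training-set estimates vs. validation-year frequencies.
P = np.array([[0.70, 0.79, 0.91], [0.60, 0.85, 0.75]])
F = np.array([[1.00, 0.80, 0.90], [0.50, 1.00, 0.70]])
N = np.array([[3, 49, 20], [2, 10, 12]])
print(dist(P, F, N))
```

The threefold cross-validation then amounts to computing dist once for each year held out in turn, with P estimated from the other two years, and averaging the results.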
3. Results

3.1 Comparison of ML Estimates for Individual Pathogens and Grouped Estimates

Only 10 of the pathogens in the bacteremia database (13% of the 76) have been isolated more than 50 times. The counts available for estimation of susceptibility are even smaller, because susceptibility is only tested for a selection of antibiotics. This indicates that the ML estimates of susceptibility for most combinations of pathogens and antibiotics in this database are too uncertain to be useful. When averaged over all pathogens and antibiotics, the distance between the estimated susceptibilities based on individual pathogens and the observed frequencies was 16.3% (Dist = 16.3%). If the estimates based on individual pathogens were replaced by estimates based on the groups of pathogens given in Table 3, then the distance between the estimators and the observed frequencies rose to 16.7%.
3.2 Hierarchical Dirichlet Learning over Groups of Pathogens

We shall now explore hierarchical Dirichlet learning over groups of pathogens, and we initially consider the Proteus group mentioned above, which has seven members (see Table 2). To illustrate Dirichlet learning of the susceptibility of a single pathogen to a single antibiotic, let us consider learning the susceptibility of Proteus spp. to tobramycin using the susceptibility data available for the other members of the Proteus group. (The procedure can be applied to any of the seven pathogens in the Proteus group.) First we assume a value for A, e.g. A = 4. This gives α1 = min (4, 99) = 4, because for the Proteus group N1 = 99 (see Table 2). The average susceptibility of the group is P1 = 72/99 = 0.728. Next we calculate β1 = α1 × P1 = 4 × 0.728 = 2.91. Finally we can calculate the Dirichlet estimator as P11 = (β1 + M11)/(α1 + N11) = (2.91 + 2)/(4 + 3) = 0.70. This result, along with the ML estimator and the Dirichlet estimators for the remaining members of the Proteus group, is shown in Table 2, assuming that A = 4.
Fig. 1 The procedure followed in the threefold cross-validation
An optimal value for A can be determined empirically by minimizing the distance in Equation 4. We have applied the distance measure for tobramycin across the Proteus group (the summation in Eq. 4 was performed across one antibiotic and the seven pathogens in the Proteus group). It was found that the distance reaches its minimum (20.2%) at A = 4 (Fig. 2a), which is therefore the optimal imaginary sample size to be used for calculation of the Dirichlet estimator. Note that the maximum value of Dist (25.8%) is observed at A = 0 and corresponds to the distance achieved by the ML estimator. The distance corresponding to the grouped estimator is observed as A → ∞ and is equal to 25.2%. Next we apply the same method to the Proteus group of pathogens, but averaged over all antibiotics. The result is given in Figure 2b, where it can be seen that for the
Proteus group the susceptibility estimates based on individual pathogens give Dist = 22.4% (the value of Dist for A = 0). The estimates based on the entire Proteus group give Dist = 22.7% (the value of Dist for A → ∞). The lowest value, Dist = 20.9%, is obtained for A = 2. Finally, the method is applied to all pathogen groups across all antibiotics. As mentioned above, the individual and group-based estimates give Dist values of 16.3% and 16.7%, respectively, and from Figure 3a (the full curve) it can be seen that the smallest value, Dist = 15.6%, is obtained for A = 1.
3.3 Revision of Groups of Pathogens

The value of the groups and of the group-based Dirichlet estimates depends on the quality of the groups. We therefore explored
Fig. 2 The distance measure Dist as a function of A for the Proteus group and a) tobramycin; b) all antibiotics. The filled circles represent the distances corresponding to A → ∞.
Fig. 3 The results of the Dirichlet learning applied to all pathogens in the database
whether dividing some of the pathogen groups into smaller subgroups might improve the estimates. Out of the 26 groups represented in the database 16 groups were considered not eligible for subdivision, either because the group consisted of a single pathogen (n = 1) or because the number of isolates in the group was very small (Ni < 10). The remaining 10 groups were divided into three categories, depending on whether the optimal value of A for each group was 0 < A < ∞, A → ∞ or A = 0. These three categories are considered in more detail in the following.
0 < A < ∞

For seven groups (Coagulase-negative Staphylococcus (n = 3), Proteus (n = 7), Pseudomonas (n = 9), Streptococcus viridans (n = 5), Enterococcus (n = 5), Citrobacter (n = 4) and Other Gram Negative pathogens (n = 18)) the Dirichlet learning has its minimum Dist with 0 < A < ∞. These groups were checked for similarity between the class members by the Mantel-Haenszel method across all antibiotics tested. All groups, except Coagulase-negative Staphylococcus, had pathogens with susceptibility significantly different from the rest of the group (p < 0.01). Each of those six groups was split into two or three more homogeneous groups. For the Proteus, Pseudomonas, Streptococcus viridans and Enterococcus groups the reduction of Dist resulting from such a split was rather modest, prompting us to keep the original definitions of the groups, as can, for example, be seen for the Proteus group in Figure 3b. Here the full curve, expressing the Dirichlet estimators derived using the original 7-member Proteus group,
is very close to the broken curve for the Dirichlet estimators derived using a 5-member Proteus subgroup and a 2-member Providencia subgroup. But for the Citrobacter and Other Gram Negative pathogen groups the split gives a considerable advantage. For example, the measure Dist calculated after splitting the Citrobacter group into two subgroups (the broken curve in Fig. 3c) is substantially smaller than Dist calculated for the original 4-member Citrobacter group (the smooth curve in Fig. 3c).
A → ∞

For the Enterobacter (n = 5) group the distance Dist decays continuously with increasing A, indicating a very good match between the members of this group.
A = 0

For the Acinetobacter (n = 5) and Klebsiella (n = 3) groups the minimum value of Dist was obtained for A = 0, indicating poor matching of the pathogens within the group. The split into new groups led to the decision to keep the original definition for Klebsiella (almost no difference between the curves in Fig. 3d) and to split the Acinetobacter group into two subgroups (Fig. 3e). Based on these considerations it was decided to subdivide the Citrobacter, Other Gram Negatives and Acinetobacter groups into two, three and two subgroups, respectively. These subgroups are marked in Table 3. The distance for the individual ML estimators is of course not affected by the subdivision (Dist = 16.3%), but the group-based estimators have a reduced Dist = 15.4%, compared to Dist = 16.7% before the split. The effect of the subdivisions on the Dirichlet learning across all pathogen groups is shown in Figure 3a (the broken curve). The optimal value of A is A = 3 (compared to A = 1 in the case of the original 26 groups) and this gives Dist = 15.0%, marginally smaller than the value Dist = 15.6% obtained before the split.
4. Discussion

The steady decrease of bacterial susceptibility to antibiotics due to the use of antibiotics places an upper limit on the practical size of the databases from which the susceptibilities are estimated. Grouping pathogens into groups with similar susceptibilities to antibiotics may be a useful strategy in the sense that it considerably reduces the burden of remembering susceptibilities and that it provides reasonable estimates for pathogens with very small counts. However, the results showed that under realistic assumptions about the upper limit on the number of bacterial isolates in the database these advantages came at the expense of a modest reduction in the accuracy of the estimates of susceptibility. An improvement of the estimates could be obtained by hierarchical Dirichlet learning, where the estimate is based both on data for the individual pathogen and on data for the group of pathogens. Based on the results from the database it seems that for 23 of the 26 groups of pathogens
represented in the database, the grouping inherited from the TREAT project could not be substantially improved. For three groups, Citrobacter, Other Gram Negatives and Acinetobacter, some improvement in the estimates could be achieved by splitting the groups into subgroups. After splitting these groups, the estimators based on pathogen groups were actually better than the estimators based on individual pathogens, reflecting the better match of pathogens within the subgroups. A further improvement in the estimates was obtained by hierarchical Dirichlet learning, where the optimal size of the imaginary sample inherited from the group was 3. One may argue that this is a relatively low number. The reason is, at least partially, that the size of the imaginary sample is limited by the size of the group. Since many groups have small counts, this reduces the effect of increasing A, because α remains small (see Eq. 3). However, even a value of 3 or 4 considerably stabilizes the estimates for pathogens with small counts. There is an element of regression towards the mean in the proposed method, and this applies both to the grouping alone and (to a smaller extent) to grouping with additional Dirichlet learning. It is difficult to argue that this is unconditionally a bad thing from a clinical point of view; for example, it could be argued that conservative estimates of susceptibilities for "high susceptibility" antibiotics may encourage caution in prescribing them. However, an important consideration is that Dirichlet learning is not only about (marginally) improving the statistics. It is also about providing some kind of credible estimate for rare pathogens. As an extreme (but not very rare) situation, consider the case where a pathogen has been identified in a sample from a patient, but no prior data exist on the susceptibility of this pathogen. It is not acceptable to withhold treatment simply because we lack susceptibility estimates, and grouping, with or without Dirichlet learning, provides a way out in most of these situations. The choice of the parameters (A and the group revisions) was made by looking at the results of the threefold cross-validation. Such a design gives optimistically biased estimates of classification accuracy. A fair comparison requires an external test set, or choosing the best value of A using only the training set in the cross-validation scheme
[7]. Unfortunately, the data set cannot be made any larger than it already is, and further subdivision into training, validation and test sets is thus not attractive. We emphasize that the formation of further groups is only a hypothesis that must eventually be tested on a new dataset, and that the reported values of the loss function Dist are optimistic, i.e., they represent training rather than generalization accuracy. It can be concluded that grouping of pathogens is a useful strategy. Grouping reduces the cognitive load of remembering susceptibilities and, if the groups are carefully defined, it improves the accuracy of the estimates of susceptibility. Indications for further subdivision were only found in three groups, and it can therefore be concluded that most of the pathogen groups originally defined by clinicians were well motivated. Both with and without further subgrouping, hierarchical Dirichlet learning allowed further improvement of the estimates, and the Dirichlet learning turned out to be more robust against suboptimal grouping of the pathogens, as it was able to improve estimates both for the original grouping of pathogens and for the revised grouping.
References
1. Andreassen S, Kristensen B, Zalounina A, Leibovici L, Frank U, Schønheyder H. Hierarchical Dirichlet learning – filling in the thin spots in a database. In: Proceedings of the 9th Conference on Artificial Intelligence in Medicine; 2003 Oct; Cyprus. Springer; 2003. pp 274–283.
2. Heckerman D. Tutorial on Learning with Bayesian Networks. In: Jordan M, editor. Learning in Graphical Models. Cambridge, MA: MIT Press; 1999.
3. Filho J, Wainer J. Using a hierarchical Bayesian model to handle high cardinality attributes with relevant interactions in a classification problem. In: Proceedings of the 12th International Joint Conference on Artificial Intelligence; Jan 2007; Hyderabad, India. pp 2504–2509.
4. Cestnik B. Estimating probabilities: A crucial task in machine learning. In: Proceedings of the 9th European Conference on Artificial Intelligence; 1990; Stockholm, Sweden. pp 147–149.
5. Andreassen S, Leibovici L, Paul M, Nielsen A, Zalounina A, Kristensen L, Falborg K, Kristensen B, Frank U, Schønheyder H. A probabilistic network for fusion of data and knowledge in clinical microbiology. In: Husmeier, Dybowski, Roberts, editors. Probabilistic Modeling in Bioinformatics and Medical Informatics. London: Springer; 2005. pp 451–472.
6. Brier G. Verification of forecasts expressed in terms of probability. Monthly Weather Rev 1950; 78: 1–3.
7. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Heidelberg: Springer; 2001.
Original Articles
DCE-MRI Data Analysis for Cancer Area Classification
U. Castellani1; M. Cristani1; A. Daducci2; P. Farace2; P. Marzola2; V. Murino1; A. Sbarbati2
1Department of Computer Science, University of Verona, Verona, Italy; 2Department of Morphological and Biomedical Sciences, Anatomy and Histology Section, University of Verona, Verona, Italy
Keywords DCE-MRI, cluster analysis, classification, SVM
Summary
Objectives: The paper aims at improving the support of medical researchers in the context of in-vivo cancer imaging. Morphological and functional parameters obtained by dynamic contrast-enhanced MRI (DCE-MRI) techniques are analyzed, with the aim of investigating the development of tumor microvessels. The main contribution is a machine learning methodology for automatically segmenting these MRI data by isolating tumor areas with different meaning in a histological sense.
Methods: The proposed approach is based on a three-step procedure: i) robust feature extraction from raw time-intensity curves, ii) voxel segmentation, and iii) voxel classification based on a learning-by-example approach. In the first step, a few robust features that compactly represent the response of the tissue to the DCE-MRI analysis are computed. The second step provides a segmentation based on the mean shift (MS) paradigm, which has recently been shown to be robust and useful for different and heterogeneous clustering tasks. Finally, in the third step, a support vector machine (SVM) is trained to classify voxels according to the labels obtained in the clustering phase (i.e., each class corresponds to a cluster). The SVM is thereby able to classify new, unseen subjects with the same kind of tumor.
Results: Experiments on different subjects affected by the same kind of tumor show that the regions extracted by both the MS clustering and the SVM classifier exhibit a precise medical meaning, as carefully validated by the medical researchers. Moreover, our approach is more stable and robust than methods based on quantification of DCE-MRI data by means of pharmacokinetic models.
Conclusions: The proposed method allows the DCE-MRI data to be analyzed more precisely and faster than previous automated or manual approaches.

Correspondence to:
U. Castellani
Department of Computer Science
University of Verona
Strada le Grazie 15
37134 Verona
Italy
E-mail: [email protected]

Methods Inf Med 2009; 48: 248–253
doi: 10.3414/ME9224
prepublished: March 31, 2009

1. Introduction
Machine learning techniques are becoming important to support medical researchers in analyzing biomedical data. For instance, in the context of cancer imaging, methods for the automatic isolation of areas of interest characterized by different tumoral tissue development are crucial for diagnosis and therapy assessment [1]. In this paper, morphological and functional parameters obtained by a dynamic contrast-enhanced MRI (DCE-MRI) acquisition system are analyzed by combining clustering (see footnote a) and classification techniques [2].

a In the following, we use the terms segmentation and clustering with the same meaning, i.e., a consistent partition of the data into classes with high inter-class variance and low intra-class variance [2].
DCE-MRI techniques represent noninvasive ways to assess tumor vasculature, based on dynamic acquisition of MR images after injection of suitable contrast agents and subsequent voxel-by-voxel quantitative analysis of the signal intensity time curves. Our method extends our previous work proposed in [3], and brings two advantages to the current state of DCE-MRI analysis. First, it allows a more stable and robust feature extraction step from raw DCE-MRI data. In fact, as highlighted in [4–6], the standard quantification of DCE-MRI data by means of pharmacokinetic models [7] suffers from large output variability, which is a consequence of the large variety of models employed. Here, we propose to work directly on the raw signals by extracting a few significant features which robustly summarize the time-curve shape of each voxel. Second, we focus on the automation of the whole data-analysis process by exploiting the effectiveness of machine learning techniques in the proposed application scenario. A three-step procedure is introduced: i) signal feature extraction, ii) automatic voxel segmentation, and iii) voxel classification. In the first step, a few compact features are computed, without the need for a free-parameter tuning procedure. Note that the same features are used for all subjects and for all kinds of tumor. In the second step, the subject voxels are clustered based on the previously extracted features, by adopting mean shift (MS) clustering [8]. Although the MS clustering approach allows a precise data segmentation, it requires careful tuning of a free parameter, namely the bandwidth [8]. For this reason, we propose to estimate this parameter on a small subset of the subjects, with the support of the medical researchers, who validate the segmentations. Then, in the third step, a classifier is trained to classify the tumoral regions according to the previously validated clustering results. A support vector machine (SVM) [9]
is applied as the classifier. In particular, voxels of the same cluster are fed into the classifier with the same label. In this fashion, the SVM becomes able to perform segmentations on new, unseen subjects with the same kind of tumor. In a previous paper [3], we proposed the introduction of MS clustering on the DCE-MRI data of tumoral regions. In that case, we focused on standard tumor microvessel parameters, such as transendothelial permeability (kPS) and fractional plasma volume (fPV), obtained voxel-by-voxel from intensity time curves. In this paper, we are inspired by recent works on the use of machine learning techniques for DCE-MRI tumor analysis [6, 10–12]. In [6] the curve patterns of the DCE-MRI pixels are analyzed in the context of musculoskeletal tissue classification. Several features are extracted to represent the signal shape, such as the maximum signal intensity, the largest positive signal difference between two consecutive scans, and so on. The classification is then carried out by a thresholding approach. In [10] the authors proposed the use of the MS algorithm [8] for the clustering of breast DCE-MRI lesions. In particular, pixels are clustered according to the area-under-the-curve feature. Since the results are over-segmented, an iterative procedure is introduced to automatically select the clusters which best represent the tumor. In [11] a learning-by-example approach is introduced to detect suspicious lesions in DCE-MRI data. The tumoral pixels are selected in a supervised fashion and fed to an SVM which is trained to perform a binary classification between healthy and suspicious pixels. The raw n-dimensional signal is used as the multidimensional feature vector. In [12] a neural network was applied to dynamic contrast agent MRI sequences as a nonlinear operator in order to enhance differences in the signal courses of pixels of normal and injured tissues. In this paper we emphasize the use of machine learning techniques as a means to produce stable and meaningful segmentation results in an automatic fashion. Indeed, the proposed approach speeds up the analysis itself, ensuring a higher throughput that turns out to be useful in the case of massive analyses.
2. The DCE-MRI Experimental Setup
The main purpose of DCE-MRI analysis is to accurately monitor the local development of cancer, possibly subject to different treatments. Tumor growth is critically dependent on the capacity to stimulate the development of new blood vessels (angiogenesis), which in turn provides the tumor tissue with nutrients. In consequence, various angiogenesis inhibitors have been developed to target vascular endothelial cells and to block tumor angiogenesis [13]. The traditional criterion to assess the tumor response to treatment is based on the local measurement of tumor size change [13]. But such methods of testing cytotoxic compounds might not be adequate for antiangiogenesis drugs, which are in fact mainly cytostatic, slowing or stopping tumor growth. Moreover, the vascular effect of antiangiogenesis drugs may precede, by a remarkably long time interval, the effect on tumor growth. Consequently, a different and more appealing indicator of cancer development has been analyzed, i.e., tissue vascularization [14]. DCE-MRI techniques play a relevant role in this field [14]. The final aim is to provide, in a noninvasive way, quantitative measures that indicate the level of vascularization in the cancer tissue, possibly treated with antiangiogenic compounds. The standard DCE-MRI analysis (see footnote b) can be divided into the following steps:
1) injecting contrast agents into the subject being analyzed;
2) acquiring MRI image sets of different slices of the tissues of interest;
3) extracting morphological and functional parameters, such as fractional plasma volume (fPV) and transendothelial permeability (kPS), that model the tissue vascularization; in practice, a pair of fPV and kPS values is associated with each point of the MRI image;
4) manually selecting a region of interest on the MRI slices, in order to isolate the highly vascularized local tumoral area;
5) averaging the values of fPV and kPS in such an area, obtaining for each slice a pair of fPV and kPS mean values that indicate the overall level of vascularization.

b The procedure listed above comes from the investigation detailed in [13], which in turn presents additional similar research.

Even if the use of fPV and kPS parameters is employed in recent research [13], such standard tumor microvessel parameters, based on the definition of a particular pharmacokinetic model, suffer from large output instability [4–6]. In this paper, we strongly improve the classical DCE-MRI analysis, providing an automatic method of tumoral tissue classification; the proposed technique is applied to this particular kind of analysis, but we expect it can also be applied more generally in the DCE-MRI context. In detail, we change steps 3, 4 and 5; our method takes as input the raw DCE-MRI signals and, in an automatic fashion, is able to segment areas that correspond to the tumoral area traditionally extracted by hand in step 4, driven by histological and physiological a-priori considerations.
3. Proposed Method
The proposed methodology is based on three main steps: i) signal feature extraction, ii) MS clustering, and iii) SVM classification. In the first step we extract standard curve parameters [6, 10]. The aim is to define a compact representation of the signal curve shape of each voxel which effectively summarizes the behavior expected by the medical researchers. In the second step we choose the MS method since it effectively performs a clustering of multidimensional data lying on a regular grid (i.e., the image) by combining spatial and feature relations into the same framework [8]. After this step, a few stable and meaningful regions are extracted. In order to improve the automation of the proposed system, the data (i.e., voxel) segmentation can be treated as a classification problem in which a classifier is trained to distinguish among the regions extracted by the clustering. In particular, the number of regions is fixed according to the expectations of the medical researchers. Therefore, in the third step we select the SVM as classifier. We highlight that SVMs are particularly suitable for our method since we naturally define an n-dimensional feature vector for each voxel, as required by the SVM framework. Moreover, SVMs have already shown their efficacy in several domains, performing a data-driven classification while being able to effectively generalize the results [2].
Fig. 1 Signal feature extraction: TTP, AUC, AUCTTP, and WR (see text)
3.1 Signal Feature Extraction
From the raw DCE-MRI signals, a few stable features are extracted. For each voxel, the time-intensity curve is divided by the pre-contrast signal intensity value in order to normalize the signal intensities output by the scanner. Furthermore, the data are filtered with a smoothing function to minimize errors due to outliers collected during the feature extraction step. In more detail, the extracted features are:
● Time to peak (TTP) is the time interval between contrast injection and the time of maximum signal intensity (SI).
● Area under the curve (AUC) is the integral of the time-intensity curve.
● AUCTTP is the integral of the time-intensity curve between contrast injection and the time of maximal signal intensity.
● Washout rate (WR) is the mean approximate derivative of the last part of the time-intensity curve.
Fig. 2 Equation 2
Note that the proposed features depend only on the time signal observed at a single voxel and are independent of the respective contextual neighborhood. In order to give the same weight to all of these features during the clustering step, a standardization procedure is performed, i.e., the range of each feature is normalized to the unit interval. Figure 1 shows a scheme of the visual meaning of the extracted features.
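As a concrete illustration of the four features, a minimal sketch is given below. The paper does not specify the smoothing function, the integration rule, or exactly which samples constitute the "last part" of the curve used for WR, so those choices (trapezoidal integration, last quarter of the samples) are assumptions.

```python
import numpy as np

def _trapezoid(y, x):
    """Explicit trapezoidal integration, kept simple for portability."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def dce_features(times, intensities, injection_time=0.0):
    """TTP, AUC, AUC_TTP and WR for one voxel's normalized time-intensity curve.

    Assumes `intensities` have already been divided by the pre-contrast value;
    the 'last part' of the curve used for WR is taken to be the last quarter.
    """
    t = np.asarray(times, dtype=float)
    s = np.asarray(intensities, dtype=float)

    peak = int(np.argmax(s))
    ttp = t[peak] - injection_time                     # time to peak
    auc = _trapezoid(s, t)                             # area under the whole curve
    auc_ttp = _trapezoid(s[:peak + 1], t[:peak + 1])   # area up to the peak
    tail = max(2, len(s) // 4)                         # assumed 'last part' of the curve
    wr = float(np.mean(np.diff(s[-tail:]) / np.diff(t[-tail:])))  # mean tail slope

    return {"TTP": ttp, "AUC": auc, "AUC_TTP": auc_ttp, "WR": wr}

def unit_range(feature_matrix):
    """Per-feature normalization to [0, 1], as done before clustering."""
    f = np.asarray(feature_matrix, dtype=float)
    lo, hi = f.min(axis=0), f.max(axis=0)
    return (f - lo) / np.where(hi > lo, hi - lo, 1.0)
```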
3.2 Mean-shift Clustering
The theoretical framework of the mean shift (MS) [8] arises from the Parzen-windows technique [2] which, under particular regularity hypotheses on the input space (such as independence among dimensions [8]), estimates the density at a point x as

$\hat{f}(x) = \frac{c_{k,d}}{n h^{d}} \sum_{i=1}^{n} k\left(\left\| \frac{x - x_i}{h} \right\|^{2}\right)$   (1)

where d indicates the dimensionality of the data processed, n is the number of points
available, and k(·) is the kernel profile that models how strongly the points are taken into account for the estimation, depending on their distance to x, as regulated by the bandwidth term h. Finally, c_{k,d} is a normalizing constant, depending on the dimensionality of the data and on the kernel profile. MS extends this "static" expression by differentiating (1) and obtaining the gradient of the density (see Fig. 2), where g(x) = δk(x)/δx. In Equation 2, the first term in square brackets is proportional to the normalized density gradient, and the second term is the mean shift vector Mv(x), which is guaranteed to point towards the direction of maximum increase in the density [8]. Therefore, the MS vector can define a path leading to a stationary point of the estimated density. The modes of the density are such stationary points. In more detail, starting from a point x in the feature space, the mean shift procedure consists in calculating the mean shift vector at x, which leads to a location y(1); this process is applied once again to y(1), producing location y(2), and so on, until a convergence criterion is met and a convergence location y is reached. The mean shift procedure is guaranteed to converge [8]. In MS-based clustering (from here on simply MS clustering), the first step is to apply the MS procedure to all the points {xi}, producing the convergence points {yi}. A consistent number of close convergence locations {yi}l indicates a mode μl. The clustering operation consists in marking the corresponding points {xi}l that produce the set {yi}l with the label l. This happens for all the convergence locations l = 1, 2, …, L. In this clustering framework, the only interventions required by the user are the choice of the kernel profile k(·) and the choice of the bandwidth value h. As usual, the Epanechnikov kernel is adopted as the kernel profile [8]. Note that, in this fashion, the meaning of the kernel bandwidth parameter is more intuitive. In fact, the kernel bandwidth parameter regulates the level of detail with which the data space is analyzed: a large bandwidth means a more general analysis (few convergence locations), while a small bandwidth leads to a finer analysis (many convergence locations). Here, we apply the MS algorithm in the 4-dimensional space defined by the signal feature extraction.
Fig. 3 Segmentation comparisons. On the left, the segmentation built on the kPS and fPV parameters; on the right, our segmentation. In the small boxes, we plot the mean signal (solid line), the median signal (dot-solid line), and the variance (dashed line), respectively.
Since the bandwidth selection is crucial to finding the correct segmentation (in the histological sense), we are supported by the medical researchers in this phase. Note that even subjects with the same tumor need different bandwidth settings. Therefore, we apply the MS clustering only to a subset of subjects with the same kind of tumor. Once the medical researchers have validated the clustering results, we use a classifier to distinguish the different kinds of tumoral tissue, since a classifier is more suitable for generalizing the results [2].
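A compact sketch of the MS procedure described in this section is shown below. With the Epanechnikov kernel the shadow profile g is uniform, so each mean shift step moves a point to the mean of its neighbors within the bandwidth; convergence locations that fall close together are then grouped into modes (clusters). The bandwidth h, the tolerance, and the mode-merging radius are illustrative choices, not the values used in the paper.

```python
import numpy as np

def mean_shift_step(x, points, h):
    """One application of the mean shift vector with the Epanechnikov profile.

    For the Epanechnikov kernel the shadow profile g is uniform, so the step
    moves x to the mean of the points falling inside the bandwidth h.
    """
    d2 = np.sum((points - x) ** 2, axis=1) / h ** 2
    inside = d2 < 1.0
    if not np.any(inside):
        return x
    return points[inside].mean(axis=0)

def mean_shift_cluster(points, h, tol=1e-3, max_iter=100):
    """Run the MS procedure from every point and group nearby convergence locations."""
    points = np.asarray(points, dtype=float)
    converged = []
    for x in points:
        y = x.copy()
        for _ in range(max_iter):
            y_new = mean_shift_step(y, points, h)
            if np.linalg.norm(y_new - y) < tol:
                break
            y = y_new
        converged.append(y_new)
    converged = np.asarray(converged)

    # Label points whose convergence locations lie close together (within h/2) as one mode.
    labels = -np.ones(len(points), dtype=int)
    modes = []
    for i, y in enumerate(converged):
        for m, mode in enumerate(modes):
            if np.linalg.norm(y - mode) < h / 2:
                labels[i] = m
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return labels, np.asarray(modes)
```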
3.3 SVM Classification
The classifier we use is the binary support vector machine (SVM) [9]. The SVM constructs a maximal-margin hyperplane in a high-dimensional feature space, by mapping the original features through a kernel function. Since the radial basis function (RBF) kernel is used, two parameters, C and γ, need to be estimated. According to the suggestions reported in [15], the data are normalized properly and the parameters are estimated by combining a grid search with leave-one-out cross-validation [2]. In order to extend the SVM to a multi-class framework, the one-against-all approach is carried out [2]. As
mentioned above, in our framework such a learning-by-example approach is introduced to better generalize the results. In fact, the SVM is able to automatically detect the most discriminative characteristics of the detected clusters. Moreover, the training phase is intuitive and the testing (i.e., the classification) is faster than the clustering itself.
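For the classification step the paper relies on LIBSVM [15]; the sketch below uses scikit-learn's wrapper around LIBSVM to illustrate the same recipe under stated assumptions: min-max normalization, an RBF kernel, a one-against-all multi-class scheme, and a grid search over C and γ with leave-one-out cross-validation. The grid values follow common practice and are not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def train_voxel_classifier(features, cluster_labels):
    """Train a one-against-all RBF-SVM on voxel features labeled by the MS clusters."""
    pipeline = make_pipeline(MinMaxScaler(), OneVsRestClassifier(SVC(kernel="rbf")))
    param_grid = {
        "onevsrestclassifier__estimator__C": [2.0 ** k for k in range(-5, 16, 4)],
        "onevsrestclassifier__estimator__gamma": [2.0 ** k for k in range(-15, 4, 4)],
    }
    search = GridSearchCV(pipeline, param_grid, cv=LeaveOneOut())
    search.fit(np.asarray(features), np.asarray(cluster_labels))
    return search.best_estimator_

# Usage: classify the voxels of a new, unseen subject with the same tumor type.
# model = train_voxel_classifier(train_features, train_labels)
# predicted_regions = model.predict(test_features)
```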
4. Results
The experiments performed in this paper are related to a series of investigations on the effects of a particular tumor treatment, using DCE-MRI techniques. Here, human mammary and pancreatic carcinoma fragments were subcutaneously injected into the right flank of 42 female rats at the median-lateral level. The details of the experiment are beyond the scope of this paper (see [13] for details). After the injection of a contrast compound into the animals, MRI images were acquired for tumor localization and good visualization of extratumoral tissues.
4.1 Signal Feature Validation
As a first experiment, we evaluate the effectiveness of the segmentation by comparing the clusters obtained with the signal features with those obtained with the standard tumor microvessel parameters. In both cases we carefully tune the bandwidth, in order to find the best segmentation according to histological principles supported by the medical researchers. Figure 3 shows one slice segmented with both approaches. Even if the segmentations appear visually similar, an accurate evaluation of the statistical properties of the obtained clusters reveals the better results obtained by the proposed approach. In Figure 3 the schemes show the mean curves of the DCE signals belonging to the same cluster. Besides the mean, the median and the variance are shown for each cluster. It is worth noting that 1) in general, good kPS- and fPV-based segmentations tend to be characterized by a large number of clusters. With a lower number of clusters, obtained by augmenting the MS bandwidth value, the segmentation degrades in quality. In the figure, we report the three most meaningful clusters, out of nine; 2) the intra-cluster variance is, in general, high; 3) the mean curves of the clusters do not differ from each other as much as expected. After our clustering process, instead, 1) the clusters are fewer in number and more meaningful; 2) the intra-class variance is lower, as compared to the other clustering approach; and 3) the profiles of the mean curves are coherent with the expected behavior of the signals (in a histological sense).
Fig. 4 Experiment 1: Clustering results obtained with the mean shift algorithm (a) and the SVM classifier (b) respectively. The curves of the mean-signals are also visualized for all the clusters.
In more detail, in the necrotic, poorly vascularized area (i.e., cluster 3) the contrast agent concentration increases linearly. On the contrary, the active area (i.e., cluster 2) shows a more rapid enhancement. The peak of the curve is reached in the early part of the signal and then it decays slowly. Finally, DCE signals of the area associated with cluster 1
show a rapid enhancement with a slow decay, meaning that zones of tissue previously vascularized are approaching a necrotic state.
4.2 Pipeline Validation
Following the proposed pipeline, we complete the experiment by segmenting a further subject (besides the subject used for signal feature validation) affected by the same kind of tumor as in the previous cases. As mentioned before, different parameters are used to estimate the best clustering results in the two subjects. Therefore, the SVM is trained to recognize the three extracted classes. The tissue classification is then performed on a third, unseen subject with the same tumor. Figure 4 shows four slices of both the segmentation obtained from the training and the
Fig. 5 Experiment 2: Clustering results obtained with the mean shift algorithm (a) and the SVM classifier (b) respectively. The curves of the mean-signals are also visualized for all the clusters.
testing subjects, respectively. Moreover, the respective statistics collected on each cluster provided by the classifier are also shown. Note that the extracted regions and the respective statistics in both cases a and b of Figure 4 exhibit the same behavior. The classification has been validated in two ways. The first is based on the medical researchers' analysis, which confirmed the observations described above. As a second validation, we applied the MS clustering also to the third subject, again by carefully tuning the parameters. Using the newly obtained clustering results as ground truth, the SVM-based voxel classification reached 89% accuracy. The same pipeline is applied to three subjects with a new kind of tumor. Again, subjects 1 and 2 are used to train the SVM, while subject 3 represents the test. Figure 5 shows the clustering results and the related statistics. Also in this case the behavior of the SVM classifier is coherent with the clustering results and in accordance with the expectations of the medical researchers. Note that in both experiments the estimated tumoral regions correspond to the ones segmented by hand at steps 4 and 5 of the classical DCE-MRI analysis discussed in Section 2. We have carried out further experiments on new subjects affected by different kinds of tumors, observing essentially the same behavior: stable clusters of meaningful regions, as expected by the medical researchers.
5. Conclusions
In this paper, we introduce a new methodology aimed at improving the analysis and the characterization of tumor tissues. The
multidimensional output obtained by a noninvasive tissue analysis technique, namely dynamic contrast-enhanced MRI (DCE-MRI), is considered. The signals of each voxel are parameterized by a few compact features which robustly summarize the shape of the signals, as expected by the medical researchers. We show that the proposed signal features perform better than standard tumor microvessel parameters in segmenting the data. Moreover, we show the effectiveness of the proposed method based on the combination of clustering and classification techniques. The obtained results reveal a histologically meaningful partition that identifies tissue zones differently involved in the development of the tumor. The proposed method achieves two goals: 1) it permits a more precise analysis of the tissue, and 2) it is faster than the manual analysis classically performed. These two results show that the proposed machine learning approach performs well on medical segmentation and classification problems in the DCE-MRI context.
References
1. Lau PY, Ozawa S, Voon FCT. Using Block-based Multiparameter Representation to Detect Tumor Features on T2-weighted Brain MRI Images. Methods Inf Med 2007; 46: 236–241.
2. Duda RO, Hart PE, Stork DG. Pattern Classification. John Wiley and Sons, second edition; 2001.
3. Castellani U, Cristani M, Combi C, Murino V, Sbarbati A, Marzola P. Visual MRI: Merging information visualization and non-parametric clustering techniques for MRI dataset analysis. Artificial Intelligence in Medicine 2008; 44 (3): 171–282.
4. Buckley DL. Uncertainty in the analysis of tracer kinetics using dynamic contrast enhanced T1-weighted MRI. Magnetic Resonance Med 2002; 47: 601–606.
5. Harrer JU, Parker GJ, Haroon HA, Buckley DL, Embelton K, Roberts C, et al. Comparative study of methods for determining vascular permeability and blood volume in human gliomas. Magnetic Resonance Imaging 2004; 20: 748–757.
6. Lavini C, de Jonge MC, van de Sande MGH, Tak PP, Nederveen AJ, Maas M. Pixel-by-pixel analysis of DCE MRI curve patterns and an illustration of its application to the imaging of the musculoskeletal system. Magnetic Resonance Imaging 2007; 25: 604–612.
7. Tofts PS, Brix G, Buckley DL, Evelhoch JL, Henderson E, Knopp MV, et al. Estimating kinetic parameters from dynamic contrast enhanced T1-w MRI of a diffusable tracer: standardized quantities and symbols. Magnetic Resonance Imaging 1999; 10: 223–232.
8. Comaniciu D, Meer P. Mean shift: A robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 2002; 24: 603–619.
9. Burges C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 1998; 2: 121–167.
10. Stoutjesdijk MJ, Veltman J, Huisman MD, Karssemeijer N, Barents J, et al. Automatic analysis of contrast enhancement in breast MRI lesions using mean shift clustering for ROI selection. Journal of Magnetic Resonance Imaging 2007; 26: 606–614.
11. Twellmann T, Saalbach A, Muller C, Nattkemper TW, Wismuller A. Detection of suspicious lesions in dynamic contrast-enhanced MRI data. Engineering in Medicine and Biology Society 2004. pp 454–457.
12. Leistritz L, Hesse W, Wustenberg T, Fitzek C, Reichenbach JR, Witte H. Time-variant analysis of fast-fMRI and dynamic contrast agent MRI sequences as examples of 4-dimensional image analysis. Methods Inf Med 2006; 45: 643–650.
13. Marzola P, Ramponi S, Nicolato E, Lovati E, Sandri M, Calderan L, Crescimanno C, Merigo F, Sbarbati A, Grotti A, Vultaggio S, Cavagna F, Lo Russo V, Osculati F. Effect of tamoxifen in an experimental model of breast tumor studied by dynamic contrast-enhanced magnetic resonance imaging and different contrast agents. Investigative Radiology 2005; 40: 421–429.
14. Marzola P, Degrassi A, Calderan L, Farace P, Crescimanno C, Nicolato E, Giusti A, Pesenti E, Terron A, Sbarbati A, Abrams T, Murray L, Osculati F. In vivo assessment of antiangiogenic activity of su6668 in an experimental colon carcinoma model. Clin Cancer Res 2004; 2: 739–750.
15. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. 2001.
Original Articles
Intelligent Interactive Visual Exploration of Temporal Associations among Multiple Time-oriented Patient Records
D. Klimov; Y. Shahar; M. Taieb-Maimon
Medical Informatics Research Center, Ben Gurion University of the Negev, Beer-Sheva, Israel
Keywords Multiple patients, intelligent visualization, interactive visual data mining, temporal association, temporal abstraction, knowledgebased systems
Summary
Objectives: To design, implement and evaluate the functionality and usability of a methodology and a tool for interactive exploration of time and value associations among multiple-patient longitudinal data and among meaningful concepts derivable from these data.
Methods: We developed a new, user-driven, interactive knowledge-based visualization technique, called Temporal Association Charts (TACs). TACs support the investigation of temporal and statistical associations within multiple patient records among both concepts and the temporal abstractions derived from them. The TAC methodology was implemented as part of an interactive system, called VISITORS, which supports intelligent visualization and exploration of longitudinal patient data. The TAC module was evaluated for functionality and usability by a group of ten users, five clinicians and five medical informaticians. Users were asked to answer ten questions using the VISITORS system, five of which required the use of TACs.
Results: Both types of users were able to answer the questions in reasonably short periods of time (a mean of 2.5 ± 0.27 minutes) and with high accuracy (95.3 ± 4.5 on a 0–100 scale), without a significant difference between the two groups. All five questions requiring the use of TACs were answered with similar response times and accuracy levels. Similar accuracy scores were achieved for questions requiring the use of TACs and for questions requiring the use only of general exploration operators. However, response times when using TACs were slightly longer.
Conclusions: TACs are functional and usable. Their use results in a uniform performance level, regardless of the type of clinical question or user group involved.

Correspondence to:
Denis Klimov
Medical Informatics Research Center
Department of Information Systems Engineering
Ben Gurion University of the Negev
P.O. Box 653
Beer-Sheva 84105
Israel
E-mail: [email protected]

Methods Inf Med 2009; 48: 254–262
doi: 10.3414/ME9227
prepublished: April 20, 2009
1. Introduction: Interactive Exploration and Mining of Time-oriented Data of Multiple Patients
1.1 Exploration of Multiple Patient Data
Analysis of large patient population data, such as clinical trial results, quality assessments of clinical management processes, and the search for temporal patterns, requires a tool that provides aggregate views of clinically meaningful interpretations of the longitudinal data of multiple patients. Current visual data mining methods [1, 2] and visual exploration systems for multiple patient data, particularly in medicine, typically focus only on raw patient data, either via the design of visualization techniques for data exploration [3, 4] or through visual mining [5]. The user, however, requires additional cognitive and computational mechanisms to derive meaningful conclusions from the results of the analysis. Aigner et al. [6] give an overview of several visualizations of time-oriented data and emphasize the importance of the integration of visual, analytical and user-centered methods. To derive meaningful patterns and interpretations, called temporal abstractions, from the raw time-oriented patient data, we have been using the knowledge-based temporal-abstraction (KBTA) method [7]. We separate the concepts in the domain ontology into raw concepts (e.g., hemoglobin value, or age) and abstract concepts (temporal abstractions) (e.g., levels of anemia), which are derivable from raw concepts. Through a domain-specific temporal-abstraction knowledge base acquired from a domain expert using appropriate tools [8, 9], this method derives inter-
val-based temporal abstractions, such as the pattern "a period of more than two months of grade I or higher bone-marrow toxicity, followed within three months by a decrease in liver functions", as this pattern is defined in the context of a particular oncology therapy protocol. The temporal abstractions computed by the KBTA method for an individual patient or for a small number of patients (typically fewer than ten) can be efficiently visualized through an ontology-driven interface, called KNAVE-II [10], which has been shown to be functional and usable [11]. To fully support the exploration of multiple, time-oriented patients' data (including both raw data and derivable temporal abstractions), we designed and developed an
enhanced extension of the KNAVE-II system, a new system called VISualizatIon of Time-Oriented RecordS (VISITORS), which integrates knowledge-based temporal reasoning mechanisms for deriving temporal patterns and abstractions with information visualization methods for the display, exploration, and analysis of associations among multiple patients' records. Unique to the VISITORS system is its temporal focus, supported by its time-oriented exploration capabilities, which we refer to as general exploration operators: unlike in the KNAVE-II system, data of one or of multiple patients can be both aggregated at, and explored within, various temporal granularities, such as hour, day, and month (or other specific time periods).
The VISITORS system supports both a calendaric (absolute) timeline and a timeline relative to special events (e.g., the months following a bone-marrow transplantation event) (Fig. 1). Moreover, and quite differently, the VISITORS system enables users to interactively specify temporal and knowledge-based constraints, through a graphical query-building module, which define the patient subsets selected for exploration (e.g., the lists of patients displayed in panel A of Fig. 1). Underlying the query-building module is the ontology-based temporal-aggregation (OBTAIN) query language. The query-building interface enables three types of queries supported by the OBTAIN query language: Select Patients (Who?), Select Time Intervals (When?), and Get Patient Data (What?).

Fig. 1 The VISITORS main interface, in this case in an oncology domain. The two top panels (denoted by A) display lists of patients and lists of time intervals, returned as answers to previous queries. The graphs (denoted by B) show the data, for a group of 58 patients, of the white blood cell count (WBC) raw concept (graph 1) and the monthly distribution of the Platelet-state abstract concept values during 1995 (graph 2). Graph 3 shows the monthly distribution of the Hemoglobin-state abstract concept values during the first year (relative timeline) following bone-marrow transplantation. The left panel (denoted by C) of the interface includes a browser of the knowledge base in the oncology domain.
Each query retrieves either a list of patients, a list of relevant time intervals, or a list of time-oriented patient data sets, respectively; combinations of these can be manipulated by the user according to which patients, time periods and data values are to be analyzed further. For example, a typical Select Patients query would be "Find patients who had, during the first month following a bone marrow transplantation (BMT), at least one episode of bone-marrow toxicity (an abstraction defined by the clinical protocol) of grade I or higher, which lasted at least two days"; a typical Select Time Intervals query would be "Find periods (relative to the BMT event) during which the WBC-state value was less than 'normal' for at least 50% of the patients". A full exposition of the OBTAIN language and query-building module is outside the scope of this study and can be found elsewhere [12]. Similarly, we refer to Klimov and Shahar [13] for details on the semantics of the VISITORS interface and of its general data exploration operators.
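The concrete OBTAIN syntax is not shown in this paper. Purely as an illustration of what the example Select Patients query computes, a roughly equivalent filter over a hypothetical table of pre-computed interval-based abstractions might look as follows; the column names and the toy data are ours, not part of VISITORS or OBTAIN.

```python
import pandas as pd

# Hypothetical, pre-computed interval-based abstractions (one row per episode);
# start/end are in days relative to the bone-marrow transplantation (BMT) event.
episodes = pd.DataFrame([
    {"patient": "p1", "concept": "bone-marrow toxicity", "grade": 1, "start": 5,  "end": 9},
    {"patient": "p2", "concept": "bone-marrow toxicity", "grade": 2, "start": 40, "end": 45},
    {"patient": "p3", "concept": "bone-marrow toxicity", "grade": 1, "start": 12, "end": 12},
])

# "Find patients who had, during the first month following BMT, at least one
# episode of bone-marrow toxicity of grade I or higher lasting at least two days."
# (Here 'during the first month' is interpreted as the episode lying within days 0-30.)
mask = (
    (episodes["concept"] == "bone-marrow toxicity")
    & (episodes["grade"] >= 1)
    & (episodes["start"] >= 0) & (episodes["end"] <= 30)
    & (episodes["end"] - episodes["start"] + 1 >= 2)
)
selected_patients = sorted(episodes.loc[mask, "patient"].unique())
print(selected_patients)   # ['p1'] in this toy example
```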
1.2 Exploration of Potentially Meaningful Temporal Associations
One of the more interesting tasks in the analysis of multiple patient data is an investigation of a new, potentially meaningful interrelation, especially a temporal interrelation, which we refer to as a temporal-association task, within a set of raw patient data and abstract concepts. An example of a system in the medical domain for visualizing associations among multiple patients is The Cube [14]. The Cube enables interactive recognition of patient patterns through a 3D display of a set of 2D parallel diagrams (each using a horizontal time axis and a vertical value axis), where each diagram represents a patient attribute (e.g., allergies). Thus, similar patients might have parallel lines connecting the different 2D diagrams. However, The Cube works only with raw data and does not automatically aggregate similar patients into groups. To fully support the temporal-association task, we developed the Temporal Association Chart (TAC) module (Sections 2 and 3), which we designed and implemented within the VISITORS system. We demonstrate our ideas using examples from the oncology
domain (Section 4). To evaluate the VISITORS system, and in particular the TAC module, five clinicians and five medical informaticians answered ten clinical questions, five of which required the use of TACs (Section 5).
2. Aggregation of Patients' Time-oriented Data and of their Abstractions
To aggregate patients' data at arbitrary temporal granularities or during specific time periods, we defined the concept of a delegate function, which computes a delegate value: given a single patient's time-oriented data for a specific concept (raw or abstract) over a particular time interval (including a predefined granularity level), we calculate the delegate value of the patient's data at each time granule (or at some other particular time period) using a function specific to the concept, the context, and the temporal granularity. Such a calculation is performed for each patient in the group. More formally, the input data have the following structure:

input_data ≡ <Patient_n, Concept_c, Ts_{n,m}, Te_{n,m}, value_{c,n,m}>*, 1 ≤ n ≤ N, 1 ≤ m ≤ M_n,

where N is the number of patients, M_n is the number of values of Concept_c for Patient_n, and Ts_{n,m} and Te_{n,m} are the start and end times of the m-th observation (or temporal abstraction) for Patient_n, with value value_{c,n,m}. The delegate value for Patient_n of Concept_c within a specific aggregation time period [Ts_agg, Te_agg] is computed by the concept-specific delegate function DF_concept from the input_data as follows:

delegate_value_{c,n,Ts_agg,Te_agg} = DF[(Ts_{n,1}, Te_{n,1}, value_{c,n,1}), …, (Ts_{n,i}, Te_{n,i}, value_{c,n,i}), …, (Ts_{n,K}, Te_{n,K}, value_{c,n,K})], Ts_agg ≤ Ts_{n,i}, Te_{n,i} ≤ Te_agg, 1 ≤ i ≤ K,

where K = K_{c,n,Ts_agg,Te_agg} is the number of instances of Concept_c for Patient_n measured within the [Ts_agg, Te_agg] period. That is, K varies per concept, patient, and time period. The delegate function of each concept is defined in the knowledge base, or is chosen at runtime by the user from several predefined
default functions. For example, assume that the results of a patient's three blood glucose (BGL) observations on January 1 revealed the following values: 92 g/dl at 5 a.m., 140 g/dl at 11 a.m. and 182 g/dl at 8 p.m. If maximum is the default daily delegate function for BGL, then the patient had a daily delegate value of 182 g/dl for BGL. However, the user can choose another suitable delegate function (such as the mode or the mean). Indeed, for a granularity of months, it might be preferable to use the mean as the delegate function (applied to all raw data within each month). In the case of interval-based temporal abstractions, such as intervals of different grades of bone-marrow toxicity, we provide additional delegate functions, such as the value of the abstraction that has the maximal cumulative duration during the relevant time period. Indeed, in theory, almost any function from multiple values into one value (with the same units) can serve the role of a delegate function. However, it must be applied to each time interval in the relevant temporal granularity (e.g., day), and, of course, must make clinical sense, and is thus specific to each clinical concept and medical context.
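A minimal sketch of the delegate-value computation, reproducing the blood-glucose example from the text, is shown below; the handling of time granules and the set of available delegate functions are simplified, and the function and variable names are ours.

```python
from statistics import mean

# Candidate delegate functions; the knowledge base (or the user at runtime)
# decides which one applies to a given concept and granularity.
DELEGATE_FUNCTIONS = {"max": max, "min": min, "mean": mean}

def delegate_value(observations, ts_agg, te_agg, function="max"):
    """Delegate value of one concept for one patient within [ts_agg, te_agg].

    `observations` is a list of (ts, te, value) tuples, as in the input_data
    structure of Section 2; times here are plain numbers for simplicity.
    """
    in_period = [value for ts, te, value in observations
                 if ts_agg <= ts and te <= te_agg]
    if not in_period:
        return None                      # no data for this granule
    return DELEGATE_FUNCTIONS[function](in_period)

# The blood-glucose example from the text (times given as hours of January 1):
bgl = [(5, 5, 92), (11, 11, 140), (20, 20, 182)]
print(delegate_value(bgl, 0, 24, "max"))    # 182 (daily maximum)
print(delegate_value(bgl, 0, 24, "mean"))   # 138.0 if the mean is preferred
```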
3. Temporal Association Charts
The core of a Temporal Association Chart (TAC) is an ordered list of raw and/or abstract domain concepts (e.g., platelet state, hemoglobin value, WBC count), in a particular order determined by the user. Each concept is measured (or computed, in the case of a temporal abstraction) for a particular patient group during a concept-specific time period. The period can be different for each concept. Between every pair of consecutive concepts in the list, a set of relations among the delegate values of these neighboring concepts, for each patient, will be computed. If one of the concepts is raw, each relation will be between a delegate value of the first concept and a delegate value of the second concept for each patient. If both concepts are abstract, the relations between the delegate values for all patients will be aggregated into a set of extended relations (temporal association rules), one rule per combination of values from both concepts. Each
rule represents the set of patients who have had this particular combination of values for the two abstract concepts. TACs are created by the user in two steps. First, the user selects two or more concepts, using an appropriate interface (not shown here), possibly changing the order as necessary; second, the user selects the group of patients (e.g., from a list of groups retrieved earlier by Select Patients queries). In the current version, the VISITORS system does not recommend which concepts to select, nor the time periods in which to examine them, nor the group of patients. However, as we explain in the Discussion, we intend to combine the VISITORS system with purely computational tools (which we have been developing) for the detection of sufficiently common temporal associations.
3.1 Temporal Association Templates
A Temporal Association Template (TAT) is an ordered list of time-oriented concepts (TOCs) (|TOCs| ≥ 2), where each TOC denotes a combination of a raw or derived domain concept (such as a hemoglobin value or a bone-marrow toxicity grade) and a time interval <t_start, t_end>. Each TAT is thus a list L_TOCs of TOCs, that is:

L_TOCs = <TOC_1, …, TOC_i, …, TOC_I>, ∀i, 1 ≤ i ≤ I, TOC_i ≡ <C_i, t^i_start, t^i_end>,

where C_i ∈ C (the set of domain concepts). The time stamps t^i_start and t^i_end define the time interval of concept C_i (using either absolute or relative time stamps). I is the number of concepts in the TAT. A concept can appear more than once in the TAT, but only within different time intervals. An example of a TAT listing the Hemoglobin-state and WBC-state abstract (derived) concepts and the Platelet-count raw-data concept, with their respective time intervals, would be <(Hemoglobin-state, 1/1/95, 31/1/95), (WBC-state, 1/1/95, 31/1/95), (Platelet count, 1/1/95, 31/1/95), (WBC-state, 1/2/95, 28/2/95)>. Note that once a TAT is defined, it can be applied to different patient groups. At runtime, a relation
will be created between each pair TOC_i and TOC_{i+1}, for each patient, such that the delegate value of concept C_i for that patient during [t^i_start, t^i_end] is a value val_i and the delegate value of concept C_{i+1} for that patient during [t^{i+1}_start, t^{i+1}_end] is val_{i+1}.

3.2 Application of a Temporal Association Chart Template to the Set of Patient Records
When applying a TAT to a set P of patient records including N patients, we get a Temporal Association Chart (TAC). A TAC is a list of instantiated TOCs and association relations (ARs), in which each instantiated TOC is composed of the original TOC of the TAT upon which it is based, and the patient-specific delegate values for that TOC within its respective time interval, based on the actual values of the records in P. Each set of associations denotes the associations between pairs of consecutive instantiated TOCs <TOC*_i, TOC*_{i+1}>, 1 ≤ i < I. To be included in a TAC, a patient P_n (1 ≤ n ≤ N) must have at least one value for each TOC of the TAT defining the TAC. The group of such patients is the relevant group (relevant patients). In the resulting TAC, each instantiated TOC*_i includes the original TAT TOC_i and the set of delegate values (one delegate value for each patient) of the concept C_i, computed using the delegate function appropriate to C_i from the set of patient data included within the respective time interval [t^i_start, t^i_end] as defined in the TAT:

TOC*_i ≡ {<TOC_i, val^n_i>, [Dist]}, 1 ≤ n ≤ N, 1 ≤ i ≤ I,

where val^n_i is the delegate value of C_i within the period [t^i_start, t^i_end] for patient P_n, N is the number of patients in the relevant group, and I is the number of concepts in the TAT. Dist is an optional distribution data structure {<val^l_i, Prop^l_i>}, where val^l_i is the l-th value of concept C_i, and Prop^l_i is its proportion within the group of patients P. The optional Dist structure is useful only for abstract concepts and supports the visualization of the relative proportion (i.e., distribution) of all the values of C_i for the N relevant patients, within the time interval of the instantiated TOC*_i (see Section 4).

Given a pair of instantiated TOCs <TOC*_i, TOC*_{i+1}>, the set of association relations (ARs) between them is:

AR ≡ {<val^n_i, val^n_{i+1}>}, 1 ≤ n ≤ N, 1 ≤ i < I,

where N is the number of relevant patients, and I is the number of concepts in the TAC. When at least one of the concepts is raw, the number of ARs between each pair of TOCs is equal to the number of relevant patients. Each AR connects the delegate values val^n_i and val^n_{i+1} of the pair of concepts C_i and C_{i+1}, during the relevant period of each concept, for one specific patient P_n. In the case of an abstract-abstract concept pair, we aggregate the ARs between two consecutive TOCs into groups, where each group includes a set of identical pairs of delegate values (one pair for each concept). Each such group denotes a temporal association rule (TAR) and includes:
● Support: the proportion of patients who have the combination of delegate values <val^n_{i,j}, val^n_{i+1,k}>, 1 ≤ j ≤ J, 1 ≤ k ≤ K, where val^n_{i,j} and val^n_{i+1,k} are the j-th and k-th allowed values of C_i and C_{i+1}, respectively. J and K are the numbers of different values of the concepts C_i and C_{i+1}. (We assume a finite number of (symbolic) values for each abstract concept.)
● Confidence: the fraction of patients for which, given the delegate value val^n_{i,j} of the concept C_i for patient P_n, the delegate value of concept C_{i+1} will be val^n_{i+1,k}, i.e., P[val^n_{i+1,k} | val^n_{i,j}].
● Actual number of patients: the number of patients who have this combination of values.

The number of possible TARs between two consecutive TOCs is thus J * K. The support and confidence measures are calculated as follows:

support(val^n_{i,j}, val^n_{i+1,k}) ≡ |{P^{i+1,k}_{i,j}}| / N
confidence(val^n_{i,j}, val^n_{i+1,k}) ≡ |{P^{i+1,k}_{i,j}}| / M
for 1 ≤ n ≤ N, 1 ≤ i < I, 1 ≤ j ≤ J, 1 ≤ k ≤ K,

where |{P^{i+1,k}_{i,j}}| is the number of patients whose delegate value for concept C_i was val^n_{i,j} and whose delegate value for concept C_{i+1} was val^n_{i+1,k}, N is the number of relevant patients,
M is the number of patients whose delegate value for concept C_i was val^n_{i,j}, I is the number of concepts in the TAC, and J and K are the numbers of symbolic values of concepts C_i and C_{i+1}, respectively.
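As a small illustration of these definitions, the sketch below computes the TARs between two consecutive abstract TOCs from per-patient delegate values: only patients with a delegate value for both concepts (the relevant group) are counted, and each rule carries its support, confidence, and actual number of patients. The data structures are simplified relative to the TAC described above.

```python
from collections import Counter

def temporal_association_rules(delegates_i, delegates_j):
    """TARs between two consecutive abstract TOCs.

    `delegates_i` and `delegates_j` map patient -> delegate value for concepts
    C_i and C_{i+1} within their respective time intervals.
    """
    relevant = sorted(set(delegates_i) & set(delegates_j))
    n = len(relevant)
    pair_counts = Counter((delegates_i[p], delegates_j[p]) for p in relevant)
    left_counts = Counter(delegates_i[p] for p in relevant)

    rules = []
    for (v_i, v_j), count in pair_counts.items():
        rules.append({
            "values": (v_i, v_j),
            "patients": count,                       # actual number of patients
            "support": count / n,                    # |P| / N
            "confidence": count / left_counts[v_i],  # |P| / M
        })
    return rules

# Toy example with platelet-state and HGB-state delegate values:
platelet = {"p1": "low", "p2": "low", "p3": "normal"}
hgb = {"p1": "moderately low", "p2": "moderately low", "p3": "normal"}
for rule in temporal_association_rules(platelet, hgb):
    print(rule)
```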
4. Display of TACs and Interactive Data Mining Using TACs
Figure 2 presents an example of a TAC computed by applying a TAT (defined by the user on the fly, using another interface, not shown here, that enables the user to select the TAT concepts). The TAT includes three hematological concepts (the platelet-state and hemoglobin (HGB) state abstractions, and the white blood cell (WBC) count raw concept) and two hepatic concepts (total bilirubin and the Alkphosphatase state abstraction), applied to a group of 58 patients selected earlier by the user. The visualization in Figure 2 shows the dis-
tribution of the values, using the optional Dist structure (see Section 3.2), for the abstract concepts HGB, Platelet, and Alkphosphatase states; it also shows each patient’s mean values for WBC count and total bilirubin during the year 1995. Delegate values of all adjacent concept pairs for each patient are connected by lines, denoting the ARs. Only 49 patients in this particular group happen to have data for all concepts during 1995. As described above, ARs among values of temporal abstractions provide additional statistical information. For example, the AR’s width indicates to the user the support for each combination of values, while the color saturation represents the level of confidence: a deep shade of red signifies high confidence while pink denotes lower confidence. The support, confidence, and the number of patients in each association are displayed numerically on the edge. For example, the widest edge in Figure 2 represents the relation between the “low” value of the platelet state
and the "moderately low" value of the HGB state during the respective time periods: 55.8% of the patients in the relevant patient group had this combination of values during these periods (i.e., support = 0.558); of the patients with "low" platelet-state values, 92.6% exhibited "moderately low" HGB-state values (i.e., confidence = 0.926); and this association was valid for 25 patients. Note that in this case both time periods are similar. Using direct manipulations [15], the user can dynamically apply a value and time lens in the TACs. Generally, the term direct manipulation is defined as a human-computer interaction style which involves continuous representation of the objects of interest, and rapid, reversible, incremental actions and feedback. In our case, the direct manipulations enable the user to interactively analyze the time and value associations among multiple patients' data:
Fig. 2 Visualization of associations among three hematological and two hepatic concepts for 49 patients during the year 1995. Association rules are displayed between the Platelet-state and HGB-state abstract concepts. The confidence and support scales are represented on the left.
● Dynamic application of a value lens enables the user to answer the question "how does constraining the value of one concept during a particular time period affect the association between multiple concepts during that and/or additional time periods?". The user can either select another range of values for the data of the raw concepts using trackbars, or select a subset of the relevant values in the case of an abstract concept. In future versions, we plan to also allow the user to vary the delegate function, to enable additional analyses.
● The system also supports the application of a time lens, by changing the range of the time interval for each instantiated TOC, including ranges on the relative timeline. The time lens can be especially useful for clinical research involving longitudinal monitoring.
In addition, the user can change the order of the displayed concepts, export all of the visualized data and associations to an electronic spreadsheet, and add or remove displayed concepts.
5. Evaluation of the Functionality and Usability of Temporal Association Charts 5.1 Research Questions We envision TACs as potentially useful for two user types: clinicians and medical informaticians. We also envision both groups using the system to answer different clinically motivated questions, while also using the general exploration operators of the VISITORS system. Thus, we defined the following three research questions.
5.1.1 Functionality and Usability Are clinicians and medical informaticians able to answer clinical questions that require the use of TACs at a high level of accuracy and within a reasonably short time? Furthermore, is the integrated VISITORS/TAC system usable when assessed using the SUS score [16]?
5.1.2 The Effect of the Clinical Question Are there significant differences in accuracy and response time when answering different clinical questions that require the use of TACs?
5.1.3 The Effect of the Interaction Mode Are there significant differences in accuracy or response time when answering questions that require only the general VISITORS exploration operators, as opposed to questions that require the use of TACs?
5.2 Measurement Methods and Data Collection When evaluating a new tool such as TACs, it is difficult to produce a control group. As far as we know, no known method duplicates the effect of using either VISITORS or TACs. Furthermore, the potential users simply cannot answer the complex questions (for which TACs are designed) other than by laborious computations. Thus, we chose an objectives-based approach [17]. In such an approach, certain reasonable objectives are defined for a new system, and the evaluation strives to demonstrate that these objectives have been achieved. In this case, we strove to demonstrate certain functionality and usability objectives of the TACs when evaluated within the context of a larger framework (VISITORS) for exploration of the time-oriented data of multiple patients. Our evaluation measures and specific research questions, listed below, reflected these objectives. The evaluation of the TACs was performed in the oncology domain. Ten participants, five medical informaticians (i.e., information system engineers who work in the medical domain) and five clinicians with different levels of medical training, were each asked to answer five clinical questions that require using TACs (the questions are listed in the Results section). None of the study participants was a member of the VISITORS development team. The five questions were selected in consultation with oncology domain experts. They represented typical questions relevant when monitoring a group of oncology patients, or when performing an analysis
of an experimental protocol in an oncology setting. The order of the questions was randomly permuted across participants. Each evaluation session with a participant started with a 20-minute tutorial that included a brief description of the VISITORS general exploration operators and of the TAC operators. A demonstration was given of the general and TAC operators, showing how several typical clinical questions are answered. The scope of the instruction was predetermined and included (after the demonstration) testing each participant by asking them to answer three clinical questions, one of which required the use of TACs. When the participant could answer the questions correctly, he or she was considered ready for the evaluation. The TACs evaluation study was performed as part of an overall feasibility and usability assessment of the VISITORS system. Another part of the evaluation involved testing the feasibility and usability of the general exploration operators by asking clinical questions such as "What were the maximal and mean monthly values of the WBC count during August 1995?". Throughout the evaluation, we used a retrospective database of more than 1000 oncology patients who had had a BMT event. The knowledge source used for the evaluation was an oncology knowledge base specific to the bone-marrow transplantation domain. Our goals for the TACs objectives-based evaluation were manifested in our evaluation measures. The functionality was assessed using two parameters: the time in minutes needed to answer a question, and the accuracy of the resultant answer. The accuracy score assigned to each possible answer (in each case measured on a scale of [0 ... 100], 0 being completely wrong and 100 being completely right) was predetermined by a medical expert. To test the usability of TACs, we used the System Usability Scale (SUS) [16], a common validated method for evaluating interface usability. The SUS is a questionnaire that includes ten predefined questions regarding the effectiveness, efficiency, and satisfaction of an interface. SUS scores range from 0 to 100. Informally, a score higher than 50 is considered to indicate a usable system.
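For readers unfamiliar with the SUS, the sketch below shows the standard published scoring rule (odd-numbered items contribute response − 1, even-numbered items contribute 5 − response, and the sum is scaled by 2.5). This is the generic scheme, not code taken from the VISITORS/TAC evaluation, and the example responses are invented.

def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 Likert responses.

    Items 1, 3, 5, 7, 9 are positively worded (contribute response - 1);
    items 2, 4, 6, 8, 10 are negatively worded (contribute 5 - response).
    The summed contributions (0-40) are scaled to 0-100 by multiplying by 2.5.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)   # i is 0-based, so even i = odd-numbered item
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)

# Invented example: a fairly positive questionnaire
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 2]))  # -> 82.5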
5.3 Analysis Methods 5.3.1 Functionality and Usability The users' capability to answer clinical questions using the TACs was assessed by calculating the means and standard deviations of the answers' accuracy scores and response times.
5.3.2 The Effect of the Clinical Question The effects of five clinical questions and of two groups of participants (i.e., medical informaticians and clinicians) on the dependent variables of response time and accuracy of answer were examined using two different two-way ANOVA tests with repeated measures (one for each dependent variable). The clinical question was a within-subject independent variable, and the group of participants was a between-subjects independent variable.
5.3.3 The Effect of the Interaction Mode The effects of the interaction mode (i.e., general exploration operators and TACs) and the group of participants (i.e., medical informaticians and clinicians) on the dependent variables of response time and accuracy of answer were examined using two different two-way ANOVA tests with repeated measures (one for each dependent variable). The interaction mode was a within-subject independent variable and the group of participants was a between-subjects independent variable. Since we did not find statistically significant differences among the response times (and among the resultant accuracy levels) of the different clinical questions of the same interaction mode, the mean value of the response time (and of the accuracy) of the five clinical questions of the same interaction mode was used as the dependent variable.
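A minimal sketch of such a two-way mixed-design ANOVA (one within-subject factor, one between-subjects factor) is shown below. It assumes the pingouin package is available and uses an invented long-format data frame; the column names (participant, group, question, accuracy) and the simulated data are assumptions for illustration only, not the study data or its analysis code.

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Long format: one row per participant x clinical question (invented data).
participants = [f"p{i}" for i in range(1, 11)]
groups = ["clinician"] * 5 + ["informatician"] * 5
rows = []
for pid, grp in zip(participants, groups):
    for q in range(1, 6):
        rows.append({
            "participant": pid,
            "group": grp,
            "question": f"Q{q}",
            "accuracy": float(np.clip(rng.normal(98, 3), 0, 100)),
        })
df = pd.DataFrame(rows)

# Two-way mixed ANOVA: 'question' is within-subject, 'group' is between-subjects.
aov = pg.mixed_anova(data=df, dv="accuracy", within="question",
                     subject="participant", between="group")
print(aov[["Source", "F", "p-unc"]])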
5.4 Results This section summarizes the evaluation results of the TAC in terms of the research questions.
5.4.1 Functionality and Usability
Table 1 summarizes the results according to the clinical questions used in the evaluation. The mean accuracy was 97.9 ± 3.4 (the median was 100 for all five questions, with an interquartile range of zero for questions one to four and 10 for question five). All participants successfully answered the clinical questions with mean accuracies greater than 90, while six of them achieved mean accuracies of 100. The range of mean accuracies per participant across all questions was [90 ... 100].
The mean response time was 2.7 ± 0.4 minutes. Only two participants needed more than 3 minutes (3.2 and 3.6 minutes) to answer. The range of mean response times per participant across all questions was [2.2 ... 3.0] minutes. The mean SUS score for all operators, across all participants, was 69.3 (over 50 is usable). The results of a t-test analysis showed that the mean SUS score of the medical informaticians (80.5) was significantly higher than that of the clinicians (58): [t (8) = 3.88, p < 0.01]. Conclusion: Based on the results of the TAC evaluation, we conclude that after a very short training period, the participants were able to answer the clinical questions with very high accuracy and within short periods of time. The SUS scores show that TACs are usable but still need to be improved.
5.4.2 The Effect of the Clinical Question
Both analyses yielded no significant effect. The interaction effect group × clinical question was not significant in either the ANOVA of the accuracy scores [F (4, 32) = 2.02, p = 0.12] or that of the response times [F (4, 32) = 0.73, p = 0.58]. There was no significant difference between the mean accuracy scores/response times of the clinicians (96.3 ± 4.3)/(2.6 ± 0.2 min) and of the medical informaticians (99.5 ± 0.9)/(2.9 ± 0.5 min); accuracy: [F (1, 8) = 2.72, p = 0.14, regression
Table 1 Details of the questions that all participants had to answer, with mean response times (minutes) and accuracy scores

N | Clinical question | Response time (mean ± s.d.) | Accuracy score [0..100] (mean ± s.d.)
1 | What percentage of the patients has had a "low" delegate value of the Platelet-state concept? What percentage of the patients has had a "moderately low" delegate value of the HGB-state concept? What percentage of the patients has had both a "low" value of the Platelet-state concept and a "moderately low" value of the HGB-state concept? | 2.4 ± 0.5 | 99.9 ± 0.3
2 | What delegate value of the HGB-state derived concept was the most frequent among the patients who have had a "low" aggregate value of the Platelet-state? | 2.4 ± 0.5 | 99.8 ± 0.6
3 | What were the maximal and minimal delegate values of the WBC count for patients who have the HGB-state delegate value "moderately low"? What were the maximal and minimal delegate values of the RBC for patients who have had a delegate HGB-state value that was "normal"? | 2.9 ± 1.0 | 96.0 ± 9.0
4 | What is the distribution of the delegate values of the Platelet-state in patients whose minimal delegate value of the WBC count raw concept was 5000 cells/ml (instead of the previous minimal value)? What were the new maximal and minimal delegate values of the RBC? | 3.0 ± 0.7 | 98.8 ± 4.0
5 | What percentage of the patients has had a "low" delegate value of the Platelet-state during both the first and second month following bone-marrow transplantation? | 3.0 ± 0.8 | 96.0 ± 5.0
coefficient ± sd = 3.2 ± 1.95]; time: [F(1, 8) = 1.90, p = 0.20, regression coefficient ± sd = 0.3 ± 0.26]. There was also no significant difference between the mean accuracy scores/response times of the five clinical questions (for mean accuracy ± sd/response times ± sd, see Table 1); accuracy: [F(4, 32) = 2.29, p = 0.08, maximal estimated effect ± sd = 4.8 ± 3.1]; time: [F(4, 32) = 2.24, p = 0.09, maximal estimated effect ± sd = 0.6 ± 0.33]. Conclusion: Neither the clinical question nor the group seems to affect the accuracy or the response time of answers provided using the TAC module.
5.4.3 The Effect of the Interaction Mode The results of the ANOVA of the accuracy scores showed that the interaction effect group × interaction mode was not significant [F(1, 8) = 2.64, p = 0.14]. There was no significant difference between the mean accuracy scores of the clinicians (97.7 ± 3.0) and of the medical informaticians (99.8 ± 0.4); [F(1, 8) = 2.37, p = 0.16, regression coefficient ± sd = 2.1 ± 1.4]. There was also no significant difference between the mean accuracy scores of the answers obtained by using the general exploration operators (99.5 ± 1.6) and by using the TACs (97.9 ± 3.4); [F(1, 8) = 4.80, p = 0.07, regression coefficient ± sd = 2.3 ± 1.6]. With respect to the response time, the results of the analysis showed that the only significant effect was the main effect of the interaction mode [F(1, 8) = 14.96, p < 0.01]: a mean of 2.2 ± 0.18 minutes for answering the clinical questions when using the general exploration operators of VISITORS, and a mean of 2.7 ± 0.43 minutes when using the TACs. There was no significant difference between the mean response times of the clinicians (2.6 ± 0.2 min) and the medical informaticians (2.9 ± 0.5 min); [F(1, 8) = 3.24, p = 0.11, regression coefficient ± sd = 0.3 ± 0.16]. The interaction effect group × interaction mode was also not significant [F(1, 8) = 0.42, p = 0.54]. Conclusion: Interaction mode does not seem to affect the accuracy of the answers to clinical questions. The mean time needed to answer the clinical scenarios using the TACs is
significantly higher than when using the general exploration operators of VISITORS, but it is still less than three minutes.
5.5 Results of Power Analysis Since we did not find significant effects for research questions 2 and 3, we performed a statistical power analysis. For each two-way ANOVA test in the results of research question 2 (for each dependent variable), we performed a power analysis, both for each main effect, namely the five clinical questions and the two groups of participants, and for the interaction effect (questions × groups). In the case of research question 3, the main effects for which the power analysis was performed were the two interaction modes (general exploration operators vs. TACs) and the two groups of participants, as well as the interaction effect (modes × groups). The results showed that, in the case of accuracy, assuming a meaningful difference of at least 5 points on a scale of [0 ... 100], the experiment (i.e., N = 10, α = 0.05) would detect an effect (i.e., a difference between the means of the groups) with a probability of at least 80%, which is considered a reasonable power; the same holds for response time, assuming a meaningful difference of at least one minute. Moreover, the power analysis showed that with a larger group of 20 participants (10 clinicians and 10 medical informaticians), a smaller effect of only two points in the mean accuracy (the minimal effect size obtained in our analysis) could be detected with a probability of 80%. However, we consider a difference of two points or less to be relatively insignificant for practical purposes. Similarly, in the case of response times, an effect size of 0.3 min could be detected with a sample size of 54 participants (26 clinicians and 26 medical informaticians).
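For readers who want to run this type of sensitivity check themselves, the sketch below shows how such a calculation can be set up with statsmodels. It deliberately uses a plain two-sample t-test approximation with an assumed within-group standard deviation of 3.4 accuracy points, so it does not model the repeated-measures design of the study and will not reproduce the exact figures reported above; it only illustrates the mechanics of solving for power or sample size.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Generic illustration: a 5-point difference on the accuracy scale, expressed
# as Cohen's d with an assumed within-group s.d. of 3.4 points.
effect_size = 5.0 / 3.4

# Power of a plain two-sample t-test with 5 participants per group (alpha = 0.05).
power = analysis.solve_power(effect_size=effect_size, nobs1=5,
                             alpha=0.05, ratio=1.0, alternative="two-sided")
print(f"power with 5 participants per group: {power:.2f}")

# Participants per group needed to reach 80% power for the same effect.
n_per_group = analysis.solve_power(effect_size=effect_size, power=0.80,
                                   alpha=0.05, ratio=1.0)
print(f"participants per group for 80% power: {n_per_group:.1f}")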
6. Discussion This paper presents the Temporal Association Chart, a computational and interaction module that enables users to graphically explore and analyze the time and value associations among domain concepts that explicitly or
implicitly exist within multiple time-oriented patient records. Moreover, it enables the exploration of 1) intelligent interpretations (temporal abstractions) of the raw data, derived using context-sensitive, domain-specific knowledge, and 2) temporal aggregations of the patient data summarized within several specific time periods (including the use of temporal granularities), obtained by applying delegate functions to each concept and each temporal granularity. The associations displayed between pairs of consecutive abstract concepts include support and confidence measures that can be interactively investigated through manipulation by the user. Note that when the time periods of a pair of concepts are the same, these measures are an interval-based extension of the familiar data mining measures, using delegate functions. When the time periods are different, an extension of temporal association rules, commonly called sequence mining, emerges, again using delegate functions, which allows multiple time granularities and which does not necessitate the simultaneous existence of different concepts (known as items). The evaluation of the TAC module, which is integrated within the VISITORS system, has demonstrated its functionality and usability. The only significant difference between the TACs and the general exploration operators was a slightly longer response time. A possible reason for not detecting significant differences in the accuracy scores when using different interaction modes was that the evaluation included a relatively small group of participants and questions. However, it should be noted that the variance among accuracy scores was quite low for both interaction modes, with all of the participants achieving scores above 90. Thus, the absence of a significant effect could not be attributed to random differences and high variability in each interaction mode. This conclusion was also supported by the results of the power analysis. Although applying ANOVA in the context of a ceiling effect in the accuracy scores is potentially problematic, this phenomenon, in our judgment, did not have a significant effect on the conclusions of the study. One possible conceptual limitation of the TAC approach is the use of a goal-directed (user-driven) method for temporal data mining. Thus, the user must have a meaningful intuition regarding the selection of the
necessary concepts to explore. However, this limitation can be overcome by combining the TAC module with a knowledge-based temporal data mining method, such as the one we have been developing [18]. Associations that have sufficient support are automatically flagged and, in the future, could be visually explored. To summarize, we conclude that TACs might be described as "intelligent equalizers" that result in a uniform performance level with respect to answering complex time-oriented clinical statistical-aggregation questions, regardless of the question asked or the user type.
Acknowledgments This research was supported by Deutsche Telekom Labs at Ben Gurion University and the Israeli Ministry of Defence, BGU award No. 89357628-01.
References 1. Bonneau G, Ertl T, Nielson G. Scientific Visualization: The Visual Extraction of Knowledge from Data. New York: Springer-Verlag; 2005. 2. Soukup T, Davidson I. Visual Data Mining: Techniques and Tools for Data Visualization and Mining. New York: John Wiley & Sons, Inc.; 2002. 3. Spenke M. Visualization and interactive analysis of blood parameters with InfoZoom. Artificial Intelligence in Medicine 2001; 22 (2): 159–172. 4. Wang T, Plaisant C, Quinn A, Stanchak R, Shneiderman B, Murphy S. Aligning Temporal Data by Sentinel Events: Discovering Patterns in Electronic Health Records. SIGCHI Conference on Human Factors in Computing Systems, 2008. 5. Chittaro L, Combi C, Trapasso G. Visual Data Mining of Clinical Databases: An Application to the Hemodialytic Treatment based on 3D Interactive Bar Charts. Proceedings of Visual Data Mining VDM’2002, Helsinki, Finland, 2002. 6. Aigner W, Miksch S, Müller W, Schumann H, Tominski C. Visual Methods for Analyzing TimeOriented Data. IEEE Transactions on Visualization and Computer Graphics 2008; 14 (1): 47–60. 7. Shahar Y. A framework for knowledge-based temporal abstraction. Artificial Intelligence 1997; 90 (1–2). 8. Stein A, Shahar Y, Musen M. Knowledge Acquisition for Temporal Abstraction. 1996 AMIA Annual Fall Symposium, Washington, D.C. Published in 1996. 9. Chakravarty S, Shahar Y. Acquisition and Analysis of Repeating Patterns in Time-oriented Clinical Data. Methods Inf Med 2001; 40 (5): 410–420.
10. Shahar Y, Goren-Bar D, Boaz D, Tahan G. Distributed, intelligent, interactive visualization and exploration of time-oriented clinical data and their abstractions. Artificial Intelligence in Medicine 2006; 38 (2): 115–135. 11. Martins S, Shahar Y, Goren-Bar D, Galperin M, Kaizer H, Basso LV, McNaughton D, Goldstein MK. Evaluation of an architecture for intelligent query and exploration of time-oriented clinical data. Artificial Intelligence in Medicine 2008; 4 (3): 17–34. 12. Klimov D, Shahar Y. Intelligent querying and exploration of multiple time-oriented medical records. MEDINFO Annu Symp Proc 2007; 12 (2): 1314–1318. 13. Klimov D, Shahar Y. A Framework for Intelligent Visualization of Multiple Time-Oriented Medical Records. AMIA Annu Symp Proc 2005. pp 405–409. 14. Falkman G. Information visualisation in clinical Odontology: multidimensional analysis and interactive data exploration. Artificial Intelligence in Medicine 2001; 22 (2): 133–158. 15. Shneiderman B, Plaisant C. Designing the user interface: strategies for effective human-computer interaction. 4th edition. Addison Wesley; March 2004. 16. Brooke J. SUS: a "quick and dirty" usability scale. In: Jordan PW, Thomas B, Weerdmeester BA, McClelland AL, editors. Usability Evaluation in Industry. Taylor and Francis; 1996. 17. Friedman C, Wyatt J. Evaluation Methods in Medical Informatics. New York: Springer; 1997. 18. Moskovitch R, Shahar Y. Temporal Data Mining Based on Temporal Abstractions. ICDM-05 workshop on Temporal Data Mining, Houston, US; 2005.
Original Articles
Estimation of Patient Accrual Rates in Clinical Trials Based on Routine Data from Hospital Information Systems
M. Dugas1; S. Amler1; M. Lange2; J. Gerß1; B. Breil1; W. Köpcke1
1Department of Medical Informatics and Biomathematics, University of Münster, Münster, Germany; 2IT Centre, Universitätsklinikum Münster, Münster, Germany
Keywords Patient accrual rate, hospital information system, clinical trial
Summary Background: Delayed patient recruitment is a common problem in clinical trials. According to the literature, only about a third of medical research studies recruit their planned number of patients within the time originally specified. Objectives: To provide a method to estimate patient accrual rates in clinical trials based on routine data from hospital information systems (HIS). Methods: Based on inclusion and exclusion criteria for each trial, a specific HIS report is generated to list potential trial subjects. Because not all information relevant for the assessment of patient eligibility is available as coded HIS items, a sample of this patient list is reviewed manually by study physicians. Proportions of matching and non-matching patients are analyzed with a Chi-squared test. An estimation formula for the patient accrual rate is derived from these data. Results: The method is demonstrated with two datasets from cardiology and oncology. HIS reports should account for previous disease episodes and eliminate duplicate persons. Conclusion: HIS data in combination with manual chart review can be applied to estimate patient recruitment for clinical trials.

Methods Inf Med 2009; 48: 263–266 doi: 10.3414/ME0582 received: June 4, 2008 accepted: November 26, 2008 prepublished: March 31, 2009

Correspondence to: Prof. Dr. Martin Dugas Department of Medical Informatics and Biomathematics University of Münster Domagkstraße 5 48149 Münster Germany E-mail: [email protected]

Introduction Delays in patient recruitment are a common problem in clinical trials. Charlson [1] analyzed trials listed in the 1979 inventory of the National Institutes of Health. He found that only 14 of 38 (37%) trials reached planned recruitment. Twenty-three years later, a review of 114 trials between 1994 and 2003 held by the Medical Research Council and Health Technology Assessment Programmes found that less than one-third recruited their original target within the time originally specified [2]. There is a variety of reasons, such as fewer patients eligible than expected, staff problems, limited funding, complexity of trial design, length of the recruitment procedure, and others. A recent Cochrane review [3] analyzed strategies to improve recruitment to research studies. Monetary incentives, an additional questionnaire at invitation, and treatment information on the consent form demonstrated benefit; the authors concluded that these specific interventions from individual trials are not easily generalizable. Therefore, from a methodological point of view, methods are needed to estimate patient accrual rates in clinical trials more precisely.

Hospital information systems (HIS) contain data items which are relevant for the inclusion and exclusion of patients in clinical trials. For instance, diagnosis information is coded in the HIS for billing purposes, but can also be analyzed to screen for potential trial subjects [4]. However, electronic patient records contain a lot of unstructured text information; automated data analysis therefore has limitations, and expert review of records is needed to assess patient eligibility. In this context, we propose a method to estimate patient accrual rates based on HIS reports in combination with manual review of a sample of HIS records.
Methods Because not all information relevant for the assessment of patient eligibility is available as coded HIS data items, a two-stage process to estimate patient accrual rates is applied: First, a list of matching patients is generated with a specific HIS report for a given time span T (for instance, T = [January 1, 2007; December 31, 2007]). Second, a sample of these patient records is reviewed manually by an expert to assess eligibility and thereby estimate the patient accrual rate. HIS reports are database queries which can be generated using the reporting tools of the HIS (HIS report generator) or by data queries from a data warehouse. These reports can access all structured data elements within the HIS. Typical examples of HIS data items are admission and discharge diagnoses (primary as well as secondary diagnoses, coded according to the International Classification of Diseases), patient age, patient gender, and routine lab values. Depending on the inclusion and exclusion criteria of each trial, all suitable HIS items should be considered for this HIS report to provide high recall and precision.
HIS documentation is focused on a "case", i.e. a certain episode of care in a hospital with related clinical and administrative data, whereas trials address individual patients. For this reason, HIS reports for patient accrual should analyze all HIS cases of a patient to avoid duplicate persons and to account for pre-existing diseases. We propose a stepwise approach for those HIS reports: First, select all HIS cases matching the inclusion and exclusion criteria; second, remove duplicate persons; third, identify all HIS cases for each matching patient and retrieve data on pre-existing diseases to check the inclusion and exclusion criteria for each patient. For instance, many trials recruit patients at initial diagnosis; it therefore needs to be verified whether this diagnosis was already established in the past. The output of this report should be pseudonymized to protect patient data. The number of patients on this HIS report for time span T is denoted as nT. Under the assumptions that the average patient accrual rate does not change over time and that the HIS report identifies exactly all eligible patients, the estimated patient accrual rate would be nT/|T|, where |T| denotes the length of time span T. However, typically only a subset of the information required for inclusion and exclusion is available as coded HIS data items. Therefore only a subset of the nT patients matches all inclusion and exclusion criteria for a specific trial. A manual expert review of a sample of sT patient records from the HIS report results in mT matching patients.
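A minimal pandas sketch of this case-to-patient aggregation is shown below. The table layout and column names (patient_id, case_id, icd_code, admission_date) are invented for the illustration and do not correspond to any particular HIS schema; the inclusion criterion (a first-ever I48 diagnosis within the reporting period) is likewise only an example.

import pandas as pd

# Invented extract of HIS cases: one row per case, several cases per patient.
cases = pd.DataFrame({
    "patient_id":     [101, 101, 102, 103, 103, 104],
    "case_id":        [1, 2, 3, 4, 5, 6],
    "icd_code":       ["I48.0", "I10", "I48.11", "I48.0", "I48.0", "E11"],
    "admission_date": pd.to_datetime([
        "2006-11-02", "2007-03-10", "2007-04-21",
        "2005-06-30", "2007-02-14", "2007-05-01",
    ]),
})

period_start, period_end = pd.Timestamp("2007-01-01"), pd.Timestamp("2007-12-31")
target = cases["icd_code"].str.startswith("I48")

# Step 1: cases matching the coded inclusion criterion within time span T.
in_period = cases[target & cases["admission_date"].between(period_start, period_end)]

# Step 2: remove duplicate persons (one candidate row per patient).
candidates = in_period.sort_values("admission_date").drop_duplicates("patient_id")

# Step 3: look at all cases of each candidate to account for pre-existing disease,
# i.e. exclude patients who already had a matching diagnosis before the period.
first_diagnosis = cases[target].groupby("patient_id")["admission_date"].min()
newly_diagnosed = candidates[
    candidates["patient_id"].map(first_diagnosis) >= period_start
]
print(newly_diagnosed[["patient_id", "case_id", "icd_code"]])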
Manual review of HIS patient records requires access to identifiable patient data and therefore needs to be compliant with data protection laws. Physicians with direct involvement in patient care are allowed to access the records of their patients. These physicians therefore receive a list of pseudonyms from the HIS report, which enables them to access those patient records. For each pseudonym they report whether the patient is eligible for the trial, without disclosure of the person's identity. In general, data access policies must be approved by the responsible data protection officer. Before the patient accrual rate is estimated, we propose to assess whether the probability of HIS patients actually matching the trial is constant over time. To this end, the numbers of matching and non-matching patients are tabulated in a contingency table for a set of predefined sub-intervals t of time span T. Our null hypothesis states that the proportion of matching patients among all reviewed sample patients, mt/st, is constant for all sub-intervals t; it is tested by Pearson's Chi-squared test. If the null hypothesis is not rejected (p > 0.05), we conclude that the probability of HIS patients matching the trial is constant in time and estimate the patient accrual rate (PAR) in the total time span T as follows:

PAR = (mT / sT) * (nT / |T|)    (1)
A confidence interval for the expected PAR can be calculated according to Clopper and Pearson [6], as implemented in the R function binom.test [5]. Specifically, we assume a fixed rate nT/|T|. This rate is multiplied by the calculated confidence interval of the probability of HIS patients actually matching the trial.
Table 1 A HIS report generates monthly lists of potential trial patients for an atrial fibrillation trial (second column). Experts manually reviewed the medical records of these persons and identified matching patients for the trial (third column). Overall, 304 of 544 (56%) HIS report patients were suitable for the trial.

Month | Number of patients in HIS report per month (nt = st) | Number of matching patients from manual expert review per month (mt)
November 2007 | 79 | 71
December 2007 | 60 | 55
January 2008 | 76 | 62
February 2008 | 90 | 71
March 2008 | 70 | 21
April 2008 | 96 | 21
May 2008 | 73 | 3
Total | nT = sT = 544 | mT = 304
Results We use datasets from ongoing Münster atrial fibrillation trials [7] and leukemia trials [8, 9] to demonstrate this method of patient accrual rate estimation. A HIS-based notification system generated HIS reports for study physicians, who manually reviewed patient records to assess trial eligibility [4].
Example 1: Atrial Fibrillation Trial
Table 1 presents the number of patients identified by a HIS report and the number of matching patients identified by manual expert review. The HIS report queried the diagnosis codes I48.11 and I48.0 for the department of cardiology. In this example, all patients listed on the report were analyzed manually, i.e. sT = nT. Within seven months (November 2007 to May 2008), 544 patients were found in the HIS report; all these patients were reviewed manually and 304 matching patients were found, i.e. T = [November 2007; May 2008], nT = 544, mT = 304, sT = 544. When looking at the data values of Table 1, it is striking that the number of matching patients is very low in March, April and May 2008. Pearson's Chi-squared test comparing the proportions mt/(st – mt) across the sub-intervals t results in a highly significant p-value (p < 2.2E-16); therefore our estimation formula (1) cannot be applied in this example.
Example 2: Leukemia Trial
In analogy to Example 1, Table 2 presents the number of patients identified by HIS report-1 and the associated number of matching patients. This report queried the diagnosis codes C92.0-, C92.00 and C92.01 for the department of oncology. Again, all patients were analyzed manually. Within six months (April 2008 to September 2008), 283 patients were listed in HIS report-1.
Twenty-eight matching patients were identified by manual review, i.e. T = [April 2008; September 2008], nT = 283, mT = 28, sT = 283. Pearson's Chi-squared test comparing the proportions mt/(st – mt) across the sub-intervals t results in a non-significant p-value (p = 0.60); therefore our estimation formula (1) can be applied. Formula (1) yields an estimated patient accrual rate PAR = 4.67/month with a 95% confidence interval of (3.15/month; 6.59/month). When comparing Table 1 and Table 2, it is striking that the overall proportion of matching patients is much lower in Table 2. Therefore we applied an improved HIS report-2 which eliminated persons with previous leukemia episodes as well as duplicate persons (Table 3). With this improved report, nT was reduced (nT = sT = 53) for the same number of matching patients (mT = 28), i.e. 53% of the HIS report-2 patients were suitable for the trial. Again, Pearson's Chi-squared test comparing the proportions mt/(st – mt) across the sub-intervals t results in a non-significant p-value (p = 0.13); therefore our estimation formula (1) can be applied. Formula (1) yields an estimated patient accrual rate PAR = 4.67/month with a 95% confidence interval of (3.41/month; 5.89/month).
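As a hedged illustration of the two-stage estimate, the following Python sketch recomputes formula (1) and its Clopper-Pearson confidence interval from the monthly counts of HIS report-2 (Table 3), using scipy and statsmodels instead of the R binom.test call mentioned in the Methods section. Under these assumptions it should approximately reproduce the PAR of 4.67/month and the interval (3.41; 5.89) reported above, up to rounding.

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

# Monthly counts from HIS report-2 (Table 3): reviewed patients and matches.
s_t = np.array([6, 10, 13, 6, 11, 7])   # patients listed/reviewed per month
m_t = np.array([2, 5, 5, 6, 5, 5])      # matching patients per month
months = len(s_t)                        # |T| = 6 months

# Check whether the matching probability is constant over the sub-intervals.
table = np.vstack([m_t, s_t - m_t])      # matching vs. non-matching per month
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"Chi-squared test: p = {p_value:.2f}")

# Formula (1): PAR = (mT / sT) * (nT / |T|); here sT = nT.
m_T, s_T = m_t.sum(), s_t.sum()
n_T = s_T
par = (m_T / s_T) * (n_T / months)

# Clopper-Pearson interval for mT/sT, scaled by the fixed rate nT/|T|.
low, high = proportion_confint(m_T, s_T, alpha=0.05, method="beta")
print(f"PAR = {par:.2f}/month, 95% CI ({low * n_T / months:.2f}; {high * n_T / months:.2f})")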
Table 2 Leukemia trial. HIS report-1 selects potential trial patients based on ICD codes (second column). Matching patients were identified by manual review of medical records (third column). Overall, only 28 of 283 (9.9%) HIS report-1 patients were suitable for the trial.

Month | Number of patients in HIS report-1 per month (nt = st) | Number of matching patients from manual expert review per month (mt)
April 2008 | 49 | 2
May 2008 | 30 | 5
June 2008 | 52 | 5
July 2008 | 63 | 6
August 2008 | 47 | 5
September 2008 | 42 | 5
Total | nT = sT = 283 | mT = 28

Table 3 Leukemia trial. In contrast to Table 2, HIS report-2 eliminates persons with previous leukemia episodes as well as duplicate persons (second column). Matching patients were identified by manual review of medical records (third column, same as in Table 2). Overall, 28 of 53 (53%) HIS report-2 patients were suitable for the trial.

Month | Number of patients in HIS report-2 per month (nt = st) | Number of matching patients from manual expert review per month (mt)
April 2008 | 6 | 2
May 2008 | 10 | 5
June 2008 | 13 | 5
July 2008 | 6 | 6
August 2008 | 11 | 5
September 2008 | 7 | 5
Total | nT = sT = 53 | mT = 28

Discussion
In Germany and many other countries, electronic HIS are available in almost all hospitals. Initially, they were implemented for administrative purposes (billing, DRG system), but in recent years more and more clinical information has become available in these systems. Due to deficiencies in data monitoring and software validation they are at present not suited for the documentation of clinical trials, but they contain relevant information, such as diagnosis codes, which can be used to support patient recruitment [4]. Estimation of realistic patient accrual rates is important for the planning of clinical trials, but quite difficult. The phenomenon that patient recruitment often takes much more time than investigators expected is called "Lasagna's Law" [10] (after Louis Lasagna, clinical pharmacologist and investigator of the placebo response). Collins [11] wrote about "fantasy and reality" of patient recruitment and concluded: "we cannot overemphasize the importance of paying adequate attention to sample size calculations and patient recruitment during the planning process. A sample size that is too small may turn a potentially important study into one that is indecisive or even an utter failure". There is a lot of evidence that many clinical trials have fallen behind their recruitment objectives [1, 2]. Data monitoring committees must frequently decide about actions in trials with lower-than-expected accrual [12]. Carter [13] stated that "the most complicated aspect pertaining to the estimation of accrual periods is the determination of the expected rate". HIS statistics can be used to estimate annual case numbers for a specific disease. However, this approach lacks precision, because due to specific inclusion and exclusion criteria only a subset of these patients is
eligible for a certain trial. Depending on these criteria, the rate of suitable patients within a certain disease may vary considerably. For this reason we combine a HIS report with manual expert review of patient records to estimate possible accrual rates more precisely. Manual chart review is labor-intensive; especially when nT is large, analysis of a sample sT (sT << nT) may provide an acceptable estimation of the accrual rate. Our second example (Table 2) demonstrates that simple HIS statistics such as annual case numbers for a certain disease can substantially overestimate patient accrual rates. HIS are case-centric, not patient-centric: for example, an AML patient with six chemotherapy cycles may be represented in six HIS cases. Therefore it is necessary to design HIS
reports that aggregate information on several cases by patient. Appropriate HIS reports for clinical data retrieval are non-trivial, in particular if temporal relations are taken into account [14]. Table 3 presents the output of a more complex HIS report, which provides more accurate data. Patient lists from HIS reports can be pseudonymized easily, for instance by using reference numbers instead of patient names. In contrast, manual expert review of individual patient records implies access to identifiable data, because these patient charts contain a large proportion of unstructured text elements (e.g. physician letters). Therefore data protection rules need to be applied. According to German law, a physician at a university hospital who is involved in the care of a specific patient is allowed to analyze these data for scientific purposes. In general, the data access policy for patient data needs to be approved by the responsible data protection officer. Our approach does not take into account that only a subset of eligible patients decide to participate in a trial. This rate also depends on many factors which are difficult to quantify, such as the assessment of risks and benefits, the motivation and beliefs of the physician, as well as the organizational infrastructure of a specific trial. Our method depends on the quality of HIS data. Complete and correct HIS data are needed for precise accrual rate estimations. Diagnosis codes play a key role for the inclusion and exclusion of patients in clinical trials. The same data are very relevant for billing purposes in a DRG system, therefore – at least in Germany – these data items are monitored intensively by physicians, hospital administration and health insurances. For various reasons – for instance changes in healthcare service structures or disease incidence – patient accrual rates can change over time. Therefore we propose to assess the proportions of matching and non-matching patients over time by means of Pearson's Chi-squared test. Our first dataset was not appropriate for patient accrual rate estimation because of outliers, while our second example was suitable for this procedure. Secondary use of routine HIS data for scientific purposes is becoming attractive, because the systems are maturing and the amount of available data is growing over time. A recent study reports that a large proportion of the data for trials can be derived from routine data [15], leading to the visionary concept of "single source", i.e. using HIS data directly for clinical trials [16].
Conclusion HIS-based estimation of patient accrual rates is feasible and should be applied to improve planning of clinical trials.
References 1. Charlson ME, Horwitz RI. Applying results of randomised trials to clinical practice: impact of losses before randomisation. Br Med J (Clin Res Ed) 1984; 289 (6454): 1281–1284. 2. Campbell MK, Snowdon C, Francis D, Elbourne D, McDonald AM, Knight R, Entwistle V, Garcia J, Roberts I, Grant A. Recruitment to randomised trials: strategies for trial enrolment and participation study. The STEPS study. Health Technol Assess 2007; 11 (48). 3. Mapstone J, Elbourne D, Roberts. Strategies to improve recruitment to research studies (Review). Cochrane Database Syst Rev 2007; (2): MR000013. 4. Dugas M, Lange M, Berdel WE, Müller-Tidow C. Workflow to improve patient recruitment for clinical trials within hospital information systems – a case-study. Trials 2008; 9: 2. 5. R: A language and environment for statistical computing. http://www.R-project.org. 6. Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 1934; 26: 404–413. 7. Kirchhof P, Auricchio A, Bax J, Crijns H, Camm J, Diener HC, Goette A, Hindricks G, Hohnloser S, Kappenberger L, Kuck KH, Lip GY, Olsson B, Meinertz T, Priori S, Ravens U, Steinbeck G, Svernhage E, Tijssen J, Vincent A, Breithardt G. Outcome pa-
rameters for trials in atrial fibrillation: executive summary. Eur Heart J 2007; 28 (22): 2803–2817. 8. Büchner T, Hiddemann W, Berdel WE, Wörmann B, Schoch C, Fonatsch C, Löffler H, Haferlach T, Ludwig WD, Maschmeyer G, Staib P, Aul C, Gruneisen A, Lengfelder E, Frickhofen N, Kern W, Serve HL, Mesters RM, Sauerland MC, Heinecke A; German AML Cooperative Group. 6-Thioguanine, cytarabine, and daunorubicin (TAD) and highdose cytarabine and mitoxantrone (HAM) for induction, TAD for consolidation, and either prolonged maintenance by reduced monthly TAD or TAD-HAM-TAD and one course of intensive consolidation by sequential HAM in adult patients at all ages with de novo acute myeloid leukemia (AML): a randomized trial of the German AML Cooperative Group. J Clin Oncol 2003; 21 (24): 4496–4504. 9. Büchner T, Berdel WE, Schoch C, Haferlach T, Serve HL, Kienast J, Schnittger S, Kern W, Tchinda J, Reichle A, Lengfelder E, Staib P, Ludwig WD, Aul C, Eimermacher H, Balleisen L, Sauerland MC, Heinecke A, Wörmann B, Hiddemann. Double induction containing either two courses or one course of high-dose cytarabine plus mitoxantrone and postremission therapy by either autologous stem-cell transplantation or by prolonged maintenance for acute myeloid leukemia. J Clin Oncol 2006; 24 (16): 2480–2489. 10. Lasagna L. Problems in publication of clinical trial methodology. Clin Pharmacol Ther 1979; 25 (5 Pt 2): 751–753. 11. Collins JF, Williford WO, Weiss DG, Bingham SF, Klett C. Planning patient recruitment: fantasy and reality. Stat Med 1984; 3 (4): 435–443. 12. Korn EL, Simon R. Data monitoring committees and problems of lower-than-expected accrual or events rates. Control Clin Trials 1996; 17 (6): 526–535. 13. Carter RE, Sonne SC, Brady KT. Practical considerations for estimating clinical trial accrual periods: application to a multi-center effectiveness study. BMC Medical Research Methodology 2005; 5: 11. 14. Dorda W, Gall W, Duftschmid G. Clinical data retrieval: 25 years of temporal query management at the University of Vienna Medical School. Methods Inf Med 2002; 41 (2): 89–97. 15. Williams JG, Cheung WY, Cohen DR, Hutchings HA, Longo MF, Russell IT. Can randomised trials rely on existing electronic data? A feasibility study to explore the value of routine data in health technology assessment. Health Technol Assess 2003; 7 (26): iii, v-x, 1–117. 16. Kush R, Alschuler L, Ruggeri R, Cassells S, Gupta N, Bain L, Claise K, Shah M, Nahm M. Implementing Single Source: the STARBRITE proof-of-concept study. J Am Med Inform Assoc 2007; 14 (5): 662–673.
Original Articles
How Do Cancer Registries in Europe Estimate Completeness of Registration? I. Schmidtmann; M. Blettner Institute of Medical Biostatistics, Epidemiology and Informatics, University of Mainz, Mainz, Germany
Keywords Cancer registry, quality indicators, survey
Summary Objectives: Several methods for estimating completeness in cancer registries have been proposed. Little is known about their relative merits. Before embarking on a systematic comparison of methods we wanted to know which indicators were currently in use and whether there had been comparative investigations of estimation methods. Methods: We performed a survey among European cancer registries asking which methods for estimating completeness they used and whether they had performed comparisons of methods. Results: One hundred and ninety-five European cancer registries were contacted after identification using the membership directories of the European Network of Cancer Registries (ENCR) and of the International Association of Cancer Registries (IACR). Fifty-six (29%; 22%–36%) of the 195 cancer registries replied. Forty-eight (86%; 74%–94%) of these stated that they estimated completeness. Thirty-five (73%; 58%–85%) used historic comparisons, 31 (65%; 49%–78%) compared their data with a reference registry, and 28 (58%; 43%–72%) registries used the mortality incidence ratio. Capture-recapture methods were applied in only 12 (25%; 14%–40%) registries. The flow method was used by ten (21%; 10%–35%) registries. There were regional differences in the use of methods. Comparisons of methods were rare and usually restricted to the real data at hand. A systematic comparison including all indicators actually in use in cancer registries has not been reported. Conclusions: A comparison of methods under well-defined realistic conditions seems to be indicated. Unifying the methods for estimating completeness would improve the validity of comparisons between cancer registries.

Methods Inf Med 2009; 48: 267–271 doi: 10.3414/ME0559 received: April 4, 2008 accepted: September 20, 2008 prepublished: March 31, 2009

Correspondence to: Irene Schmidtmann Institute of Medical Biostatistics, Epidemiology and Informatics University of Mainz 55101 Mainz Germany E-mail: [email protected]

Introduction The purpose of population-based cancer registries is to estimate the cancer burden in their area, to observe trends and regional differences, and to provide a database for epidemiological research. Population-based cancer registries can only reach their goals when they are complete or to a large extent complete. An incomplete cancer registry is only of limited use. First, the cancer incidence is underestimated. Second, it is also possible that some subgroups of cancer patients are more likely to be registered than others, thereby misrepresenting the distribution of age or stage of disease or the regional distribution of cases. Third, trends in completeness of registration may be mistaken for trends in incidence. Researchers can assess whether cancer registry data are representative and useful for their purpose if an estimate of completeness of registration is available. Several of the German cancer registries still have to show that they operate effectively. One of the indicators used to measure effectiveness is completeness of registration. Outside Germany and the United Kingdom, not many cancer registries publish indicators for completeness regularly.
Available Methods for Estimating Completeness Completeness is defined as the proportion of diagnosed cancer cases that are registered. Several methods to assess completeness have been proposed [1]. 1. Completeness can be estimated using the proportion of cases for which the first and possibly only source of information is a death certificate (DCN cases). An estimate of completeness is then given by
where DCN denotes the proportion of DCN cases and M : I denotes the mortality to incidence ratio [1], which can be computed as the number of deaths from a particular cancer in a given year divided by the number of incident cases for the same cancer in this year. 2. Comparing current incidence rates or numbers of cases registered to appropriate numbers from the past within the same registry is termed the historic data method [1]. Observed trends can be taken into account. 3. Completeness can also be estimated by comparing figures from the cancer registry under consideration with data from a presumably complete reference registry. This can be done by simply comparing incidence rates. a) In a more rigorous approach, the number of cases can be estimated using
age-specific incidence rates of the reference registry and the demographic structure of the registry population under consideration.
b) The expected number of cases is estimated from the mortality rates in the registry population under consideration and the incidence and mortality rates in the reference registry population.
4. Capture-recapture methods have been used to estimate the completeness of disease registries [2–5], although their usefulness has been questioned [6, 7] and some problems are known [8].
5. Independent case ascertainment [1] can be regarded as a special case of capture-recapture methods. An independent database of cancer cases may be available, e.g. data collected for a clinical or epidemiological study or data obtained for administrative purposes. This database and the registry database can be linked – given that confidentiality issues do not prohibit the linkage. The proportion of cancer cases in the independent database that are also known to the cancer registry can serve as an estimate of completeness.
6. The flow method by Bullard et al. [9] uses three time-dependent probabilities: the probability that a patient survives until time t, the probability that the death certificate of a patient who dies in the interval after time t mentions cancer, and the probability that a patient survives until time t and remains unregistered. From these the completeness at any time can be estimated.
7. The method proposed by Haberland [10, 11] is based on a paper by Colonna et al. [12] in which cancer incidence rates for the whole of France were estimated from incidence rates from regional registries covering only part of the country and both national and regional mortality rates. This method is also based on mortality incidence ratios (M/I) but extends the simple rule of proportion by introducing log-linear models for trends of mortality and incidence.
The proposed methods have as prerequisites certain assumptions such as independence of sources, a constant trend, or similarity of cancer risk and prognosis in the compared regions. However, these assumptions are rarely, if ever, met in practice. Little is known about the merits and weaknesses of these procedures in realistic settings. Mattauch [13] performed a comparison of methods using data from the Münster cancer registry and found considerable differences between methods in some instances. Silcocks and Robinson [14] compared the flow method and capture-recapture methods in a simulation study with a focus on the confidence intervals obtained. They found that the point estimates for completeness differed although the width of the confidence intervals was similar. They concluded that one or possibly both approaches may not be appropriate and advocated the use of the flow method. Recently they also compared the flow method and the DCN method using Ajiki's formula [15, 16] in a simulation study [17]. They found that the flow method gave more realistic results than the DCN method, which grossly underestimated completeness.
We wanted to obtain an overview of the methods currently in use in cancer registries in Europe. The main objective of the research reported here was to find out whether cancer registries a) actually estimated completeness, b) if so, which methods they used and how frequently, and c) if not, why not. Additionally, we were interested in the software used and in modifications of published methods. A further aim was to collect information on unpublished methodological and comparative investigations. The survey was intended as a starting point for a more methodological research project. An in-depth discussion of the merits and weaknesses of the proposed estimation methods is beyond the scope of this paper. The latter is addressed in [18].

Table 1 Response rates by region

Region of registry | Responders N | Responders % | Number of registries in region
East (Belarus, Bulgaria, Czech Republic, Hungary, Poland, Republic of Moldova, Romania, Russian Federation, Slovakia, Ukraine; Armenia, Georgia, Kyrgyzstan, Ukraine) | 5 | 18 | 28
North (Channel Islands, Denmark, Estonia, Faeroe Islands, Finland, Iceland, Ireland, Isle of Man, Latvia, Lithuania, Norway, Sweden, United Kingdom of Great Britain and Northern Ireland; Bermuda) | 15 | 50 | 30
South (Albania, Andorra, Bosnia and Herzegovina, Croatia, Gibraltar, Greece, Holy See, Italy, Malta, Montenegro, Portugal, San Marino, Serbia, Slovenia, Spain, The former Yugoslav Republic of Macedonia; Cyprus, Turkey) | 9 | 13 | 69
West (Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland) | 27 | 40 | 68
All regions | 56 | 29 | 195
Methods For practical purposes the survey was restricted to European cancer registries, European being defined by membership in the ENCR or by being mentioned in the European section of the IACR [19]. A short questionnaire was developed, mainly including questions on methods used for reporting and calculating completeness. We asked about the DCN-method, historical comparison, comparison with a reference registry, independent case ascertainment, flow method [9], method based on mortality incidence ratio as described by Colonna et al. [12], and the capture-recapture method. Further methods could be specified. We also asked for the frequency of use (never, routinely, on special occasion). As several respondents ticked the boxes “routine use” or “on © Schattauer 2009
special occasion" but left the others empty, we interpreted missing values as "never". Additionally, we asked for the availability of software, the performance of method comparisons, references, contact details and interest in feedback. We first distributed the questionnaire among the 11 German population-based cancer registries. After receiving responses from the German registries we changed the layout slightly. In April 2005, we sent questionnaires to all 153 non-German cancer registries listed in [20] and asked for completion. The questionnaires were sent with a cover letter that explained the research project, using stationery of the Cancer Registry Rhineland-Palatinate. In August 2006, the list was updated using the membership directories of the ENCR [21] and the European section of the IACR [19]. We contacted all registries in the updated list containing 195 addresses, unless they had already returned the questionnaire. This reminder was sent by e-mail when an e-mail address was available, otherwise by ordinary mail. The returned questionnaires were entered into an Access database. The analysis is based on all questionnaires received up to October 2006. We computed appropriate absolute and relative frequencies. We also give 95% exact confidence intervals for relative frequencies. Statistical analysis was performed using SAS 9.1 after converting the Access database into a SAS data set. For regional analyses, the group definition by the United Nations Population Division [22] was adapted, as some of the ENCR members belong to Asia or America according to this classification. Asian countries formerly belonging to the Soviet Union are added to the Eastern Europe category, Turkey and Cyprus are added to the Southern Europe category, and Bermuda is combined with the United Kingdom (see Table 1).
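The exact confidence intervals quoted throughout the Results section are standard Clopper-Pearson intervals for binomial proportions. As a hedged cross-check of this calculation (the paper used SAS 9.1; the scipy-based sketch below is our own stand-in), the following code computes the interval for the overall response rate of 56 out of 195 registries, which should come out close to the reported 29% (22%–36%) after rounding.

from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for a proportion."""
    lower = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lower, upper

# Overall response: 56 of 195 contacted registries replied.
low, high = clopper_pearson(56, 195)
print(f"56/195 = {56 / 195:.1%}, 95% exact CI ({low:.1%}; {high:.1%})")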
Results Only 56 registries (29%; 22%–36%) returned the questionnaire. The response rate was higher in Northern and Western Europe. Forty-eight (86%; 74%–94%) of these registries stated that they estimated completeness; 8 (14%; 6%–26%) reported that they did not estimate completeness. Three of the latter deemed it unnecessary.
Table 2 Frequency of use of various methods (multiple answers possible)

Method | No information or never, N (%) | Routine use, N (%) | Special occasion, N (%) | Number of registries
Bullard | 38 (79) | 5 (10) | 5 (10) | 48
Capture/recapture | 36 (75) | 5 (10) | 7 (15) | 48
DCN method | 22 (46) | 14 (29) | 2 (4) | 48
Historical comparison | 15 (31) | 35 (73) | 0 (0) | 48
Independent case ascertainment | 28 (58) | 9 (19) | 13 (27) | 48
Mortality/incidence ratio (Colonna) | 15 (31) | 18 (38) | 10 (21) | 48
Comparison with reference registry | 19 (40) | 24 (50) | 7 (15) | 48
The following other reasons for not estimating completeness were stated by one or two registries each: takes too much time, no software available, nobody in the registry capable of doing it, incidence comparison with other registries is performed, all pathologists contribute, other priorities and limited staffing, new registry. The proportion of registries using the various methods can be seen in Table 2. Most registries reported that they used historical comparisons routinely. Many registries compared their data with a reference registry, mainly as a routine exercise. The mortality incidence ratio was a commonly used indicator, whereas independent case ascertainment and other capture-recapture methods tended to be used on special occasions. The flow method was used in only ten registries. There were regional differences in the choice of method (see Fig. 1). While most responding cancer registries in Southern and Northern Europe (n = 18, 86%; 64%–97%) reported the use of historical comparisons, only 13 (59%; 36%–79%) of the cancer registries in Western Europe did so. The DCN method seems to be popular in Eastern Europe – four of five replying registries used it – whereas in Western Europe only four (18%; 5%–40%) of the responding registries mentioned using it. The flow method and capture-recapture methods were not used in the responding Eastern European registries at all, whereas the flow method was used by six (43%; 18%–71%) of the Northern European
registries – particularly in the United Kingdom. Few registries confirmed the availability of software. Twelve (25%; 14%–40%) cancer registries stated that they had software for comparison with a reference registry, and ten (21%; 10%–35%) of the registries quoted software for historical comparisons or for the mortality/incidence ratio. Eight (17%; 7%–30%) registries had access to software for the flow method. Software for the DCN method was mentioned by six (13%; 5%–25%) registries, for independent case ascertainment and capture-recapture by four (8%; 2%–20%) and for other methods by three (6%; 1%–17%). Modifications of published methods were rarely mentioned. Two registries compared current M/I ratios with M/I ratios from previous years. One registry explicitly stated that it compared its M/I ratio with those from other registries. Estimation of completeness was mostly performed by epidemiologists (n = 30, 63%; 47%–76%) or by statisticians (n = 23, 48%; 33%–63%). Medical doctors (n = 13, 27%; 15%–42%), computer scientists (n = 10, 21%; 10%–35%) and cancer registrars (n = 7, 15%; 6%–28%) were also involved. Only 14 (29%; 17%–44%) of the cancer registries that performed estimation of completeness had ever applied more than one method to a dataset and compared the results.
Fig. 1 Proportion of cancer registries within each region using various methods of estimating completeness
When asked about publications, 21 (44%; 29%–59%) cancer registries referred to technical reports, 11 (23%; 12%–37%) mentioned publications in peer-reviewed journals and five (10%; 3%–23%) other publications. Fifteen (31%; 19%–46%) registries answered that they had not published results concerning estimation of completeness. Some registries provided useful references to their own work concerning estimation of completeness or gave information on ongoing work in this field, e.g. [14, 23–30].
Discussion Summary of Results Most of the cancer registries that completed the questionnaire estimate completeness in some way. Simple methods such as the DCN method, historical comparisons, comparison with a reference registry and the M/I ratio are preferred. Modifications or extensions of published methods are rarely reported. Methodological developments and comparisons within the cancer registries are rare, and where they have been performed they have been based on the real registry data at hand.
Limitations The generalizability of our results is limited because the response rate is low and varies between European regions. A fairly large proportion of cancer registries from Germany, the Netherlands, Scandinavia and the United Kingdom responded. These cancer registries are, however, generally reasonably well funded and cover major parts of the population. On the other hand, response rates were very low for registries in Eastern and Southern Europe, many of which are small, restricted to a small population or to a small selection of diseases, or have to function on limited funds. The more sophisticated methods tend to be used more frequently in Western and Northern Europe. It may be speculated that small registries and registries with limited resources did not respond. Such registries would presumably prefer simple methods to estimate completeness, or not estimate it at all, which would mean that the use of the more sophisticated procedures has been overestimated. Although there had been a pre-test, some questions were not understood as they had been intended. Some registries reported plans rather than the current status. From further comments given, we concluded that sometimes by “method based on M/I ratio”
only the simple M/I ratio rather than the model-based procedure was meant. The availability of software was often denied although the respective method was used. From free-text notes it may be concluded that in these cases standard software such as Excel or statistical packages such as STATA or SAS is used.
Conclusion This survey confirms the impression gained from the literature search: very different methods are in use, and there are hardly any comparative studies on the performance of indicators of completeness. The investigations found in the literature and reported in this survey apply several indicators of completeness to the same real dataset; however, it is unsatisfactory that applying several indicators may yield considerably different results [13]. A systematic comparison including all indicators actually in use in cancer registries has not yet been performed. A method comparison under realistic and well-defined conditions, extending or complementing the work by Silcocks and Robinson [14, 17], therefore seems to be indicated. Unifying the methods for estimating completeness would improve the validity of comparisons of completeness between
cancer registries. This can be targeted once the relative merits of the procedures in use have been assessed.
Acknowledgments The authors wish to thank all colleagues who took the time to complete the questionnaire. The technical assistance of Ulrike Knoll, Dagmar Lautz, and Lamia Yousif is gratefully acknowledged. We would also like to thank Claudia Spix for critically reading and commenting on an earlier draft of this paper. This work contains part of the first author’s PhD thesis under preparation.
References 1. Parkin DM, Chen VW, Ferlay J, Galceran J, Storm HH, Whelan SL. Comparability and Quality Control in Cancer Registration. Lyon: IARC Press; 1994. 2. Robles SC, Marrett LD, Clarke EA, Risch HA. An application of capture-recapture methods to the estimation of completeness of cancer registration. J Clin Epidemiol 1988; 41 (5): 495–501. 3. Brenner H, Stegmaier C, Ziegler H. Estimating completeness of cancer registration in Saarland/ Germany with capture-recapture methods. Eur J Cancer 1994; 30A (11): 1659–1663. 4. Brenner H, Stegmaier C, Ziegler H. Estimating completeness of cancer registration: an empirical evaluation of the two source capture-recapture approach in Germany. J Epidemiol Community Health 1995; 49 (4): 426–430. 5. Berghold A, Stronegger WJ, Wernecke KD. A model and application for estimating completeness of registration. Methods Inf Med 2001; 40 (2): 122–126. 6. Schouten LJ, Straatman H, Kiemeney LA, Gimbrere CH, Verbeek AL. The capture-recapture method for estimation of cancer registry completeness: a useful tool? Int J Epidemiol 1994; 23 (6): 1111–1116. 7. Tilling K. Capture-recapture methods – useful or misleading? Int J Epidemiol 2001; 30 (1): 12–14. 8. Brenner H. Application of capture-recapture methods for disease monitoring: potential effects of
imperfect record linkage. Methods Inf Med 1994; 33 (5): 502–506. 9. Bullard J, Coleman MP, Robinson D, Lutz JM, Bell J, Peto J. Completeness of cancer registration: a new method for routine use. Br J Cancer 2000; 82 (5): 1111–1116. 10. Haberland J, Bertz J, Görsch B, Schön D. Krebsinzidenzschätzungen für Deutschland mittels log-linearer Modelle. Gesundheitswesen 2001; 63 (8–9): 556–560. 11. Haberland J, Schön D, Bertz J, Görsch B. Vollzähligkeitsschätzungen von Krebsregisterdaten in Deutschland. Bundesgesundheitsblatt, Gesundheitsforschung, Gesundheitsschutz 2003; 46 (9): 770–774. 12. Colonna M, Grosclaude P, Faivre J, Revzani A, Arveux P, Chaplain G, et al. Cancer registry data based estimation of regional cancer incidence: application to breast and colorectal cancer in French administrative regions. J Epidemiol Community Health 1999; 53 (9): 558–564. 13. Mattauch V. Strukturanalyse des Epidemiologischen Krebsregisters für den Regierungsbezirk Münster: Identifikation von Schwachstellen und Konzepte für eine kontinuierliche Qualitätssicherung. Münster: Inauguraldissertation. Medizinische Fakultät der Westfälischen Wilhelms-Universität Münster; 2005. 14. Silcocks PB, Robinson D. Completeness of ascertainment by cancer registries: putting bounds on the number of missing cases. J Public Health 2004; 26 (2): 161–167. 15. Ajiki W, Tsukuma H, Oshima A. Index for evaluating completeness of registration in populationbased cancer registries and estimation of registration rate at the Osaka Cancer Registry between 1966 and 1992 using this index (in Japanese). Nippon Koshu Eisei Zasshi 1998; 45 (10): 1011–1017. 16. Kamo K, Kaneko S, Satoh K, Yanagihara H, Mizuno S, Sobue T. A mathematical estimation of true cancer incidence using data from population-based cancer registries. Jpn J Clin Oncol 2007; 37 (2): 150–155. 17. Silcocks PB, Robinson D. Simulation modelling to validate the flow method for estimating completeness of case ascertainment by cancer registries. J Public Health 2007; 29 (4): 455–462. 18. Schmidtmann I. Estimating completeness in cancer registries – comparing capture-recapture methods in a simulation study. Biom J 2008; 50 (6): 1077–1092.
19. International Association of Cancer Registries. IACR Membership List. International Association of Cancer Registries Homepage 2006 (cited 2006 Aug 9). Available from: URL: http://www.iacr.com. fr 20. Tyczynski JE, Demaret E, Parkin DM. Standards and Guidelines for cancer registration. Lyon: IARC Press; 2003. 21. European Network of Cancer Registries. ENCR Membership List. European Network of Cancer Registries Homepage 2006 (cited 2006 Aug 9). Available from: URL: http://www.encr.com.fr/ 22. UN Population Division. Definition of Major Areas and Regions. UN Population Division Homepage 2007 (cited 2007 Aug 10). Available from: URL: http://esa.un.org/unpp/index.asp?panel=5 23. Crocetti E, Miccinesi G, Paci E, Zappa M. An application of the two-source capture-recapture method to estimate the completeness of the Tuscany Cancer Registry, Italy 4. Eur J Cancer Prev 2001; 10 (5): 417–423. 24. Schouten LJ, Hoppener P, van den Brandt PA, Knottnerus JA, Jager JJ. Completeness of cancer registration in Limburg, The Netherlands. Int J Epidemiol 1993; 22 (3): 369–376. 25. Brewster DH, Crichton J, Harvey JC, Dawson G. Completeness of case ascertainment in a Scottish regional cancer registry for the year 1992. Public Health 1997; 111 (5): 339–343. 26. Lund E. Pilot study for the evaluation of completeness of reporting to the cancer registry. Incidence of Cancer in Norway 1978. Oslo: The Cancer Registry of Norway; 1981. pp 11–14. 27. Helseth A, Langmark F, Mork SJ. Neoplasms of the central nervous system in Norway. I. Quality control of the registration in the Norwegian Cancer Registry. APMIS 1988; 96 (11): 1002–1008. 28. Mork J, Thoresen S, Faye-Lund H, Langmark F, Glattre E. Head and neck cancer in Norway. A study of the quality of the Cancer Registry of Norway’s data on head and neck cancer for the period 1953-1991. APMIS 1995; 103 (5): 375–382. 29. Harvei S, Tretli S, Langmark F. Quality of prostate cancer data in the cancer registry of Norway. Eur J Cancer 1996; 32A (1): 104–110. 30. Tingulstad S, Halvorsen T, Norstein J, Hagen B, Skjeldestad FE. Completeness and accuracy of registration of ovarian cancer in the cancer registry of Norway1. Int J Cancer 2002; 98 (6): 907–911.
© Schattauer 2009
Original Articles
A Framework for Representation and Visualization of 3D Shape Variability of Organs in an Interactive Anatomical Atlas S. Hacker; H. Handels Department of Medical Informatics, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Keywords 3D-visualization, anatomical atlas, medial representation, medical education, shape modeling, statistical shape models
Summary Objectives: Computerized anatomical 3D atlases allow interactive exploration of the human anatomy and make it easy for the user to comprehend complex 3D structures and spatial interrelationships among organs. Besides the anatomy of one reference body inter-individual shape variations of organs in a population are of interest as well. In this paper, a new framework for representation and visualization of 3D shape variability of anatomical objects within an interactive 3D atlas is presented. Methods: In the VOXEL-MAN atlases realistic 3D visualizations of organs in high quality are generated for educational purposes using volume-based object representations. We extended the volume-based representation of organs to enable the 3D visualization of organs’ shape variability in the atlas. Therefore, the volume-based representation of the inner organs in the atlas is combined with a medial
representation of organs of a population creating a compact description of shape variability. Results: With the framework developed different shape variations of an organ can be visualized within the context of a volume-based anatomical model. Using shape models of the kidney and the breathing lung as examples we demonstrate new possibilities such an approach offers for medical education. Furthermore, attributes like gender, age or pathology as well as shape attributes are assigned to each shape variant which can be used for selecting specific organs of the population. Conclusions: The inclusion of anatomical variability in a 3D interactive atlas presents considerable challenges, since such a system offers the chance to explore how anatomical structures vary in large populations, across age, gender and races, and in different disease states. The framework presented is a basis for the development of specialized variability atlases that focus e.g. on specific regions of the human body, groups of organs or specific topics of interest.
Correspondence to: Prof. Dr. Heinz Handels Department of Medical Informatics University Medical Center Hamburg-Eppendorf Martinistr. 52 20246 Hamburg Germany E-mail: [email protected]
Methods Inf Med 2009; 48: 272–281 doi: 10.3414/ME0551 received: March 7, 2008 accepted: July 9, 2008 prepublished: March 31, 2009
1. Introduction Computer-based 3D atlases of the human body have experienced a remarkable change during the last two decades. They have
evolved from simple anatomy atlases that were suited only for the interested amateur to highly professional tools for medical education and practice. Nowadays they are not only a valuable tool for learning and teaching
anatomy but also a reference used e.g. by health professionals. 3D models of the human body on which such atlases are based are typically constructed from 3D image data generated e.g. by computer tomography (CT) or magnetic resonance imaging (MRI). With the Visible Human project [1, 2] high-resolution cross-sectional photographic images became available which provide a basis for the generation of 3D models with a high degree of detail. Similar projects with photographic data sets of even higher resolutions followed, as e.g. the Chinese Visible Human [3] or the Visible Korean Human data set [4]. Using these data sets a variety of research projects for the development of 3D models and digital atlases of the human body came to be developed. Computerized anatomical atlases based on high-quality 3D models are a useful addition to printed anatomy atlases. In contrast to static knowledge representation in textbooks, they allow interactive exploration of the human anatomy and make it easier for the user to understand complex 3D structures and spatial interrelationships among organs. Besides the high quality and flexibility computer-based atlases have reached these days, they still have their limitations. In most cases 3D atlases are derived from a single individual, or a very small number of subjects, and are not representative of the human anatomy in general [5, 6]. They do not contain any information about how and to what extent anatomical structures vary between different people. Yet, in reality there are considerable differences regarding the shape and size of anatomical objects, which is partly due to natural variability, but may also be affected by factors like age, gender, ethnic background, diseases or habits. The investigation of anatomical shapes and their variability can improve the understanding of processes behind
growth, ageing or diseases and support medical diagnostics and therapy. Thus, the inclusion of variability into an anatomical atlas would be a great advancement and would be useful not only for students learning anatomy but also for medical experts. This paper presents a new approach representing inter- and intra-individual shape variability of anatomical objects in space and time within an interactive 3D atlas. The concept is explained and the possibilities of this approach are shown using shape models of the kidney and the lung as examples. Size and shape of anatomical structures and their variability have always played an important role in medical education and research. Knowledge about variability in anatomy has traditionally been transmitted by sets of illustrative examples, as in collections of pictures or preparations. Most of today’s digitized atlases which deal with anatomical variability use a similar approach like conventional atlas books, in that they usually show a number of normal and abnormal variations in schematic or photographic pictures accompanied by descriptive text. An example for such an approach is the “Multimedia Human Anatomic Variation Atlas” by Bergman et al. [7]. The integration of anatomical variability across whole populations in a 3D atlas still remains a challenging problem which is not yet generally solved. So far, most progress has been achieved for population-based atlases of the brain [8]. Here large sets of images are mapped into a common coordinate system using registration algorithms. The resulting deformation fields can then be used to create probability maps which retain information on anatomical variability. Such an approach is employed in an ongoing large-scale project of the International Consortium for Brain Mapping (ICBM) in which a probabilistic atlas of the human brain is being developed based on a large sample of normal individuals [9]. Also, a number of disease-specific atlases of the human brain have been developed using registration and warping algorithms which focus on revealing structural changes due to neurological diseases such as autism, schizophrenia, or Alzheimer’s disease [8, 10]. Variability atlases based on deformation fields of large populations are powerful research tools with a wide range of clinical and scientific applications. They contain information about positional variability at every © Schattauer 2009
voxel related to a common reference space and allow a comparative examination of the anatomy of a high number of individuals. However, as anatomical structures differ not only in shape and size but also in relative orientation and position to each other, different kinds of information are superimposed in the deformation fields. This can make it difficult to interpret the resultant probability maps. For an atlas which is focused on shape variability of individual anatomical structures rather than positional variability related to a global reference coordinate system, an organ-based approach using shape descriptors is suitable. For modeling and analysis of organ shapes a great number of shape descriptors and shape models have been proposed over the years like e.g. point distribution models [11], medial axes [12, 13], Fourier descriptors [14, 15], spherical harmonic functions [16], and active shape models [17]. Such methods have been applied mainly for model-based segmentation and classification of organ shapes. In our approach 3D shape models for representation of geometric variability are used in an interactive anatomical atlas as a new tool in medical education. To demonstrate the possibility and functionality of an interactive anatomical atlas visualizing 3D shape variations of organs we have generated statistical models of the left kidney and of the breathing lung as examples. The kidney model is based on a population of 48 kidneys and captures inter-individual variability. The lung model encodes intra-individual shape variation during the breathing cycle based on 4D CT image data acquired during free breathing. Here, a 4D dataset consists of ten 3D image datasets measured in different breathing phases. Hence, the segmented lungs of the 4D image data reflect intra-individual respiratory shape variations of the lungs, which are modeled by m-reps.
2. Material and Methods The basis of this work is the VOXEL-MAN atlas of the inner organs [5] which was developed at our department. It has been created from the photographic cross-sections and CT data of the Visible Human Male [1, 2]. A 3D model of the torso of the Visible Human has been built using color-space segmentation
and a matched volume visualization technique [18]. The model is characterized by a high level of detail and contains more than 650 3D anatomical structures. The segmented anatomical objects are textured with their original colors and their surfaces are visualized with subvoxel resolution. That way a nearly photo-realistic quality is achieved (씰Fig. 1). Since it is a volume-based model of the human body, external surfaces of an organ can be viewed as well as interior views can be generated using cutting planes. Nonsegmentable objects, like nerves and small blood vessels, were modeled artificially on the basis of landmarks present in the image volume. The spatial model is connected to a symbolic model which contains descriptive information about their anatomical structures. Interrelations between objects are described via a semantic network [19, 20]. Examples of relations are “is part of ”, “has part” or “is branching from”. The integrated knowledge base system allows a versatile object-based interaction with the spatial model, e.g. the 3D model can be interrogated or disassembled by addressing names of organs. The volume-based model allows a highly realistic visualization of anatomical structures, but it is not suitable for an efficient representation of anatomical variability based on larger populations. Therefore, we extended the VOXEL-MAN system by a shape description that is connected to the volume model. The choice of shape representation depends strongly on the type of application at hand. For modeling anatomical shapes and their variability in an interactive 3D atlas we have chosen to use the medial model representation called “m-rep” [21, 22]. We have chosen the m-rep description for the representation of shape variability within the presented framework for the following reasons: ● M-reps offer a compact description of shape. They represent characteristic shape properties in an efficient way using a limited number of parameters (compared e.g. to the high-dimensional feature space of a 3D deformation field). ● Geometric correspondence of shape variants can be established not only between surface points but also between volume points. This is an important property for our purpose as the shape description is to combine with a volume-based model. Methods Inf Med 3/2009
Fig. 1 3D models of inner organs in the VOXEL-MAN atlas [5] derived from the Visible Human data set. The representation exhibits a high degree of realism and detail. Yet, the model does not contain any information about inter- or intra-individual variability of human anatomy.
Most other shape descriptors, such as Fourier descriptors, allow the establishment of correspondence only on the object's surface.
● The model parameters invoke an intuitive understanding of an object's shape, as they quantify terms like bending, thickness or elongation. This facilitates the interpretation of shape variability and allows researchers to argue about identified shape differences in anatomically meaningful terms of organ development and deformation.
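To give a rough sense of what a "restricted number of parameters" means in practice, the following back-of-the-envelope sketch compares the parameter count of the 3 × 5 atom kidney mesh used later in this paper with that of a dense deformation field. The nine scalars per atom follow the atom parameterization described in Section 2.1 below, while the 256³ voxel grid is a hypothetical resolution chosen only for illustration.

```python
# Back-of-the-envelope comparison of parameter counts; the per-atom count
# (position 3 + radius 1 + frame 4 + object angle 1 = 9) follows the atom
# parameterization in Section 2.1, while the 256**3 voxel grid is a
# hypothetical resolution. End atoms would add one elongation parameter
# each, which is ignored here.
atoms = 3 * 5                                # kidney m-rep mesh used in this work
mrep_parameters = atoms * 9                  # = 135

voxels = 256 ** 3                            # hypothetical volume resolution
deformation_field_parameters = 3 * voxels    # one 3D displacement per voxel = 50,331,648

print(mrep_parameters, deformation_field_parameters)
```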
2.1 Shape Representation with M-rep Models M-reps were introduced by Pizer [21, 22] for modeling, visualization and analysis of 2D and 3D objects and are mainly used in the medical field. An m-rep model is a discrete skeletal representation of an object based on the medial axis as proposed by Blum [23]. The basic components of an m-rep model are the medial figures that have a single, nonbranching medial surface and are represented as a mesh or chain of medial atoms. Simple
objects like the kidney in Figure 2 can be described by a single medial figure, whereas more complex objects can be built as a collection of connected figures. In this paper only one-figure objects are addressed. The medial atoms are centers of inscribed spheres with two boundary-pointing arrows of equal length, called "spokes", at whose ends the implied boundary is to be orthogonal. A medial atom describes the local shape of the object. It is represented by the tuple m = (x, r, F, θ) that contains the atom's position x ∈ ℜ3, the local width r ∈ ℜ+ (the radius of the inscribed sphere), an orthonormal local frame F ∈ ℜ4 that describes the object's local orientation, and the object angle θ ∈ [0, π/2]. The local frame F is parameterized by (b, b⊥, n), where the vector n is normal to the medial manifold and b gives the direction in the tangent plane of the fastest narrowing of the object (Fig. 3, left). "End atoms", i.e. atoms at the outer edge of the medial mesh, have an additional parameter η ∈ ℜ+ that describes the object's local elongation at the boundary crest (Fig. 3, right). The tuples mi (i = 1, …, n; n: number of atoms in an m-rep) of all medial atoms are combined into a feature vector that represents a given organ shape and can be used for further analysis. Further details about m-reps can be found in [22]. The medial atoms provide a figural coordinate system, giving first a position on the medial sheet (u, v), second a figural side t ∈ [–1, 1], and finally a relative figural distance along the appropriate medial spoke τ ∈ [–1, 0]. Each position on the surface or in the volume of a figure, or near the figure, has assigned coordinates that describe its relative position to the medial surface spanned by the medial mesh. A precondition for a comparative examination of shape variants is the establishment of an appropriate geometric correspondence between objects. In the case of m-reps, the correspondence is defined on the basis of the figural coordinate system. Points on different shape variants with the same figural coordinates (u, v, t, τ) are considered to be correspondent. However, correspondence can only be defined between m-rep models that have the same topology, i.e. the same number of medial figures and the same mesh structure.
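As a minimal illustration of this parameterization, the sketch below defines a medial atom and derives the two implied boundary points from its spokes. It assumes the common convention that the spokes are obtained by rotating b by ±θ towards ±n (consistent with θ being the angle between b and each spoke); the class and function names are ours and are not those of the m-rep software used later.

```python
# Minimal sketch of a medial atom m = (x, r, F, theta); illustrative only,
# not the data structures of the "Pablo" software used by the authors.
from dataclasses import dataclass
import numpy as np

@dataclass
class MedialAtom:
    x: np.ndarray      # atom position in R^3
    r: float           # local width (radius of the inscribed sphere)
    b: np.ndarray      # direction of fastest narrowing in the tangent plane
    n: np.ndarray      # normal to the medial manifold
    theta: float       # object angle between b and each spoke
    eta: float = 1.0   # elongation, only meaningful for end atoms

    def spokes(self):
        """Unit spoke directions, assuming the spokes are obtained by
        rotating b by +/- theta towards the medial normal n."""
        u_plus = np.cos(self.theta) * self.b + np.sin(self.theta) * self.n
        u_minus = np.cos(self.theta) * self.b - np.sin(self.theta) * self.n
        return u_plus, u_minus

    def implied_boundary_points(self):
        """The two boundary points at the ends of the spokes."""
        u_plus, u_minus = self.spokes()
        return self.x + self.r * u_plus, self.x + self.r * u_minus

def feature_vector(atoms):
    """Concatenate the atom parameters of one m-rep into a single vector,
    as done for the subsequent statistical analysis."""
    return np.concatenate([np.concatenate([a.x, [a.r], a.b, a.n, [a.theta]])
                           for a in atoms])
```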
2.2 Modeling of Organ Variability In the framework statistical models of the left kidney and of the breathing lung have been generated as examples. The kidney model is based on a population of 48 kidneys and captures inter-individual (inter-patient) shape variability, whereas the lung model encodes intra-individual (intra-patient) shape variation during the breathing cycle. For the kidney model we have used a population of 48 left kidneys which are based on CT data of the abdomen. An m-rep model of the kidney with a single medial figure and 3 × 5 medial atoms (Fig. 2) was fitted to each of the segmented kidneys. For the fitting process a semiautomatic multistage optimization procedure [22] was carried out using the software "Pablo", which was developed by the Medical Image and Display Group at the University of North Carolina. During this process an objective function is minimized which optimizes the global fit of the model to the segmented image object considered and at the same time increases the likelihood of point-by-point model correspondence from one image object to the next. The global fit is measured by the mean squared distance between the model surface and the boundary of the segmented object. The correspondence is improved by favoring meshes with configurations near the starting configuration and by producing well-behaved meshes, i.e. meshes with relatively evenly spaced medial atoms. Further details of the optimization procedure are described in [22]. Employing this fitting procedure we obtained for each shape variant an m-rep model with the same medial structure and assumed point-by-point correspondence between the individual models. After aligning the models by translation and rotation, an average kidney was calculated on the basis of the m-rep parameters (Fig. 4). During alignment of the models no scaling was done, i.e. size differences of the individual models were maintained. Object variants can be characterized by shape and size parameters (e.g. width, bending, widening) which can be derived from the m-rep parameters.
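The following schematic sketch indicates the structure of such an objective function, combining a mean squared surface-distance term with a simple regularity penalty that favors evenly spaced atoms. The concrete distance computation, the penalty and the trade-off weight are placeholder assumptions and do not reproduce the actual optimization implemented in "Pablo" [22].

```python
# Schematic form of such a fitting objective (not the actual optimization in
# "Pablo"): an image-match term plus a regularity penalty that favors
# relatively evenly spaced medial atoms. The distance computation, the
# penalty and the weight alpha are placeholder assumptions.
import numpy as np

def mean_squared_surface_distance(model_surface_points, boundary_points):
    """Mean squared distance from sampled model surface points to the nearest
    boundary point of the segmented object (brute force for clarity)."""
    msp = np.asarray(model_surface_points)
    bp = np.asarray(boundary_points)
    d2 = ((msp[:, None, :] - bp[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean())

def irregularity_penalty(atom_positions):
    """Variance of the distances between neighbouring atoms of the mesh,
    given as an array of shape (rows, cols, 3)."""
    rows, cols, _ = atom_positions.shape
    dists = []
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                dists.append(np.linalg.norm(atom_positions[i + 1, j] - atom_positions[i, j]))
            if j + 1 < cols:
                dists.append(np.linalg.norm(atom_positions[i, j + 1] - atom_positions[i, j]))
    return float(np.var(dists))

def objective(model_surface_points, boundary_points, atom_positions, alpha=0.1):
    # alpha is an assumed trade-off weight, chosen only for illustration
    return (mean_squared_surface_distance(model_surface_points, boundary_points)
            + alpha * irregularity_penalty(atom_positions))
```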
Fig. 2 An m-rep model of a kidney (left) and the implied boundary surface (right). The model is composed of a single medial figure with a mesh of 3 x 5 medial atoms. The atom’s spokes are colored in turquoise and magenta.
For the kidney variants we have calculated 1) the length of the medial surface as a measure of size and 2) the (global) curvature of the kidneys as a description of their shape. The length of the (curved) medial surface can be derived from the atoms' positions.
The curvature is derived from the angle formed by the middle atom and the two related end atoms (in longitudinal direction). For the curvature the following measure is chosen: π – ai, where ai is the calculated angle of the kidney variant i in radians.
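A compact sketch of these two descriptors is given below; since the exact traversal of the 3 × 5 atom mesh is not specified in the text, the longitudinal chain of atom positions used for the length measure is an assumption made for illustration.

```python
# Sketch of the two descriptors used to characterize the kidney variants;
# how the 3 x 5 atom mesh is traversed for the length measure is an
# assumption made for illustration.
import numpy as np

def medial_length(longitudinal_positions):
    """Approximate length of the (curved) medial surface along a longitudinal
    chain of atom positions (an N x 3 array)."""
    segments = np.diff(np.asarray(longitudinal_positions), axis=0)
    return float(np.linalg.norm(segments, axis=1).sum())

def curvature_measure(end_atom_1, middle_atom, end_atom_2):
    """pi - a_i, where a_i is the angle (in radians) formed at the middle atom
    by the two end atoms in longitudinal direction."""
    v1 = np.asarray(end_atom_1) - np.asarray(middle_atom)
    v2 = np.asarray(end_atom_2) - np.asarray(middle_atom)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    a_i = float(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return float(np.pi - a_i)
```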
Fig. 3 Illustration of the figural coordinate system of a medial atom (left) and an end atom (right). A medial atom is represented by its position x, the length r of the two boundary-pointing arrows called “spokes”, a frame made from the unit-length vector b and the two b-orthogonal unit vectors n and b ⊥, and the object angle θ between b and each spoke. The figural sides are described by t . The end atom (right), i.e. an atom at the outer edge of the medial mesh, has an additional parameter η ∈ ℜ+ that describes the object’s local elongation at the boundary crest. Methods Inf Med 3/2009
Fig. 4 The average kidney of the population (center) and the first two modes of variation in shape
A standard technique for describing shape variability is principal component analysis (PCA) [24]. However, PCA is only applicable for parameter vectors that are elements of a Euclidean vector space and thus cannot be directly applied to m-rep models, as the m-rep parameters include angles. For this reason the
PCA has been extended by Fletcher et al. [25, 26] to principal geodesic analysis (PGA), which is valid for m-rep parameters.
Using this method we have calculated the first principal components of the population of 48 kidneys. Figure 4 shows the mean kidney and the first and second modes of shape variation. For the integration and visualization of dynamic processes in the atlas we have also built m-rep models of the right and left lobe of the breathing lung of four patients based on 4D CT data. The 4D CT datasets were acquired in different breathing phases during free breathing. An artifact-reducing reconstruction technique [27, 28] was used to generate 3D CT data at ten points in time during the breathing cycle using optical flow-based interpolation [29]. For both the right and the left lobe of the lung we used an m-rep mesh with 6 × 7 medial atoms and fitted the m-rep models to the lung for each point in time using the described optimization procedure. Figure 5 shows the m-rep model of the lung for one of the patients. For a high-quality 3D visualization of the breathing motion we interpolated the lung models in time on the basis of the m-rep parameters. That way the lung's motion during breathing can be visualized in a smooth animation. Furthermore, mean models of the breathing lung lobes as well as the first modes of their shape variations during breathing are computed using the PGA [25, 26].
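To convey the idea of a mean shape and modes of variation, the following sketch performs ordinary PCA on concatenated m-rep feature vectors. This Euclidean stand-in ignores that the m-rep parameters include rotations and is therefore only an approximation of the PGA [25, 26] actually used here; the function names are illustrative.

```python
# Euclidean PCA on concatenated m-rep feature vectors as a stand-in for the
# PGA used in the paper; this ignores the curved nature of the m-rep
# parameter space and only illustrates "mean shape plus modes of variation".
import numpy as np

def pca_modes(feature_vectors):
    """feature_vectors: array of shape (n_shapes, n_parameters), already aligned."""
    X = np.asarray(feature_vectors, dtype=float)
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = s ** 2 / (X.shape[0] - 1)
    return mean, vt, variances

def shape_along_mode(mean, vt, variances, mode=0, k=2.0):
    """Shape at k standard deviations along one principal axis,
    e.g. k in {-2, 0, +2} to reproduce a display like Figure 4."""
    return mean + k * np.sqrt(variances[mode]) * vt[mode]
```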
2.3 Integration of Shape Variability Models into the VOXEL-MAN Atlas
Fig. 5 An m-rep model of the breathing lung. Top: M-rep model of the right and left lobe of the lung at the time of maximum inspiration. Bottom: Surface representations of m-rep models of the lung at three points in time during the breathing cycle (left: maximum expiration; middle: mid-inspiration; right: maximum inspiration).
For the integration of a population of shape models into the VOXEL-MAN atlas two main steps are necessary. First, the shape models have to be positioned in the atlas in an anatomically sensible way, i.e. surrounding and connecting tissue and organs have to be taken into account. And second, a geometric correspondence between the shape models and the matching organ in the volume-based atlas, which we call reference organ, has to be established. In the case of the kidney population we chose the average kidney as a representative of the population and manually fitted it to the reference kidney of the atlas using rigid transformation, i.e. size and shape of the model were conserved. The fitting was done in a way that the concave part of the kidney, i.e. the © Schattauer 2009
Fig. 6 User interface of the extended VOXEL-MAN system. The m-rep shape description has been integrated in the atlas and allows the exploration of shape variants of an object within the context of a volume-based 3D model. Here the mean left kidney of a population has been selected and is visualized in the scene on the right hand side. The corresponding m-rep model is shown in the window in the middle of the screen.
renal hilum, was in concordance with the atlas kidney. The renal hilum is the part where ureter and renal vein and artery are connected to the kidney and thus is an important connecting piece. For the individual kidneys that had been aligned previously the same rigid transformation was used as for the average kidney. For the establishment of a geometric correspondence between the m-rep models and the volume-based reference kidney we fitted the average kidney model to the reference kidney of the Visible Human data by employing the optimization procedure described in Section 2.2. As a result we received an m-rep model of the Visible Human kidney which possesses the same medial structure as the other kidney models, so that a geometric correspondence between the m-reps can be defined. It serves as a reference model and represents the link between the volume-based organ of the Visible Human data and the shape variants represented by m-reps. Furthermore, it allows a deformation of the volume-based kidney according to various shape descriptions.
For the lung model a time point during the breathing cycle has been selected for which the shape of the lung was closest to the lung of the Visible Human data. As these data are derived from a dead person its lung is not in one of the natural breathing phases but in an anomalous position. The m-rep models of the right and the left lobe of the lung of the selected phase were fitted to the lobes of the Visible Human lung, so that a reference model of the lung was received. In the case of the lung we did not aim at visualizing the various lung shapes of different individuals but transferring the movement of the breathing lung to the lung of the Visible Human. For that we extracted the shape changes of the lungs between the successive points in time during the breathing cycle and applied them to the Visible Human lung. For a surface visualization of an m-rep model the RGB values from the photographic Visible Human data set are used, giving the organs their natural colors. For a correct color mapping the geometric correspondence between the reference model and the m-rep model of interest is utilized.
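A simplified sketch of this transfer of shape changes is shown below: the change of the m-rep parameters between two successive breathing phases is added to the reference model. Only atom positions and radii are handled, and the additive update is an assumption for illustration; the atom frames would additionally require composing rotations, which is omitted.

```python
# Simplified sketch of transferring breathing-induced shape changes to the
# reference (Visible Human) lung model: the change of the m-rep parameters
# between two successive breathing phases is added to the reference model.
# Only positions and radii are handled; frame rotations are omitted.
import numpy as np

def apply_phase_change(ref_positions, ref_radii,
                       phase_a_positions, phase_a_radii,
                       phase_b_positions, phase_b_radii):
    d_pos = np.asarray(phase_b_positions) - np.asarray(phase_a_positions)
    d_rad = np.asarray(phase_b_radii) - np.asarray(phase_a_radii)
    return np.asarray(ref_positions) + d_pos, np.asarray(ref_radii) + d_rad
```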
To obtain a volume-based representation of a shape variant represented by an m-rep model a method has been developed that allows the deformation of the volumetric model of the organ in the Visible Human data set according to the m-rep coordinates. By this transformation internal structures of the reference kidney are deformed to visualize internal structures in the m-rep models of the population. The transformation field for this deformation is determined as follows: First, for each voxel (x, y, z) belonging to the reference organ in the atlas the figural coordinates (u, v, t, τ) are calculated using the reference m-rep model of the organ. Then, the figural coordinates received are transferred to the m-rep model of the considered shape variant and its position in the atlas coordinate system is determined. After calculation of the transformation field the deformed volume has to be generated. For doing this a transition from the continuous coordinates of the transformation field to the discrete voxel coordinates of the target image is necessary which is done by a procedure called resampling. An overview of resampling methods used in the field of Methods Inf Med 3/2009
image processing is given in [27]. We have employed an inverse method, i.e. for each voxel in the target volume a transformation vector is calculated. For calculation of the RGB values in the target volume we have used tri-linear interpolation which results in a certain smoothing effect but has been shown to be sufficient for our visualization purposes. For a reconstruction of the deformed object’s surface with subvoxel resolution the described transformation is calculated not only for voxels belonging to the considered object but also for a layer of voxels surrounding the object. The transformation developed enables the visualization of surface and internal structures in high quality.
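The sketch below illustrates the inverse resampling step just described: for every voxel of the target volume an inverse transformation yields a continuous position in the reference volume, whose RGB value is obtained by tri-linear interpolation. The figural-coordinate mapping itself is abstracted into a user-supplied callable, and bounds checking is omitted; this is an illustration, not the implementation used in the atlas.

```python
# Sketch of the inverse resampling with tri-linear interpolation described
# above; `inverse_transform` stands in for the figural-coordinate mapping
# from a target voxel back into the reference (Visible Human) volume.
# No bounds checking is done, so points are assumed to lie inside the volume.
import numpy as np

def trilinear(volume, p):
    """Tri-linearly interpolate `volume` (shape Z x Y x X x 3, RGB) at point p."""
    z, y, x = p
    z0, y0, x0 = int(np.floor(z)), int(np.floor(y)), int(np.floor(x))
    dz, dy, dx = z - z0, y - y0, x - x0
    rgb = np.zeros(3)
    for kz, wz in ((z0, 1 - dz), (z0 + 1, dz)):
        for ky, wy in ((y0, 1 - dy), (y0 + 1, dy)):
            for kx, wx in ((x0, 1 - dx), (x0 + 1, dx)):
                rgb += wz * wy * wx * volume[kz, ky, kx]
    return rgb

def resample(reference_volume, target_shape, inverse_transform):
    """Inverse mapping: for each target voxel, look up its pre-image in the reference."""
    target = np.zeros(target_shape + (3,))
    for idx in np.ndindex(target_shape):
        p = inverse_transform(np.array(idx, dtype=float))
        target[idx] = trilinear(reference_volume, p)
    return target
```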
3. Results With the integration of m-rep-based shape descriptions into the VOXEL-MAN atlas it is now possible to query and visualize different shape variations of an organ within the context of a volume-based anatomical model (Fig. 6). Attributes like gender, age or pathology can be assigned to each shape variant, which can then be visualized by specifying these attributes. In addition to the representation of inter-individual shape variations (Fig. 7) or the average shape of a population (Fig. 6), intra-individual shape variations can be visualized as well, in our example lung motion induced by breathing. Shape variants can also be visualized according to shape and size parameters that have previously been derived from the m-rep parameters, and they can be arranged in a sorted sequence. Figure 8 shows an example in which the kidney variants are sorted by length of the medial surface and by curvature. The organs could also be arranged according to other attributes, such as the patient's age. That way it would be possible to visualize age-specific differences of an organ that appear in different periods of life (e.g. infancy, youth, maturity, age). For illustration of variability in shape within a population the principal components obtained from PGA are utilized (see Section 2.2). The organ shape can be visualized while moving along the first or second principal axis, i.e. the axes in shape space in the direction of the greatest variance (Fig. 4). That way an impression of the most prevalent variations in shape within a population can be gained. The shape variants may be depicted as surface models which can be rapidly calculated from the m-rep models. In this way a quick overview of the shape variability of a population of organs can be obtained. The surface models are visualized in the organs' natural colors, giving them a more realistic look (Fig. 7). However, as the surface colors were derived from the Visible Human data set, they do not reflect the original colors of the individual shape variants.
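As a toy illustration of the attribute-based selection and sorted arrangement described above, the following sketch stores a few descriptive and shape attributes per variant and filters and sorts them; the attribute names and example values are invented for illustration and do not correspond to the actual atlas interface.

```python
# Toy illustration of selecting shape variants by attributes and arranging
# them in a sorted sequence; attribute names and values are invented.
from dataclasses import dataclass

@dataclass
class ShapeVariant:
    variant_id: str
    gender: str
    age: int
    pathology: str
    medial_length_cm: float
    curvature: float

def select(variants, **criteria):
    """Return all variants whose attributes match the given criteria."""
    return [v for v in variants if all(getattr(v, k) == val for k, val in criteria.items())]

# e.g. all female variants without pathology, sorted by length of the medial surface:
# sorted(select(population, gender="female", pathology="none"),
#        key=lambda v: v.medial_length_cm)
```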
Fig. 7 Visualization of shape variants of the kidney within the VOXEL-MAN atlas. The left kidneys (in the pictures on the right hand side) are surface representations on the basis of m-rep models. Top: left kidney of the Visible Human. Bottom: a shape variant of the kidney selected out of 48 shape models available.
Besides a surface representation, a volume-based representation of an organ's shape variants can also be generated. It results from the deformation of the reference organ of the volume-based model according to the m-rep shape description. This way a realistic visualization of the shape variants is possible, as well as the visualization of the organ's internal structures. As an example, a kidney variant is shown after the application of cutting operations (Fig. 9).
4. Discussion The inclusion of anatomical variability in a 3D interactive atlas presents considerable challenges, since such a system must capture how anatomical structures vary in large populations, across age, gender and races, and in different disease states. We have presented a new approach for a variability atlas that is based on the VOXEL-MAN atlas and extends it by the m-rep shape description for modeling the variability of anatomical objects. By connecting a volume-based atlas with a skeleton-based shape description, the advantages of both methods are combined. The key advantages of the system presented are as follows: ● A volume-based model allows a highly realistic representation of the anatomy. In the VOXEL-MAN atlas anatomical structures are reconstructed with subvoxel resolution and are visualized using their natural colors. The result is a visual impression with high realism. ● The VOXEL-MAN atlas is a highly flexible tool for exploring the human anatomy. The integrated knowledge base allows a versatile interaction with the spatial model, e.g. the anatomical model can be interrogated or disassembled. Also, the model can be rotated in all directions, cutting planes can be placed, and the interior of anatomical structures can be viewed. The whole functionality of the atlas is also accessible in the extended version presented. ● The m-rep shape description allows a compact representation of shape variability, i.e. characteristic shape properties are represented with a restricted number of parameters (in contrast e.g. to deformation fields).
Fig. 8 Shape variants can be characterized by shape and size parameters and arranged in a sorted sequence. Top: kidney variants sorted by length of medial surface (in cm). Bottom: kidney variants sorted by curvature (π – ai , where ai is the angle of variant i in radians).
For a variability atlas that is designed to represent a high number of shape variants, a compact shape description is a prerequisite if the system is to run on standard PCs. We demonstrated the concept and functionality of this atlas using a statistical shape model of the kidney and a model of the breathing lung as examples. With the integration of the m-rep shape description into the VOXEL-MAN atlas it is now possible to query and visualize different shape variations of an organ, as well as the average shape and main modes of variation in a population. Yet the approach presented also has its restrictions and limitations. One major issue is the interdependency of neighboring organs or structures regarding their shape and position. If the shape of an organ changes, the surrounding structures naturally also move or deform, which has not yet been taken into account in the system presented. In our examples we have used only m-rep models consisting of a single object (the kidney) or two objects (the lobes of the lung). For modeling the dependency of neighboring organs it is necessary to build m-rep models for all organs of interest and combine them in a group. Another restriction is that the visualizations of all shape variants are based on the same photographic data set. For both the surface-based and the volume-based visualizations of shape variants the RGB values of the corresponding organ in the Visible Human data set are used, and thus they do not reflect individual differences in the color of these variants. While the visualization of shape variants in their original colors is a desirable goal, it did not seem to be a realistic one, for two reasons.
Fig. 9 Volume-based representation of shape variants of the left kidney in the VOXEL-MAN atlas. Top: reference kidney from the original volume model (Visible Human). Bottom: volume representation of a shape variant of a kidney based on an m-rep description. The cut-away gives an insight to the interior of the kidney.
First, photographic volume data are only available for a very limited number of subjects and naturally not for patients. And second, storing photographic data for each shape variant would result in a huge amount of data – which is what we wanted to avoid by choosing a compact shape description. Also, it should be obvious that the approach presented is not reasonable for pathological variants that show substantial morphological differences to a healthy kidney. However, all healthy and pathological structures and organs with similar shape can be represented in the atlas by this technique.
5. Conclusion We developed a framework for representation and visualization of shape variability within a 3D interactive anatomical atlas using an organ-based approach. It could be a basis for the development of specialized variability atlases that focus e.g. on specific regions of the human body, groups of organs or specific topics of interest. If filled with appropriate data such an atlas might be able to answer questions like the following: ● What range of variations of organs is considered “normal”? ● What are the effects of disease on shape and size of anatomical structures? ● How do organs change during growth and under normal ageing? ● How is the shape of an anatomical structure related to factors like gender, habits or environmental influences? Possible applications of such variability atlases are mainly in the area of medical education but they could e.g. also serve as reference for health professionals. However, currently the use of statistical shape models in anatomical atlases is still in its infancy and there are only limited experiences available with the use of such statistical shape information in medical education. We have used our atlases of the kidney and the lung in seminars with medical students of the University Medical Center Hamburg-Eppendorf to show the range of possibilities of statistical anatomical atlases. The response was very positive, however a more detailed and extended evaluation of the educational benefits of the atlas showing shape variations of organs should be performed. Methods Inf Med 3/2009
In the current state of implementation, the Visible Human data set is used as reference data set in the extended VOXEL-MAN atlas. In principle, it is also possible to use other data sets, e.g. CT or MRI data sets, as reference data set in the framework. However, in CT or MRI data sets it is harder to segment fine anatomical structures because of the reduced image resolution and contrast in comparison to the high-resolution colored Visible Human data set. Hence, it is expected that the number of segmented structures represented in such an atlas will be strongly reduced in comparison to our currently available reference atlas. Furthermore, a complete segmentation of all relevant structures has to be performed to generate a new reference atlas and an adapted knowledge base has to be created. Hence, the effort to generate another reference data set is very high. The presented framework for representation and visualization of 3D shape variability of organs opens up new insights into the intra- and inter-individual shape variety of organs in an interactive 3D atlas. Further discussions with medical experts are needed to identify medical disciplines and applications where this new kind of information is helpful. Especially, it has to be explored how useful visualizations of statistical anatomical variants in an atlas environment are beyond the educational setting, e.g. in clinical applications and studies.
Acknowledgments We thank Stephen Pizer and the image processing group (MIDAG) of UNC in Chapel Hill for the provision of the software “Pablo” and the support in all problems and questions regarding the m-reps. We are also grateful to the members of the radiation oncology at UNC for providing the kidney data. We thank Karl-Heinz Höhne for the possibility to extend the VOXEL-MAN atlas in this project. This work is supported by the German Research Foundation (DFG).
References 1. Ackerman MJ. The Visible Human Project: A Resource for Anatomical Visualization. Medinfo 1998. pp 1030–1032. 2. Spitzer V, Ackerman M, Scherzinger A, Withlock D. The Visible Human Male: A Technical Report. J Am Med Inform Assoc 1996; 3 (2): 118–130.
3. Zhang SX, Heng PA, Liu ZJ, Tan LW, Qiu MG, Li QY, Liao RX, Li K, Cui GY, Guo YL, Yang XP, Liu GJ, Shan JL, Liu JJ, Zhang WG, Chen XH, Chen JH, Wang J, Chen W, Lu M, You J, Pang XL, Xiao H, Xie YM, Chun-Yiu Cheng J. The Chinese Visible Human (CVH) Datasets Incorporate Technical and Imaging Advances on Earlier Digital Humans. J Anat 2004; 204: 165–173. 4. Riemer M, Park JS, Chung MS, Handels H. Improving the 3D Visualization of the Visible Korean Human via Data Driven 3D Segmentation in RGB Color Space. World Congress on Medical Physics and Biomedical Engineering 2006, 14 (31). Springer, 2007. pp 4200–4203. 5. Höhne KH, Pflesser B, Pommert A, Riemer M, Schubert R, Schiemann T, Tiede U, Schumacher U. A Realistic Model of Human Structure from the Visible Human Data. Methods Inf Med 2001; 40: 83–89. 6. Structural Informatics Group. Interactive Atlases: Digital Anatomist Project. Dept. of Biological Structure, University of Washington, Seattle, WA, 2004. http://www9.biostr.washington.edu/da.html. 7. Bergman RA, Adel K, Miyauchi R. Illustrated Encyclopedia of Human Anatomic Variation. University of Iowa, 2004. http://www.anatomyatlases.org/ AnatomicVariants/AnatomyHP.shtml. 8. Toga AW, Thompson PM. Multimodal Brain Atlases. In: Wong S (ed.). Advances in Biomedical Image Databases. New York: Academic Press; 1999. 9. Mazziotta JC. A Probabilistic Atlas and Reference System for the Human Brain. In: Toga RW, Mazziotta JC. Brain Mapping – The Methods. San Diego: Academic Press; 2002. pp 727–755. 10. Thompson PM, Mega MS, Woods R, Zoumalan CI, Lindshield CJ, Blanton RE, Moussai J, Holmes CJ, Cummings JL, Toga AW. Cortical Change in Alzheimer’s Disease Detected with a Disease-specific Population-based Brain Atlas. Cereb Cortex 2001; 11 (1): 1–16. 11. Bailleul J, Ruan S, Bloyet D. Automatic atlas-based Building of Point Distribution Model for Segmentation of Anatomical Structures from Brain MRI. Proc Signal Processing and its Applications 2003; 2: 629–630. 12. Joshi S, Pizer SM, Fletcher PT, Yushkevich P, Thall A, Marron JS. Multiscale Deformable Model Segmentation and Statistical Shape Analysis using Medial Descriptions. IEEE Trans Med Imaging 2002; 21: 538–550. 13. Golland P, Grimson WEL, Kikinis R. Statistical Shape Analysis Using Fixed Topology Skeletons: Corpus Callosum Study. Information Processing in Medical Imaging (IPMI) 1999. pp 382–387. 14. Staib L, Duncan J. Boundary Finding with Parametrically Models. IEEE PAMI 1992; 14 (11): 1061–1075. 15. Kelemen A, Szekely G, Gerig G. Elastic Model-Based Segmentation of 3D Neurological Data Sets. IEEE Trans Med Imaging 1999; 18: 828–839. 16. Styner M, Gerig G. Hybrid Boundary-medial Shape Description for Biologically Variable Shapes. Proc IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA) 2000. pp 235–242. 17. Cootes TF, Taylor CJ. Active Shape Models – “Smart Snakes”. In: Hogg et al. (eds.). BMVC92. Proceedings of the British Machine Vision Conference. Berlin: Springer-Verlag; 1992. pp 266–275.
18. Tiede U, Schiemann T, Höhne KH. High Quality Rendering of Attributed Volume Data. In: Ebert D, Hagen H, Rushmeier H (eds.). Proc. IEEE Visualization 1998. Research Triangle Park, NC; 1998. pp 255–262. 19. Pommert A, Schubert R, Riemer M, Schiemann T, Höhne KH. Symbolic Modeling of Human Anatomy for Visualization and Simulation. In: Robb RA (ed.). Visualization in Biomedical Computing 1994. Rochester; 1994. pp 412–423. 20. Höhne KH, Pflesser B, Pommert A, Riemer M, Schubert R, Tiede U. A New Representation of Knowledge Concerning Human Anatomy and Function. Nature Med 1995; 1 (6): 506–511. 21. Pizer SM, Fritsch D, Yushkevich P, Johnson V, Chaney E. Segmentation, Registration, and Measurement of Shape Variation via Image Object Shape. IEEE Trans Med Imaging 1999; 18: 851–865.
22. Pizer SM, Fletcher PT, Joshi SC, Stough J, Thall A, Chen JZ, Fridman Y, Fritsch DS, Gash G, Glotzer JM, Jiroutek MR, Lu C, Muller KE, Tracton G, Yushkevich PA, Chaney E L. Deformable M-reps for 3D Medical Image Segmentation. Int J Comp Vis 2003; 55: 85–106. 23. Blum TO. A Transformation for Extracting New Descriptors of Shape. In: Wathen-Dunn (ed.). Models for the Perception of Speech and Visual Form. Cambridge, MA: MIT Press; 1967. pp 362–380. 24. Jollife IT. Principal Component Analysis. SpringerVerlag; 1986. 25. Fletcher PT, Lu C, Joshi S. Statistics of Shape via Principal Component Analysis on Lie Groups. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2003. pp 95–101. 26. Fletcher PT, Joshi S, Lu C Pizer SM. Gaussian Distribution on Lie Groups and their Application to
Statistical Shape Analysis. In: Taylor C, Noble JA (eds.), Information Processing in Medical Imaging (IPMI) 2003. pp 450–462. 27. Ehrhardt J, Werner R, Frenzel T, Säring D, Lu W, Low D, Handels H. Optical Flow based Method for Improved Reconstruction of 4D CT Data Sets Acquired During Free Breathing. Med Phys 2007; 34 (2): 711–721. 28. Werner R, Ehrhardt J, Frenzel T, Säring D, Lu W, Low D, Handels H. Motion Artifact Reducing Reconstruction of 4D CT Image Data for the Analysis of Respiratory Dynamics. Methods Inf Med 2007; 46: 254–260. 29. Ehrhardt J, Säring D, Handels H. Structure-preserving Interpolation of Temporal and Spatial Image Sequences using an Optical Flow-based Method. Methods Inf Med 2007; 46: 300–307.
© Schattauer 2009
Original Articles
Efficiency of CYP2C9 Genetic Test Representation for Automated Pharmacogenetic Decision Support V. G. Deshmukh1; M. A. Hoffman2; C. Arnoldi2; B. E. Bray1; J. A. Mitchell1 1University of Utah, School of Medicine, Department of Biomedical Informatics, Salt Lake City, UT, USA; 2Cerner Corporation, Kansas City, MO, USA
Keywords Pharmacogenetics, clinical decision support systems, SNP, allele
Summary Objectives: We investigated the suitability of representing discrete genetic test results in the electronic health record (EHR) as individual single nucleotide polymorphisms (SNPs) and as alleles, using the CYP2C9 gene and its polymorphic states, as part of a pilot study. The purpose of our investigation was to determine the appropriate level of data abstraction when reporting genetic test results in the EHR that would allow meaningful interpretation and clinical decision support based on current knowledge, while retaining sufficient information in order to enable reinterpretation of the results in the context of future discoveries. Methods: Based on the SNP & allele models, we designed two separate lab panels within the laboratory information system, one con-
taining SNPs and the other containing alleles, built separate rules in the clinical decision support system based on each model, and evaluated the performance of these rules in an EHR simulation environment using real-world scenarios. Results: Although decision-support rules based on the allele model required significantly less computational time than rules based on the SNP model, no difference was observed in the total time taken to chart medication orders between rules based on these two models. Conclusions: Both SNP- and allele-based models can be used effectively for representing genetic test results in the EHR without impacting clinical decision support systems. While storing and reporting genetic test results as alleles allows for the construction of simpler decision-support rules and makes it easier to present these results to clinicians, the SNP-based model can retain a greater amount of information that could be useful for future reinterpretation.
Correspondence to: Vikrant G. Deshmukh, M.Sc., M.S. University of Utah, School of Medicine Department of Biomedical Informatics 26 South 2000 East Room 5775 Salt Lake City, UT 84112 USA E-mail: [email protected]
Methods Inf Med 2009; 48: 282–290 doi: 10.3414/ME0570 received: May 4, 2008 accepted: November 11, 2008 prepublished: March 31, 2009
1. Introduction With the completion of the human genome project [1], and continued discoveries in the genetics of human diseases, a number of molecular-genetic tests have now become available [2], and the integration of these tests in clinical care will be a major step toward delivering personalized medicine [3, 4]. The storage, retrieval and reporting of genetic test results, as well as
their integration within the clinical environment pose unique challenges [5], and the HL7 [6] clinical genomics standard (CGS) has been proposed to address such issues [7]. One of the strengths of the CGS model is the ‘encapsulate and bubble-up’ approach, in which raw genomic data are reported along with the genetic test results, while additional interpretations can bubble-up as new knowledge and data become available [8].
The interpretation of existing genetic data within the context of emerging scientific knowledge would require data abstraction at various levels such as single nucleotide polymorphisms (SNPs), alleles, haplotypes, etc., in accordance with the corresponding discoveries. While genetic data can be represented using Bioinformatics Sequence Markup Language (BSML) [9], clinical findings are typically stored using SNOMED CT [10] and LOINC [11], and although the latter two vocabularies could also be used for reporting results of genetic tests, they lack the granularity required to describe genetic findings in sufficient detail that would allow meaningful clinical inference [12]. The Clinical Bioinformatics Ontology (CBO) is a semantically structured controlled medical vocabulary that enables the standardized reporting of clinical molecular diagnostics in a consistent, machine-readable format. The CBO contains manually curated, pre-coordinated concepts that have been annotated with orthogonal mapping to bioinformatics databases in the form of facets [13], and is therefore suitable for the generation of executable knowledge needed for performing advanced queries, inference logic and knowledge discovery in the area of clinical molecular diagnostics [12, 13]. An allele is one of many alternate forms of a gene that occupies a given locus on a chromosome, and can consist of one or more polymorphisms in a gene relative to a reference sequence. An alternate form of a gene consisting of more than one SNP can be efficiently described using an allele name (or symbol) in lieu of a list of the constituent polymorphisms. The Cytochrome P-450 (CYP) enzyme system is part of a common metabolic pathway for several important classes of drugs; and differences in drug metabolism have been attributed to various SNPs [14–16] (e.g.: CYP2C9 1076 C > T, 1188delA, etc.), which can also be described
by their representative alleles [17, 18] (e.g.: CYP2C9*2, *3, etc.). Warfarin, a commonly prescribed oral anticoagulant, is a vitamin-K antagonist [19] metabolized by CYP2C9 [20]; it has a narrow therapeutic index and significant inter-patient variability in dose response, which has contributed to its underutilization [21]. The variability in dose response to Warfarin has been partially explained by SNPs in the CYP2C9 and VKORC1 [22] genes, and the United States Food and Drug Administration's (FDA) new labeling on all Warfarin products [23] underscores the importance of pharmacogenetics in general, and of this use case in particular. With the availability of genetic tests for the CYP2C9 and VKORC1 genes, dosing algorithms that incorporate the results of these two tests have become available [24], and their implications for Warfarin dosing may be conceptually represented by applying the hierarchical knowledge model [25] (Fig. 1). In Figure 1, the results of molecular assays used to detect known SNPs [14] constitute raw data, which may be represented by the corresponding alleles [18] that subsume the SNPs (information), which may then be interpreted according to the expected phenotype [15] as slow metabolizers of Warfarin (knowledge), which may be understood by clinicians as an elevated risk of bleeding complications [26] during Warfarin therapy (understanding); clinicians may then adjust the Warfarin dosage [24] by using a combination of dosing nomograms, pharmacogenetics, the physiological condition of the patient, and their experience in treating patients with similar conditions (wisdom). Alternatively, genetic findings may be reported with recommendations for dose adjustment, which could provide clinical context for the results.
Although the above knowledge model may work well in the short term, with increasing use and evolving knowledge of the implications of these test results in clinical practice, the complexity of clinical information is likely to increase [27]. In addition, the difficulty of navigating the information sources that enable genotype-to-phenotype translation of such knowledge into clinical practice [28] necessitates the use of clinical decision support systems (CDSS) at the point of care [29, 30]. CDSS require that the genetic test results, as well as the interpretations reported in the Laboratory Information System (LIS) and the EHR, be in a discrete, concise and machine-readable format. Further, with the increasing availability of genetic testing in external reference labs [2] (or potentially direct-to-consumer genetic testing facilities [31, 32]), results of these genetic tests may not have been entered in the same LIS as the rest of the patient's results, in which case these genetic test results would have to be collected as part of the history & physical (H&P) examination. However, having the H&P as plain text would negate the benefit of a point-of-care CDSS, particularly for genetic test results, which tend to be more complex than other clinical findings [27]; it would therefore become necessary to capture any genetic information provided by the patient in a discrete, coded format, rather than as a textual narrative. In the absence of an appropriate level of data abstraction, the sheer amount and complexity of information generated by genetic testing have the potential to overwhelm existing EHR systems as well as clinical end-users. In the present work, we investigate the suitability of reporting genetic data in the EHR at the level of SNPs and alleles by comparing these two data models from the perspective of clinical decision support, reporting within the EHR, and suitability for integrating future discoveries.

Fig. 1 Conceptual levels of data abstraction: A set of SNPs can be considered data, the allele containing the SNPs as information, the resulting slow-metabolizer phenotype as knowledge, the implications for bleeding complications as understanding, and the overall need to adjust Warfarin dosage based on all of the above as wisdom.
2. Methods
Our pilot project involved the CYP2C9 gene and the corresponding alleles and SNPs known to have clinical significance in Warfarin therapy. The initial prototyping was performed in a simulation environment at Cerner Corporation headquarters in Kansas City, MO, and the decision-support component was then reconstructed in a live clinical system environment at the University of Utah Hospital, Salt Lake City, UT. Although all software testing was performed using Cerner software, our methods are generalizable and can be evaluated using any EHR system that integrates a point-of-care clinical decision support system (CDSS), independent of the underlying computational environments, databases, etc. (the environments in which we prototyped and tested have different underlying hardware and software architectures). All the decision support rules used in our study are available for download as standard Arden syntax files at http://informatics.bmi.utah.edu/cyp2c9/suppl/.
Fig. 2 Block diagram of information architecture: The above schematic shows a simplified, scaled version of the various components in our EHR system. The EHR system contains several integrated modules, and these communicate with other components and with the database through a middleware layer. The CDSS component resides in the middleware layer, and is available within most components of the EHR.
Table 1 CYP2C9 alleles, SNPs and lab panels (adapted from the Human Cytochrome P450 (CYP) Allele Nomenclature Committee's website) [18]

Allele | Allele subtype | SNP: cDNA | SNP: Genomic DNA
*1 | *1A | None | None
*1 | *1B | | 2665_2664delTG; 1188T>C
*1 | *1C | | 1188T>C
*1 | *1D | | 2665_2664delTG
*2 | *2A | 430C>T | 1188T>C; 1096A>G; 620G>T; 485T>A; 484C>A; 3608C>T
*2 | *2B | | 2665_2664delTG; 1188T>C; 1096A>G; 620G>T; 485T>A; 484C>A; 3608C>T
*2 | *2C | | 1096A>G; 620G>T; 485T>A; 484C>A; 3608C>T
*3 | *3A | 1075A>C | 1911T>C; 1885C>G; 1537G>A; 981G>A; 42614A>C
*3 | *3B | | 1911T>C; 1885C>G; 1537G>A; 1188T>C; 981G>A; 42614A>C
*4 | | 1076T>C | 42615T>C
*5 | | 1080C>G | 42619C>G
*6 | | 818delA | 10601delA
The latest version of the CBO is available for download at http://www.clinbioinformatics.org.
The Cerner® Millennium® platform consists of several modules that serve different functions, leveraging a common application and database infrastructure. The block diagram in Figure 2 shows a simplified, scaled-down schematic of the components of the EHR that are relevant to the present work. Laboratory orders can be placed in the CPOE module of the EHR or in the LIS modules, and the results of lab tests can be charted in the LIS. One difference between the EHR environments used during prototyping and testing was that our testing environment at the University of Utah Hospital receives lab results over an HL7 interface, whereas the prototyping environment at Cerner Corporation shared a common database with the LIS module through the EHR middleware. Regardless of the setup, however, once charted, the results are automatically sent to the EHR application, where they appear under the lab results section in the patient's chart. The same genetic results can also be charted as discrete patient-care documentation within the EHR itself, in the clinical documentation module, which posts these results to the same place in the database through the middleware; this is important when considering scenarios in which genetic test results are provided directly by the patients themselves, as results from other independent labs [2], during their history and physical examination. The integrated, point-of-care CDSS module in the EHR, which operates within the middleware layer, is able to consume the results and communicate the alerts to several different applications, regardless of the application or method by which the results were posted to the EHR, making our methods even more generalizable. Medication orders can be placed in the EHR or in a separate pharmacy module, and the clinical decision support system runs in the background in both of these modules.
During prototyping at Cerner Corporation, we developed two lab panels in the LIS (Table 1): a panel for reporting test results as alleles (allele panel) and another panel for reporting test results as SNPs (SNP panel). Discrete genetic test results within each of these lab panels were mapped to their corresponding pre-coordinated concepts within the CBO, using the CYP2C9*2A allele as an example (Table 2). For example, the definition of the CBO allele concept CYP2C9.0004 contains several individual SNPs represented by the respective CBO concepts, as well as the synonym CYP2C9*2A, which represents the allele.
Within the allele panel, the concept representing CYP2C9*2A was used as-is for storing one discrete data element, whereas in the SNP panel, the individual concepts representing the SNPs contained in this concept each had their own separate, discrete data element. The LIS also allowed automated interpretation of genetic results as 'homozygous normal', 'heterozygous affected' and 'homozygous affected', based on the results entered for each copy of the allele or SNP in the corresponding lab panels. Upon electronically signing the results charted in a given lab panel for a test patient, these results and their automated interpretations were posted to the EHR. During testing at the University of Utah, the same genetic test panels were recreated within the discrete clinical documentation module of the EHR itself, and results were charted using that module.
Decision-support rules based on each data model were built in the CDSS module, using the logic illustrated in Figure 3A. In addition to the individual rules based on the allele and SNP models, two generic rules were created: one triggered upon adding a medication to the list of medication orders (rule 'A'), and another triggered upon actually signing the medication order (rule 'D'). The individual rules based on the allele and SNP models (rules 'B' and 'C', respectively) were set to trigger in response to the placing of Warfarin orders, and the rule-evaluation criteria were slightly different for each rule (Fig. 3B): the allele rule checked for lab values indicating the CYP2C9 *2, *3, or *6 alleles (Table 1, column 1), while the SNP rule checked for all the corresponding SNPs for each of these alleles (Table 1, columns 3/4). The execution order of these rules was set so that, during the process of adding Warfarin to a patient's list of medication orders, the first rule that fired was the generic rule 'A', followed by either 'B' or 'C' depending on whether we were testing the allele model or the SNP model, and finally rule 'D'. The common element within each of these rules was an action that created a database time-stamp with millisecond precision, so that the differences between the three time-stamps in each test case would give the actual amount of time needed to evaluate the rule, the time taken to respond to the medication alert (Fig. 4), and the total time needed to complete individual orders.
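To make the contrast in rule-evaluation criteria concrete, the following minimal sketch (written in Python rather than the Arden syntax used for the actual rules) shows how an allele-based check and a SNP-based check over charted discrete results might differ; the result encoding and the reduced SNP list are illustrative assumptions, not the production rule logic.

```python
# Minimal sketch of the two rule-evaluation strategies (not the production
# Arden syntax rules). A patient's charted results are modelled as a dict
# mapping (test code, copy number) to 'Present' or 'Absent'.

RISK_ALLELES = {"CYP2C9*2", "CYP2C9*3", "CYP2C9*6"}          # Table 1, column 1

# Reduced, illustrative subset of the SNPs that define the risk alleles (Table 1).
RISK_SNPS = {"CYP2C9 430C>T", "CYP2C9 1075A>C", "CYP2C9 818delA"}

def allele_rule(results: dict) -> bool:
    """Rule 'B': fire if any copy of a risk allele is reported Present."""
    return any(results.get((allele, copy)) == "Present"
               for allele in RISK_ALLELES for copy in (1, 2))

def snp_rule(results: dict) -> bool:
    """Rule 'C': fire if any copy of a SNP belonging to a risk allele is Present."""
    return any(results.get((snp, copy)) == "Present"
               for snp in RISK_SNPS for copy in (1, 2))

# Example: a test patient heterozygous for CYP2C9*2, charted via the allele panel.
allele_results = {("CYP2C9*2", 1): "Present", ("CYP2C9*2", 2): "Absent"}
print(allele_rule(allele_results))   # True -> a Warfarin order would trigger the alert
```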
Table 2 Concepts and relationships in the Clinical Bioinformatics Ontology using the CYP2C9*2A allele as an example [12]

CBO Concept 1 | Relationship | CBO Concept 2
Human Allele | Subsumes | CYP2C9.0004
CYP2C9.0004 | Synonym | CYP2C9*2A
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-1188T>C
CYP2C9.0004 | Has constituent variant | CYP2C9.c.430C>T
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-1096A>G
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-620G>T
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-485T>A
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-484C>A
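Purely as an illustration of how such pre-coordinated concepts can be consumed by software, the sketch below holds the CYP2C9.0004 concept of Table 2 in a small data structure; the class and field names are ours and do not correspond to the CBO distribution format or the Cerner data model.

```python
from dataclasses import dataclass, field

@dataclass
class CboConcept:
    """Toy container for a pre-coordinated CBO concept (names are illustrative)."""
    code: str
    synonyms: list = field(default_factory=list)
    constituent_variants: list = field(default_factory=list)  # 'has constituent variant'

# CYP2C9.0004 as described in Table 2: synonym CYP2C9*2A plus its constituent SNPs.
cyp2c9_0004 = CboConcept(
    code="CYP2C9.0004",
    synonyms=["CYP2C9*2A"],
    constituent_variants=[
        "CYP2C9.c.-1188T>C", "CYP2C9.c.430C>T", "CYP2C9.c.-1096A>G",
        "CYP2C9.c.-620G>T", "CYP2C9.c.-485T>A", "CYP2C9.c.-484C>A",
    ],
)

# The allele panel stores one discrete element (the concept itself), whereas the
# SNP panel stores one discrete element per constituent variant.
print(len(cyp2c9_0004.constituent_variants))  # 6
```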
Fig. 3 Testing methodology. A: The overall testing method showing the rules being triggered upon placing an order for Warfarin on a test patient on whom CYP2C9 Allele/SNP results were available; B: Overall differences in the rule evaluation logic for the Allele & SNP rule.
Fig. 4 Medication alert: the alert triggered by adding Warfarin to the list of medication orders for a test patient with CYP2C9 allele/SNP results
The CDSS rules 'B' and 'C' were tested in isolation from one another by enabling one while the other was disabled, using test patients for whom the corresponding SNPs or alleles were reported as positive through the LIS or the clinical documentation modules. This was done to prevent test conditions in which both rules could be triggered simultaneously and potentially affect each other, so that the rules always fired in the order A-B-D for the allele model and A-C-D for the SNP model. The same medication order can be placed through either the inpatient pharmacy module or the Computerized Provider Order Entry (CPOE) module, and either route allows the rule to execute. The CPOE module is integrated tightly with the main EHR application, and the drug-gene interaction alerts were primarily intended to be seen by physicians at the point where they would place the orders; thus, for our testing, we chose to place these orders using the CPOE module of the EHR system rather than the pharmacy module. Each order for Warfarin placed in this manner was later discontinued, and the process was repeated so that there were no active orders for Warfarin on the given test patient at the time of placing another order. This was done to avoid triggering other existing error-checking mechanisms, such as therapeutic duplication checking and dose-range checking, which could have introduced confounders by interfering with rule execution. The execution of the rule in each case generated a popup alert (Fig. 4) indicating a medication warning due to an underlying genetic condition, with options to accept or cancel the order or ignore the recommendation, and a link to additional information describing the importance of these genotypes in Warfarin therapy [26]. Although it was also possible to make numeric dosing recommendations based on pharmacogenetic data [24], this was not included in our tests in order to minimize confounding. Over 50 orders were placed to evaluate each rule in this manner for Warfarin on each test patient, with the rule triggering on every event, and the rule execution times were recorded as time-stamps in the EHR database using database triggers. To further minimize confounding due to differences in system utilization during testing, the simulations were performed during the same time periods of the day to account for system load. Other aspects of reporting genetic test results in the EHR, such as the formats for reporting results and interpretations to clinicians, were also considered, although a formal evaluation was not performed as part of the present study.

Fig. 5 Rule execution times: Rule execution times were measured as the difference between the database time-stamps on adding Warfarin to the list of the test patient's medication orders, and the time needed to complete evaluation of the rule logic.
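The timing protocol just described amounts to a simple loop; the schematic sketch below uses invented helper names (place_order, respond_to_alert, sign_order, discontinue_order) to stand in for the interactive EHR steps and is not the actual test harness.

```python
import time

def run_rule_timing_trial(place_order, respond_to_alert, sign_order,
                          discontinue_order, n_trials: int = 50):
    """Schematic version of the timing protocol: for each trial, place a
    Warfarin order, let the CDSS alert fire, sign the order, then discontinue
    it so the next trial starts with no active Warfarin order."""
    timings = []
    for _ in range(n_trials):
        t_added = time.time()          # corresponds to rule 'A': medication added
        alert = place_order("Warfarin")
        t_rule = time.time()           # corresponds to rule 'B' or 'C': allele/SNP rule evaluated
        respond_to_alert(alert)
        sign_order()
        t_signed = time.time()         # corresponds to rule 'D': order signed
        discontinue_order("Warfarin")  # avoid duplicate-therapy checks in the next trial
        timings.append({"rule_execution": t_rule - t_added,
                        "reaction_time": t_signed - t_rule,
                        "total_time": t_signed - t_added})
    return timings
```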
3. Results
3.1 Rule Execution Times
The rule execution times were determined by examining the differences in database time-stamps from the point of rule triggering to the completion of rule-logic evaluation and the generation of an alert in the EHR. The results for rules based on both the allele and SNP models are plotted in Figure 5. These measurements were logged in milliseconds, and the average rule execution times were 25.06 ms (n = 50) for the allele rule and 57.64 ms (n = 50) for the SNP rule (Table 3). Using a two-tailed Student t-test of two samples assuming equal variance, the p-value was 6.14679E-71 (α = 0.05, df = 98); thus, there was a significant difference between the mean rule-evaluation times for the allele model and the SNP model.
3.2 Times Required for Completing Medication Orders
When testing the decision support rules based on each model, the rules were executed in a predetermined sequence (A-B-D or A-C-D): the first rule was triggered on adding any medication to the list of medication orders, followed by either the allele- or SNP-based rule, and finally the rule triggered upon signing the medication order. The difference in time-stamps between the time Warfarin was added to the list of medication orders (rule A) and the time the order was signed (rule D) was the total medication order time, measured in seconds; the mean times for the two models were 6.603 s for the allele model and 6.578 s for the SNP model (Table 3). Using a two-tailed Student t-test of two samples assuming equal variance, the p-value was 0.904 (α = 0.05, df = 98); thus, there was no significant difference between the mean total times required for signing medication orders in the two experiments. Similarly, the difference in time-stamps between the time that the allele- or SNP-based rule was triggered (rule B or C) and the time that the medication order was signed (rule D) was the reaction time to the alert generated by the EHR, measured in seconds. The mean times for the two models were 6.578 s for the allele model and 6.520 s for the SNP model (Table 3). Using a two-tailed Student t-test of two samples assuming equal variance, the p-value was 0.784 (α = 0.05, df = 98); thus, there was no significant difference between the mean reaction times in responding to the alerts in the EHR in the two experiments.
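As a rough consistency check, the pooled-variance two-sample t-statistic can be recomputed directly from the summary statistics reported in Table 3; the minimal sketch below does this for the rule execution times and recovers a value close to the reported 49.245 with 98 degrees of freedom.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample Student t-statistic assuming equal variances (as in the text)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / se, n1 + n2 - 2   # t-statistic and degrees of freedom

# Rule execution times from Table 3: allele 25.06 +/- 2.431 ms, SNP 57.64 +/- 3.997 ms.
t, df = pooled_t(25.06, 2.431, 50, 57.64, 3.997, 50)
print(round(t, 2), df)   # ~49.24, 98 -- matching the reported 49.245 with df = 98
```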
3.3 Reporting Genetic Results in the EHR
Genetic test results for both panels were reported under the category of 'labs' within the EHR. For the allele panel (Table 1), 12 discrete results were generated per panel ordered: one result per copy of each allele (e.g., Copy 1: CYP2C9*2 'Present'; Copy 2: CYP2C9*2 'Absent'), i.e., two copies of each of the six alleles. The report for the SNP panel (Table 1) was much more extensive due to the number of SNPs, with one discrete result per copy of each SNP (e.g., Copy 1: CYP2C9 1096A>G 'Absent'; Copy 2: CYP2C9 1096A>G 'Present'), thus generating about 32 discrete genetic test results in the EHR per test order.
Table 3 Differences in rule execution times, reaction times and total times

Time | Sample size | Mean | Standard deviation | Degrees of freedom | Two-tailed t-statistic (tc = 1.96) | p-value (α = 0.05)
Allele rule execution | 50 | 25.06 ms | 2.431 ms | 98 | 49.245 | 6.15E-71
SNP rule execution | 50 | 57.64 ms | 3.997 ms | 98 | 49.245 | 6.15E-71
Allele alert reaction time | 50 | 6.578 s | 0.908 s | 98 | 0.275 | 0.784
SNP alert reaction time | 50 | 6.520 s | 1.182 s | 98 | 0.275 | 0.784
Allele total time | 50 | 6.604 s | 0.909 s | 98 | 0.120 | 0.904
SNP total time | 50 | 6.578 s | 1.182 s | 98 | 0.120 | 0.904

4. Discussion
The integration of genetic data in clinical care is an important step toward delivering personalized medicine, and point-of-care CDSS that enable recommendations based on these data will serve as an important means of realizing this goal. Given the inherent complexity of genetic data and the need for concise, human-interpretable guidelines at the point of care, it will be necessary to present these data in a form that can be consumed by front-line clinicians, and it will therefore be necessary to abstract these data. However, each level of data abstraction, going from the complete DNA sequence onward to SNPs, alleles, haplotypes, etc., comes with trade-offs in terms of current versus future usability of the findings, performance of the CDSS, and loss of information that may be important for secondary use. In the present study, we have considered one such scenario by comparing two data models for reporting genetic data in the EHR, based on SNPs and alleles, and have considered some of the potential implications of choosing either of these data models for CDSS involving genetic data at the point of care, by tackling a real-world clinical problem (CYP2C9 polymorphisms and Warfarin dosing) in a real-world EHR environment.

4.1 Rule Execution Times
From the CDSS database time-stamps, it was estimated that the average rule-execution time for the allele model was 25.06 ms, while that for the SNP model was 57.64 ms (Table 3); both were within acceptable limits for interactive software applications and would not have a negative impact on end-user applications if these rules were triggered in isolation.
In spite of the significant difference in rule execution times, the differences between the total medication order times as well as the reaction times to the EHR alerts for the allele and SNP models were not significant, since these two times, measured in seconds, differed from the rule execution times by at least two orders of magnitude. In other words, within our experiments there was ample room to accommodate more complex CDSS rules before they could have a noticeable impact on the end-user applications. However, in a clinical system with multiple CDSS rules being evaluated on multiple patients, the overall rule execution times could still vary and possibly impact the performance and responsiveness of the front-end applications. Some of these potential performance issues could be addressed by consolidating two or more CDSS rules into a single, faster rule, but such an approach could create new problems by adding to rule complexity and making such rules harder to maintain over time.
4.2 Rule Complexity
With the growing complexity of knowledge in the genetics of human diseases, CDSS rules will also tend to become more complex, and this complexity was readily apparent in the differences in rule logic between the SNP and allele models (Fig. 3B). The complexity of CDSS rules can be addressed by constructing an executable knowledge base of SNPs, alleles, haplotypes, etc., together with the relevant clinical effects, interpretations and recommendations, so that pharmacogenetic decision support could be driven by stored, updateable knowledge instead of hard-coded logic such as that used in our rules. The PharmGKB, a publicly available, searchable online resource, is one such pharmacogenetic knowledge base [33] that contains current information on the relationships between drugs, diseases and genes, but in order to be used as a knowledge base for CDSS rules, the rules based on these relationships will themselves have to be formalized and stored in an executable format. Molecular-genetic vocabularies such as the CBO, which contain pre-coordinated concepts for genetic findings (e.g., the CBO concept CYP2C9.c.430C>T implies a change
from Cytosine to Thymine at position 430 in the cDNA of CYP2C9 gene) will have to be combined with other clinical vocabularies such as SNOMED CT [10] to adequately describe the effect as well as the recommendations that will be required for enabling pharmacogenetic decision support at the level necessary for formalizing these rules. However, at present, most clinical vocabularies lack the granularity and coverage needed to describe the effects and interpretations of molecular-genetic tests as well as recommendations based on these results, and may require considerable improvement before they can be used for effectively representing these rules in an executable knowledge base.
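As a purely illustrative sketch of the knowledge-driven alternative discussed above, drug-gene relationships could be held as updateable data and interpreted by a single generic rule; the structure, field names and recommendation text below are our assumptions, not the PharmGKB schema or an existing CDSS rule format.

```python
# Illustrative, data-driven alternative to hard-coded rule logic: the drug-gene
# knowledge lives in an updateable table, and one generic rule interprets it.
PHARMACOGENETIC_KB = {
    "warfarin": {
        "gene": "CYP2C9",
        "risk_alleles": ["CYP2C9*2", "CYP2C9*3", "CYP2C9*6"],
        "recommendation": "Consider genotype-guided dosing; monitor closely.",  # invented text
    },
    # Adding knowledge for another drug would not require writing a new rule.
}

def generic_drug_gene_rule(drug: str, patient_alleles: set):
    entry = PHARMACOGENETIC_KB.get(drug.lower())
    if entry and patient_alleles & set(entry["risk_alleles"]):
        return entry["recommendation"]     # alert text shown to the clinician
    return None                            # no alert

print(generic_drug_gene_rule("Warfarin", {"CYP2C9*1", "CYP2C9*3"}))
```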
4.3 Precision of Allele Assignment
With DNA sequencing becoming the gold standard for genetic information, any other form of capturing SNP findings is subject to imprecision. For instance, an allele-specific PCR panel could generate SNP results that are interpreted as a specific allele; subsequent DNA sequencing could identify a SNP that was not specifically targeted in the initial panel and require correction of the allele assignment. Therefore, it is important to consistently document the method used to generate an allele assignment. A vocabulary concept describing a finding that represents an allele can be semantically related to other concepts that represent the corresponding SNPs for that allele. Using allele concepts for reporting the results of genetic tests does not exclude the possibility of other SNPs being detected by a more comprehensive assay, but the burden falls on the LIS to be configured to accurately describe the methodology. Some allele terminologies imply phylogenetic relationships between allele names, which presents a problem of precision, and accurate descriptions of these methodologies therefore become important. For example, the subtypes of the CYP2C9*2 allele, *2A and *2B, are identical from a functional perspective, causing the same overall change in the cDNA (Table 1). However, the *2B allele includes a few more SNPs in addition to those found in *2A, and reporting the result as only the relevant allele, *2, would lead to a loss of this information, which may be important to retain in the context of future discoveries. Further, allele nomenclature itself is not consistent across different genes, and it may be more suitable to report all individual SNPs within the LIS and the EHR in order to retain the maximum amount of information. The CBO addresses these concerns by creating unique concepts for each allele regardless of functional significance and through the use of a phylogenetically neutral naming convention that is consistent for all genes represented (modeled after the allele convention utilized by OMIM).
In the light of recent discoveries [34] in the genetics of human diseases, it is important to retain as much information as possible about both the findings and the methods used to generate those findings. Although the allele model allows the construction of simpler CDSS rules (Fig. 3B), it requires clear documentation of the method in order to prevent loss of information that may be useful for 'bubbling up' interpretations of existing data in the future. The HL7 CGS approach has provisions for capturing all possible genetic information for the explicit purpose of allowing future reinterpretation; however, like a majority of present-day EHRs, the EHR system in question is built around a database-driven architecture and does not implement HL7 information models directly. Using an ontology like the CBO, which is structured around biological observations, in conjunction with another codification system (e.g., LOINC) to describe the actual method used to collect such observations, could possibly reduce such information loss.
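The precision issue can be illustrated with a toy matcher over allele definitions abbreviated from Table 1: if an assay does not interrogate the promoter deletion, *2A and *2B cannot be distinguished. The function below is only a sketch of this idea, not a validated genotype-calling procedure.

```python
# Abbreviated allele definitions (subset of Table 1). *2B contains every SNP of
# *2A plus the 2665_2664delTG promoter variant.
ALLELE_DEFINITIONS = {
    "CYP2C9*2A": {"430C>T", "1188T>C", "1096A>G", "620G>T", "485T>A", "484C>A"},
    "CYP2C9*2B": {"430C>T", "1188T>C", "1096A>G", "620G>T", "485T>A", "484C>A",
                  "2665_2664delTG"},
}

def compatible_alleles(observed_snps: set, assayed_snps: set):
    """Return alleles that cannot be excluded given which SNPs were assayed."""
    candidates = []
    for allele, definition in ALLELE_DEFINITIONS.items():
        testable = definition & assayed_snps
        if testable <= observed_snps:          # every assayed defining SNP was seen
            candidates.append(allele)
    return candidates

# A panel that does not interrogate 2665_2664delTG cannot separate *2A from *2B.
panel = {"430C>T", "1188T>C", "1096A>G", "620G>T", "485T>A", "484C>A"}
print(compatible_alleles(panel, assayed_snps=panel))   # ['CYP2C9*2A', 'CYP2C9*2B']
```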
4.4 Reporting Genetic Results
Reporting discrete genetic test results in the EHR could pose some unique problems for clinicians. Unlike other lab results such as the International Normalized Ratio (INR), which can be interpreted directly within the context provided by normal ranges and have traditionally been part of clinicians' training, genetic test results such as those obtained during CYP2C9 testing may require clinical recommendations in addition to the results themselves. This becomes particularly evident when considering the SNP panel in our simulations, where 32 discrete results could be reported as part of a single assay, compared to 12 for the allele panel. In each of these cases, the only clinically relevant piece of information is that having certain SNPs would predispose the patient to slower metabolism of Warfarin, increasing the risk of bleeding complications during anticoagulation therapy (Fig. 1) and thereby necessitating dose adjustments. In the light of constantly improving molecular-genetic diagnostic methods, it is important to retain as much information as possible, so that future re-interpretations of existing results can be performed in the proper context of the sensitivity and specificity of these assays. However, presenting all these discrete results may not be of much direct value to clinicians at the point of care, and may even be counter-productive, whereas presenting the same results as a decision-support alert (Fig. 4), together with a suitable recommendation at the point of ordering Warfarin, may be far more desirable.
4.5 Limitations
Although we considered the CYP2C9 gene and its implications in anticoagulation therapy, the CYP allele nomenclature itself is unique in some ways, and genetic variations are frequently described by haplotypes rather than alleles, as is the case with the VKORC1 gene; a haplotype would then subsume the constituent SNPs in a manner similar to the alleles described in this work. However, with regard to differences in the complexity of rule design for CDSS, and to scenarios involving loss of information depending on the level of data abstraction, it is still possible to generalize these findings. Scenarios involving multiple genes, alleles and haplotypes were not considered in the present study; these could further add to the complexity of the CDSS rules considered in the evaluation of the two models.
5. Conclusions
The present work represents one of the first efforts at exploring the real-world application of genetic data in the EHR using decision support, and the issues we have considered represent a few among the myriad of questions that will arise from the increased use of genetic data in clinical care in the future. We evaluated two data models for CDSS rules in the EHR on the basis of their performance, complexity, loss of information and reporting within the EHR. Although there was a significant difference between the computational times needed for evaluating rules based on the allele model and the SNP model, this difference, being on the order of milliseconds, did not translate into a significant difference in the time taken to place a Warfarin order. CDSS rules based on the SNP data model are inherently complex and will be difficult to maintain with the continuous addition of new knowledge in this domain. Although the allele model allowed for simpler clinical decision support and clinical reporting, maintaining an allele nomenclature locally can be a challenge over time; the issue is further complicated by incorrect assignment of allele concepts during system implementations. At the present time, due to the lack of a pharmacogenetic knowledge base containing rules and recommendations in an executable, machine-readable format, as well as of a consortium of experts to maintain such a resource, it may be necessary to hard-code many of the decision-support rules involving pharmacogenetic data, thus necessitating the abstraction of genetic test results for use in EHRs; the appropriate level of data abstraction will ultimately have to be decided on a per-gene basis.
Acknowledgments This research was supported by an education and travel award from the Cerner Corporation to the University of Utah Department of Biomedical Informatics, and by the University of Utah Health Sciences Information Technology Services. The authors are also grateful to Lisa Prins from the University of Utah Hospital, and to Nick Smith, Scott Haven and Ginger Kuhns from Cerner Corporation.
References
1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004; 431 (7011): 931–945. 2. GeneTests.org at the University of Washington. http://www.genetests.org/. 1–15–2007.
3. Personalised medicines: hopes and realities. http://www.royalsoc.ac.uk/displaypagedoc.asp?id =15874. 9–1–2005. The Royal Society (UK). 4. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature 2003; 422 (6934): 835–847. 5. Mitchell DR, Mitchell JA. Status of clinical gene sequencing data reporting and associated risks for information loss. J Biomed Inform 2007; 40 (1): 47–54. 6. Health Level 7. http://www.hl7.org. 3–2–2007. 7. Shabo A. Introduction to the Clinical Genomics Specifications and Documentation of the Genotype Topic. HL7 Clinical Genomics SIG DSTU Update 2 (Genotype topic update 1)[V0.9]. 11–5–2006. HL7.org. 8. Shabo A, Dotan D. The seventh layer of the clinicalgenomics information infrastructure. IBM Systems Journal 2007; 46 (1): 57–67. International Business Machines Corporation. 9. Bioinformatic Sequence Markup Language. http://www.bsml.org. 3–2–2007. 10. SNOMED CT. http://www.snomed.org/snomedct/. 3–2–2007. 11. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clin Chem 2003; 49 (4): 624–633. 12. Hoffman M, Arnoldi C, Chuang I. The clinical bioinformatics ontology: a curated semantic network utilizing RefSeq information. Pac Symp Biocomput 2005; 139–150. 13. Clinical Bioinformatics Ontology Whitepaper. https://www.clinbioinformatics.org/cbopublic/. 2004. Cerner Corporation. 14. Furuya H, Fernandez-Salguero P, Gregory W, et al. Genetic polymorphism of CYP2C9 and its effect on warfarin maintenance dose requirement in patients undergoing anticoagulation therapy. Pharmacogenetics 1995; 5 (6): 389–392. 15. Rettie AE, Wienkers LC, Gonzalez FJ, Trager WF, Korzekwa KR. Impaired (S)-warfarin metabolism catalysed by the R144C allelic variant of CYP2C9. Pharmacogenetics 1994; 4 (1): 39–42. 16. Takahashi H, Echizen H. Pharmacogenetics of CYP2C9 and interindividual variability in anticoagulant response to warfarin. Pharmacogenomics J 2003; 3 (4): 202–214. 17. Stubbins MJ, Harries LW, Smith G, Tarbit MH, Wolf CR. Genetic analysis of the human cytochrome P450 CYP2C9 locus. Pharmacogenetics 1996; 6 (5): 429–439. 18. Oscarson M, Ingelman-Sundberg M. CYP alleles: a web page for nomenclature of human cytochrome P450 alleles. Drug Metab Pharmacokinet 2002; 17 (6): 491–495. 19. Bell RG, Sadowski JA, Matschiner JT. Mechanism of action of warfarin. Warfarin and metabolism of vitamin K 1. Biochemistry 1972; 11 (10): 1959–1961. 20. Kaminsky LS, Zhang ZY. Human P450 metabolism of warfarin. Pharmacol Ther 1997; 73 (1): 67–74. 21. Horton JD, Bushwick BM. Warfarin therapy: evolving strategies in anticoagulation. Am Fam Physician 1999; 59 (3): 635–646. 22. Rost S, Fregin A, Ivaskevicius V, et al. Mutations in VKORC1 cause warfarin resistance and multiple coagulation factor deficiency type 2. Nature 2004; 427 (6974): 537–541. 23. United States Food and Drug Administration. FDA Approves Updated Warfarin (Coumadin) Prescribing Information. 8–16–2007.
24. Sconce EA, Khan TI, Wynne HA, et al. The impact of CYP2C9 and VKORC1 genetic polymorphism and patient characteristics upon warfarin dose requirements: proposal for a new dosing regimen. Blood 2005. 25. Ackoff RL. From Data to Wisdom. Journal of Applied Systems Analysis 1989; 16: 3–9. 26. Aithal GP, Day CP, Kesteven PJ, Daly AK. Association of polymorphisms in the cytochrome P450 CYP2C9 with warfarin dose requirement and risk of bleeding complications. Lancet 1999; 353 (9154): 717–719.
27. Guttmacher AE, Collins FS. Welcome to the genomic era. N Engl J Med 2003; 349 (10): 996–998. 28. Mitchell JA, McCray AT, Bodenreider O. From phenotype to genotype: issues in navigating the available information resources. Methods Inf Med 2003; 42 (5): 557–563. 29. Martin-Sanchez F, Maojo V, Lopez-Campos G. Integrating genomics into health information systems. Methods Inf Med 2002; 41 (1): 25–30. 30. Mitchell JA. The impact of genomics on E-health. Stud Health Technol Inform 2004; 106: 63–74.
31. Navigenics. http://www.navigenics.com/. 9–3– 2008. 32. 23andMe. https://www.23andme.com/. 9–3–2008. 33. Klein TE, Altman RB. PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base. Pharmacogenomics J 2004; 4: 1. 34. Greenman C, Stephens P, Smith R, et al. Patterns of somatic mutation in human cancer genomes. Nature 2007; 446 (7132): 153–158.
Original Articles
Prediction of Postpartum Depression Using Multilayer Perceptrons and Pruning
S. Tortajada1; J. M. García-Gómez1; J. Vicente1; J. Sanjuán2; R. de Frutos2; R. Martín-Santos3; L. García-Esteve3; I. Gornemann4; A. Gutiérrez-Zotes5; F. Canellas6; Á. Carracedo7; M. Gratacos8; R. Guillamat9; E. Baca-García10; M. Robles1
1IBIME, Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universidad Politécnica de Valencia, Valencia, Spain; 2Faculty of Medicine, Universidad de Valencia, Valencia CIBERSAM, Spain; 3IMIM-Hospital del Mar and ICN-Hospital Clínic, Barcelona CIBERSAM, Spain; 4Hospital Carlos Haya, Málaga, Spain; 5Hospital Pere Mata, Reus, Spain; 6Hospital Son Dureta, Palma de Mallorca, Spain; 7National Genotyping Center, Hospital Clínico, Santiago de Compostela, Spain; 8Center for Genomic Regulation, CRG, Barcelona, Spain; 9Hospital Parc Tauli, Sabadell, Spain; 10Hospital Jiménez Díaz, Madrid CIBERSAM, Spain
Keywords Multilayer perceptron, neural network, pruning, postpartum depression
Summary Objective: The main goal of this paper is to obtain a classification model based on feed-forward multilayer perceptrons in order to improve postpartum depression prediction during the 32 weeks after childbirth with a high sensitivity and specificity, and to develop a tool to be integrated in a decision support system for clinicians. Materials and Methods: Multilayer perceptrons were trained on data from 1397 women who had just given birth, from seven Spanish general hospitals, including clinical, environmental and genetic variables. A prospective cohort study was made just after delivery, at 8 weeks and at 32 weeks after delivery. The models were evaluated with the geometric mean of accuracies using a hold-out strategy. Results: Multilayer perceptrons showed good performance (high sensitivity and specificity) as predictive models for postpartum depression. Conclusions: The use of these models in a decision support system can be clinically evaluated in future work. The analysis of the models by pruning leads to a qualitative interpretation of the influence of each variable in the interest of clinical protocols.

Methods Inf Med 2009; 48: 291–298
doi: 10.3414/ME0562
received: May 15, 2008
accepted: December 8, 2008
prepublished: March 31, 2009

Correspondence to: Salvador Tortajada, IBIME-Itaca, Universidad Politécnica de Valencia, Camino de Vera s/n, CP 46022 Valencia, Spain. E-mail: [email protected]

1. Introduction
Postpartum depression (PPD) seems to be a universal condition with equivalent prevalence (around 13%) in different countries [1, 2], which implies an increase in medical care costs. Women suffering from PPD feel a considerable deterioration of cognitive and emotional functions that can affect mother-infant attachment. This may have an impact on the child's future development until primary school [3]. The identification of women at risk of developing PPD would be of significant use to clinical practice and would enable preventative interventions to be targeted at vulnerable women.
Multiple studies have been carried out on PPD. Several psychosocial and biological risk
factors have been suggested concerning its etiology. For instance, social support, partner relationships and stressful life events related to pregnancy and childbirth [4], as well as neuroticism [5], have all been pointed out as being important. With respect to biological factors, it has been shown that inducing an artificial decrease in estrogen can cause depressive symptoms in patients with PPD antecedents. Cortisol alteration, thyroid hormone changes and a low rate of prolactin are also relevant factors [6]. In a comparative study with twin samples, Treloar et al. [7] conclude that genetic factors would explain 40% of the variance in PPD predisposition. In Ross et al. [8], a biopsychosocial model for anxiety and depression symptoms during pregnancy and the PPD period was developed using structural equations. However, most of the research studies involving genetic factors are separate from those involving environmental factors. A remarkable exception shows that a functional polymorphism in the promoter region of the serotonin transporter gene seems to moderate the influence of stressful life events on depression [9].
An early prediction of PPD may reduce the impact of the illness on the mother, and it can help clinicians to give appropriate treatment to the patient in order to prevent depression. The need for a prediction model rather than a description model is of paramount importance. Artificial neural networks (ANNs) have a remarkable ability to characterize discriminating patterns and derive
meaning from complex and noisy data sets. They have been widely applied in general medicine for differential diagnosis, classification and prediction of disease, and condition prognosis. In the field of psychiatric disorders, few studies have used ANNs despite their predictive power. For instance, ANNs have been applied to the diagnosis of dementia using clinical data [10] and, more recently, to predicting Alzheimer's disease using mixed effects neural networks [11]. EEG data from patients with schizophrenia, obsessive-compulsive disorder and controls have been used to demonstrate that an ANN was able to correctly classify over 80% of the patients with obsessive-compulsive disorder and over 60% of the patients with schizophrenia [12]. In Jefferson et al. [13], evolving neural networks outperform statistical methods in predicting depression after mania. Berdia and Metz [14] have used an ANN to provide a framework for understanding some of the pathological processes in schizophrenia. Finally, Franchini et al. [15] have applied these models to support clinical decision making in psychopharmacological therapy.
One of the main goals of this paper is to obtain a classification model based on feed-forward multilayer perceptrons in order to predict PPD with high, well-balanced sensitivity and specificity during the 32 weeks after childbirth, using pruning methods to obtain simple models. This study is part of a larger research project on gene-environment interaction in postpartum depression [16]. These models can later be used in a decision support system [17] to help clinicians in the prediction and treatment of PPD. A secondary goal is to find and interpret the qualitative contribution of each independent variable in order to obtain clinical knowledge from the pruned models.
2. Materials and Methods
Data from postpartum women were collected from seven Spanish general hospitals, in the period from December 2003 to October 2004, on the second to third day after delivery. All the participants were Caucasian, none of them were under psychiatric treatment during pregnancy, and all of them were able to read and answer the clinical questionnaires. Women whose children died after delivery were excluded. This study was approved by the Local Ethical Research Committees, and all the patients gave their informed written consent.
Depressive symptoms were assessed with the total score of the Spanish version of the Edinburgh Postnatal Depression Scale (EPDS) [18] just after delivery, at week 8 and at week 32 after delivery. Major depression episodes were established using first the EPDS (cut-off point of 9 or more) at 8 or 32 weeks; probable cases (EPDS ≥ 9) were then evaluated using the Spanish version of the Diagnostic Interview for Genetics Studies (DIGS) [19, 20], adapted to postpartum depression, in order to determine whether the patient was suffering a depression episode (positive class) or not (negative class). All the interviews were conducted by clinical psychologists with previous common training in the DIGS with video recordings. A high level of reliability (K > 0.8) was obtained among interviewers.
From the 1880 women initially included in the study, 76 were excluded because they did not correctly fill out all the scales or questionnaires. With these patients, a prospective study was made just after delivery, at 8 weeks and at 32 weeks after delivery. At the 8-week follow-up, 1407 (78%) women remained in the study. At the 32-week follow-up, 1397 (77.4%) women were evaluated. We compared the cases lost to follow-up with the remainder of the final sample; only the lowest social class was significantly over-represented among the cases lost to follow-up (p = 0.005). A total of 11.5% (160) of the women assessed at baseline, 8 weeks and 32 weeks had a major depressive episode during the eight months of postpartum follow-up. Hence, from a total number of 1397 patients, we had 160 in the positive class and 1237 in the negative class.

2.1 Independent Variables
Based on the current knowledge about PPD, several variables were taken into account in order to develop predictive models. In a first step, psychiatric and genetic information was used. These predictive models are called subject models. Then, social-demographic variables were included in the subject-environment models. For each approach, we used
EPDS (just after childbirth) as an input variable in order to measure depressive symptoms. Table 1 shows the clinical variables used in this study. All participants completed a semi-structured interview that included socio-demographic data: age, education level, marital status, number of children and employment during pregnancy. Personal and family history of psychiatric illness (psychiatric antecedents) and emotional alteration during pregnancy were also recorded. Both are binary variables (yes/no). Neuroticism can be defined as an enduring tendency to experience negative emotional states. It is measured on the Eysenck Personality Questionnaire short scale (EPQ) [21], which is the most widely used personality questionnaire, and consists of 12 items. For this study, the validated Spanish version [22] was used. Individuals who score high on neuroticism are more likely than the average to experience such feelings as anxiety, anger, guilt and depression. The number of experiences is the number of stressful life events of the patient just after delivery, at an interval of 0–8 weeks and at an interval of 8–32 weeks, using the St. Paul Ramsey Scale [23, 24]. This is an ordinal variable and depends on the patient's point of view. Depressive symptoms just after delivery were evaluated by the EPDS. It is a 10-item, self-report scale, and it has been validated for the Spanish population [18]. The best cut-off of the Spanish validation of the EPDS was 9 for postpartum depression. We decided to use its initial value (i.e., at the moment of birth) as an independent variable because the goal is to prevent and predict postpartum depression within 32 weeks. Social support is measured by means of the Spanish version of the Duke UNC social support scale [25], which originally consists of 11 items. This questionnaire is rated just after delivery, at 6–8 weeks and at week 32. For this work, the variable used was the sum of the scores obtained immediately after childbirth plus the scores obtained in week 8. Since we wanted to predict possible depression risk during the first 32 weeks after childbirth, the Duke score at week 32 was discarded for this experiment. Genomic DNA was extracted from the peripheral blood of women. Two functional
polymorphisms of the serotonin transporter gene were analyzed (5-HTTLPR in the promoter region and STin2 within intron 2). For the entire machine learning process, we decided to use the combination genotypes (5-HTT-GC) proposed by Hranilovic in [26]: no low-expressing genotype at either of the loci (HE); low-expressing genotype at one of the loci (ME); low-expressing genotypes at both loci (LE). The medical perinatal risk was measured as seven dichotomous variables: medical problems during pregnancy, use of drugs during pregnancy (including alcohol and tobacco), cesarean, use of anesthesia during delivery, mother medical problems during delivery, medical problems with more admission days in hospital, and newborn medical problems. A two-step cluster analysis was done in order to explore these seven binary variables. This analysis provides an ordinal variable with four values for every woman: no medical perinatal risk, pregnancy problems without delivery problems, pregnancy problems and delivery mother problems, and presence of both other and newborn problems. Other psychosocial and demographic variables were considered in the subject-environment model, such as age, the highest level of education achieved rated on a 3-point scale (low, medium, high), labor situation during pregnancy, household income rated on a 4-point scale (economical level), the gender of the baby, and the number of family members who live with the mother. Every input variable was normalized to the range [0, 1]. Non-categorical variables were represented by one input unit. Missing values were replaced by the variable's mean, if continuous, or by its mode, if discrete. A dummy representation was used for each categorical variable, i.e., one unit represents one of the possible values of the variable, and this unit is activated only when the corresponding variable takes this value. Missing categorical values were simply represented by not activating any of the units.
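A minimal sketch of the preprocessing described above (mean or mode imputation, min-max scaling to [0, 1], and a dummy code in which a missing categorical value activates no unit) is given below; the variable names are illustrative only.

```python
import numpy as np

def preprocess_numeric(column: np.ndarray) -> np.ndarray:
    """Impute missing values with the mean, then rescale to [0, 1]."""
    col = column.astype(float)
    col[np.isnan(col)] = np.nanmean(column)
    return (col - col.min()) / (col.max() - col.min())

def dummy_encode(values, categories):
    """One unit per category; a missing value activates none of the units."""
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])

epqn = preprocess_numeric(np.array([3.0, 7.0, np.nan, 1.0]))      # e.g. neuroticism score
genotype = dummy_encode(["HE", None, "LE"], ["HE", "ME", "LE"])   # 5-HTT-GC groups
print(epqn)
print(genotype)
```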
Table 1 There are 160 cases with postpartum depression (PPD) and 1237 cases without it. The second column shows the number of missing values for each independent variable, where '–' indicates no missing value. The last two columns show the number of patients in each class. For categorical variables, the number of patients (percentage) is shown. For non-categorical variables, the mean ± standard deviation is presented.

Input Variable | No. miss. | No PPD | PPD
Psychiatric antecedents | 76 | |
  No | | 790 (90.3%) | 85 (9.7%)
  Yes | | 374 (83.9%) | 72 (16.1%)
Emotional alterations during pregnancy | – | |
  No | | 73 (81.1%) | 17 (18.9%)
  Yes | | 1164 (89.1%) | 143 (10.9%)
Neuroticism (EPQN) | 6 | 3.25 ± 2.73 | 5.68 ± 3.55
Life events after delivery | 2 | 0.99 ± 1.06 | 1.40 ± 1.09
Life events at 8 weeks | 176 | 0.88 ± 1.09 | 1.69 ± 1.33
Life events at 32 weeks | 64 | 0.87 ± 1.07 | 1.95 ± 1.53
Depressive symptoms (Initial EPDS) | – | 5.64 ± 3.97 | 8.96 ± 4.85
Social support (DUKE) | 10 | 88.06 ± 56.27 | 138.63 ± 82.45
5-HTT-GC | 79 | |
  HE, No low-expressing genotype | | 93 (83.8%) | 18 (16.2%)
  ME, Low-expressing genotype at one locus | | 664 (87.5%) | 95 (12.5%)
  LE, Low-expressing genotype at both loci | | 408 (91.1%) | 40 (8.9%)
Medical Perinatal Risk | – | |
  No problems | | 376 (88.1%) | 51 (11.9%)
  Pregnancy problems | | 426 (86.1%) | 69 (13.2%)
  Mother problems | | 117 (89.3%) | 14 (10.7%)
  Mother and child problems | | 318 (92.4%) | 26 (7.6%)
Age | – | 31.89 ± 4.96 | 36.12 ± 4.42
Educational level | 2 | |
  Low | | 324 (85.5%) | 55 (14.5%)
  Medium | | 518 (88.5%) | 67 (11.5%)
  High | | 393 (91.2%) | 38 (8.8%)
Labor situation during pregnancy | 4 | |
  Employed | | 879 (91.1%) | 86 (8.9%)
  Unemployed | | 136 (86.1%) | 22 (13.9%)
  Student/Housewife | | 103 (85.1%) | 18 (14.9%)
  Leave | | 116 (77.9%) | 33 (22.1%)
Economical level | 17 | |
  Suitable income | | 830 (90.9%) | 83 (9.1%)
  Enough income | | 311 (85.9%) | 51 (14.1%)
  Tight income | | 73 (79.3%) | 19 (20.7%)
  Economical problems | | 7 (53.8%) | 6 (46.2%)
Gender of the baby | 18 | |
  Male | | 599 (89.7%) | 69 (10.3%)
  Female | | 623 (87.6%) | 88 (12.4%)
Number of people living together | 31 | 2.67 ± 0.96 | 2.66 ± 0.77

2.2 ANNs Theoretical Model
ANNs are inspired by biological systems in which large numbers of simple units work in parallel to perform tasks that conventional
computers have not been able to tackle successfully. These networks are made of many simple processors (neurons or units) based on Rosenblatt's perceptron [27]. A perceptron gives a linear combination, y, of the values of its D inputs, x_i, plus a bias value,

y = \sum_{i=1}^{D} x_i w_i + \theta.   (1)
The output, z = f(y), is calculated by applying an activation function to this net input. Generally, the activation function is the identity, the logistic function or the hyperbolic tangent. As these functions are monotonic, the form f(y) still determines a linear discriminant function [28]. A single unit has a limited computing ability, but a group of interconnected neurons has very powerful adaptability and the ability to learn non-linear functions that can model complex relationships between inputs and outputs. Thus, more general functions can be constructed by considering networks having successive layers of processing units, with connections running from every unit in one layer to every unit in the next layer only. A feed-forward multilayer perceptron consists of an input layer with one unit for every independent variable, one or two hidden layers of perceptrons, and an output layer for the dependent variable (in the case of a regression problem) or the possible classes (in the case of a classification problem). A feed-forward multilayer perceptron is fully connected when every unit of each layer receives an input from every unit in the preceding layer and sends its output to every unit in the next layer. Since PPD is considered in this work as a binary dependent variable, the activation function of the output unit was the logistic function, while the activation function of the hidden units was the hyperbolic tangent. As a first approach, fully connected feed-forward multilayer perceptrons were used with one or two hidden layers. The backpropagation learning algorithm with momentum was used to train the networks. The connection weights of the network were updated following the gradient descent rule [29]. Although these models, and ANNs in general, exhibit superior predictive power compared to traditional approaches, they have been labeled as "black box" methods because
they provide little explanatory insight into the relative influence of the independent variables in the prediction process. This lack of explanatory power is a major concern when trying to interpret the influence of each independent variable on PPD. In order to gain some qualitative knowledge of the causal relationships underlying the depression phenomenon, we used several pruning algorithms to obtain simpler and more interpretable models [30, 34].
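As a concrete illustration of Equation (1) and of the architecture used in this work (tanh hidden units, a logistic output unit, and backpropagation with momentum), a minimal Python sketch follows. It is not the authors' implementation; the class and function names, the learning rate, and the momentum value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

class MLP:
    """Fully connected one-hidden-layer perceptron: tanh hidden units, logistic output."""
    def __init__(self, n_in, n_hidden):
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))   # input-to-hidden weights
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)            # hidden-to-output weights
        self.b2 = 0.0

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)        # per hidden unit: eq. (1) followed by tanh
        return logistic(self.W2 @ h + self.b2), h  # logistic output = estimated P(PPD | x)

def train(net, X, y, epochs=200, lr=0.05, momentum=0.9):
    """Plain backpropagation with momentum for the cross-entropy loss (illustrative values)."""
    vW1 = np.zeros_like(net.W1); vb1 = np.zeros_like(net.b1)
    vW2 = np.zeros_like(net.W2); vb2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            z, h = net.forward(x)
            delta_out = z - t                               # d(loss)/d(output pre-activation)
            delta_hid = (1 - h ** 2) * (net.W2 * delta_out)  # tanh' = 1 - tanh^2
            # momentum update: v <- m*v - lr*grad ; w <- w + v
            vW2 = momentum * vW2 - lr * delta_out * h;          net.W2 += vW2
            vb2 = momentum * vb2 - lr * delta_out;              net.b2 += vb2
            vW1 = momentum * vW1 - lr * np.outer(delta_hid, x); net.W1 += vW1
            vb1 = momentum * vb1 - lr * delta_hid;              net.b1 += vb1
    return net
```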
2.2.1 Pruning Algorithms

Based on the fundamental idea in Wald statistics, pruning algorithms estimate the importance of a parameter (or weight) in the model by how much the training error increases if that parameter is eliminated. The least relevant parameter is then removed, and the procedure continues iteratively until some convergence condition is reached. These algorithms were initially conceived as a way to achieve good generalization for connectionist models, i.e., the ability to infer a correct structure from training examples and to perform well on future samples. A very complex model can lead to poor generalization or overfitting, which happens when it adjusts to specific features of the training data rather than to the general ones [31]. But pruning has also been used for feature selection with neural networks [32, 33], making their operation easier to understand, since there is less opportunity for the network to spread functions over many nodes. This is important in this critical application, where knowing how the system works is a major concern. The algorithms used here are based on weight pruning. The strategy consists of deleting parameters with small saliency, i.e. those whose deletion will have the least effect on the training error. The Optimal Brain Damage (OBD) algorithm [30] and its descendant, Optimal Brain Surgeon (OBS) [34], use a second-order approximation to predict the saliency of each weight. Pruned models were obtained from fully connected feed-forward neural networks with two hidden units, i.e., there was initially a connection between every unit of a layer and every unit of the consecutive layer. In order to select the best pruned architecture, a validation set was used to compare the networks. Then, when the best model was obtained, the influence of each variable was interpreted in the following way: if an input unit is directly connected to the output unit, a positive weight means that the variable is a risk factor, as it increases the probability of having depression, whereas a negative weight means that the variable is a protective factor. Let a hidden unit be connected to the output unit with a positive weight. If an input unit is connected to this hidden unit with a positive weight, then the variable represented by this unit is a risk factor; if its weight is negative, then it is a protective factor. On the contrary, if the weight between the hidden unit and the output unit is negative, then a positive weight in the connection between the input and the hidden unit means that the variable is a protective factor, and a negative weight means that it is a risk factor. 씰Table 2 summarizes these influences. This interpretation is justified because the hidden units have a hyperbolic tangent as activation function, which bounds their output activations between –1 and 1.

Table 2 Summary of the nature of the variables as being a risk factor or a protective factor depending on the sign of the weights of the input-hidden connection (I-H) and the hidden-output connection (H-O).

I-H   H-O   Factor
+     +     Risk
+     –     Protective
–     +     Protective
–     –     Risk
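The pruning idea can be sketched as follows. For brevity, this toy version measures a weight's saliency directly as the increase in training error when the weight is zeroed (the brute-force version of the quantity that OBD and OBS approximate with second-order terms), so it illustrates the principle rather than reproducing the authors' OBD/OBS procedure. It assumes a network object with W1/W2 arrays and a forward method as in the previous sketch; the fixed-tolerance stopping rule (instead of validation-set comparison) is also an assumption, as is the final sign-rule helper.

```python
import numpy as np

def training_error(net, X, y):
    """Mean cross-entropy of the network on the training set."""
    eps, loss = 1e-12, 0.0
    for x, t in zip(X, y):
        z, _ = net.forward(x)
        loss -= t * np.log(z + eps) + (1 - t) * np.log(1 - z + eps)
    return loss / len(X)

def prune(net, X, y, max_increase=0.01):
    """Iteratively delete the connection whose removal increases training error the least."""
    weights = [net.W1, net.W2]               # prune input-hidden and hidden-output weights
    base = training_error(net, X, y)
    while True:
        best = None                           # (error_increase, flat_view, index)
        for W in weights:
            flat = W.reshape(-1)              # view onto the weight array
            for i, w in enumerate(flat):
                if w == 0.0:
                    continue                  # already pruned
                flat[i] = 0.0
                inc = training_error(net, X, y) - base
                flat[i] = w                   # restore the weight
                if best is None or inc < best[0]:
                    best = (inc, flat, i)
        if best is None or best[0] > max_increase:
            return net                        # stop: every remaining weight matters
        _, flat, i = best
        flat[i] = 0.0                         # remove the least salient connection
        base = training_error(net, X, y)

def factor_signs(net):
    """Classify inputs as risk/protective via the sign rule of Table 2 (aggregation over
    hidden units is a simplification introduced here)."""
    sign = np.sign(net.W1) * np.sign(net.W2)[:, None]
    return ["risk" if s > 0 else "protective" if s < 0 else "pruned"
            for s in sign.sum(axis=0)]
```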
2.2.2 Comparison with Logistic Regression

The significant variables obtained by the pruned models were compared to the ones obtained by logistic regression models. The latter models are used when the dependent variable is categorical with two possible values. Independent variables may be in numerical or categorical form. The logistic function can be transformed using the logit transformation into a linear model [35]:
g(x) = Σ_{i=1}^{D} β_i x_i + β_0.    (2)
The log-likelihood is used for estimating the regression coefficients (βi) of the model. The exponential values of the regression coefficients give the odds ratios, which reflect the effect of each input variable as a risk or a protective factor. To assess the significance of an independent variable, we compare the likelihood of the model with and without the variable. The corresponding test statistic follows a chi-square distribution with one degree of freedom, so it is possible to obtain the associated p-value. Thus, we have both the statistical significance of each factor and its character as a protective or a risk factor. A noteworthy fact is that logistic regression models are limited to linear relationships between dependent and independent variables, a restriction that neural network models can overcome. Thus, the linear relationships between independent variables and the target should be found in both models, while non-linear interactions will appear only in the connectionist model.
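A sketch of this comparison step using the statsmodels package: fit a logistic regression, read the odds ratios exp(βi), and test one variable with the likelihood-ratio chi-square test described above. The variable names and the randomly generated data are placeholders, not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

# Placeholder data: X holds independent variables, y is the 0/1 PPD indicator.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["neuroticism", "life_events", "epds"])
y = (rng.random(200) < 0.11).astype(int)           # roughly the 11% prevalence of the study

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
odds_ratios = np.exp(model.params)                  # exp(beta_i): risk (>1) or protective (<1)
print(model.summary())                              # coefficients with Wald z-tests and p-values

# Likelihood-ratio test for a single variable, as described in the text:
reduced = sm.Logit(y, sm.add_constant(X.drop(columns="neuroticism"))).fit(disp=0)
lr_stat = 2 * (model.llf - reduced.llf)             # follows a chi-square with 1 df
p_value = chi2.sf(lr_stat, df=1)
```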
2.3 Evaluation Criteria

The evaluation of the models was carried out using hold-out validation, where the observations were chosen randomly to form the validation and the evaluation sets. In order to obtain a good error estimate for the predictive model, the database had to be split into three different datasets: the training set with 1006 patients (72%), the validation set with 112 patients (8%), and the test set with 279 patients (20%). Each partition followed the prevalence of the original database (씰see Table 3). The best network architecture and parameters were selected empirically using the validation set and then evaluated with the test set. Overfitting was avoided by using the validation set to stop the learning procedure when the validation mean squared error reached its minimum. Section 3 shows that a single hidden layer was enough to obtain a good predictive model. There is an intrinsic difficulty in the nature of the problem: the dataset is imbalanced [36, 37], in the sense that the positive class is underrepresented compared to the negative class. With this prevalence of negative examples (89%), a trivial classifier consisting of assigning the most prevalent class to every new sample would achieve an accuracy of around 89%, but its sensitivity would be null. The main goal is to obtain a predictive model with good sensitivity and specificity. Both measures depend on the accuracy on positive examples, a+, and the accuracy on negative examples, a–. Increasing a+ comes at the cost of decreasing a–. The relation between these quantities can be captured by the ROC (Receiver Operating Characteristic) curve [38]. The larger the area under the ROC curve (AUC), the higher the classification potential of the model. This relation can also be estimated by the geometric mean of the two accuracies, G = √(a+ · a–), which reaches high values only if both are high and in equilibrium. Thus, if we use the geometric mean to evaluate the trivial model (which always assigns the class with the maximum a priori probability), we see that G = 0, which means that the model is the worst we can obtain.
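The evaluation quantities just defined can be computed as in the following sketch; it is illustrative only, and the default threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """Class-wise accuracies, G-mean, and AUC for a probabilistic binary classifier."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    a_pos = tp / (tp + fn)              # sensitivity (accuracy on positive examples)
    a_neg = tn / (tn + fp)              # specificity (accuracy on negative examples)
    return {"sensitivity": a_pos,
            "specificity": a_neg,
            "G": np.sqrt(a_pos * a_neg),            # geometric mean of the two accuracies
            "AUC": roc_auc_score(y_true, y_score)}

# The trivial classifier that always predicts "no depression" has a_pos = 0, hence G = 0,
# even though its raw accuracy equals the 89% prevalence of the negative class.
```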
Table 3 Number of samples per class of each partition of the original database. The prevalence of the original dataset is observed in each one: 11% for the positive class (major postpartum depression) and 89% for the negative class (no depression).

Dataset      No depression   Major depression   Total
Training     891             115                1006
Validation   99              13                 112
Evaluation   247             32                 279
Total        1237            160                1397
3. Results
씰Table 4 shows the results of the best connectionist models obtained from the first approach. Two models were trained on different sets of input variables: the subject model (SUBJ) and the subject-environment model (SUBENV). Both models included psychiatric antecedents, emotional alterations, neuroticism, life events, depressive symptoms, genetic factors, social support and medical perinatal risk. The SUBENV model also included social and demographic features, such as age, economical and educational level, family members and labor situation. The best model (SUBJ with no pruning) achieved a G of 0.82 and an accuracy of 0.81 (95% CI: 0.76–0.86), with a sensitivity of 0.84 and a specificity of 0.81. In general, SUBENV and non-pruned models tend to perform better than SUBJ and pruned ones, but a χ2 test with Bonferroni correction shows that the differences are not statistically significant. Notice also that the accuracy confidence intervals overlap (씰see Table 4). On the other hand, the use of pruning methods leads to a more understandable model at the expense of a small loss of sensitivity.

A logistic regression was fitted for the SUBJ and SUBENV sets of variables to compare and confirm the significant influence of the features selected by pruning. It is expected that the linear relationships between independent variables and the target should be found in the logistic regression as well as in the neural network models. In the best pruned SUBJ model, the most relevant features appear as statistically significant (α = 0.05) in the logistic regression model. Neuroticism, life events from week 8 to week 32, social support and depressive symptoms are considered risk factors. Moreover, the influence of the 5-HTT-GC combination of low-expressing genotypes, LE, is also significant and appears as a protective factor. The remaining input variables in the logistic regression model (emotional alterations, psychiatric antecedents, pregnancy problems and the 5-HTT-GC combination of no low-expressing genotype, HE) are not significant, but in the pruned model these four variables are seen as risk factors. The difference between the significant factors of the pruned models and of the logistic regression may be explained by non-linear interactions of a higher order, because the independent variables interact with each other as explained in Section 2.2.

Considering the SUBENV model, most of the relevant features appear as significant input variables in the logistic regression: social support, neuroticism, life events from week 8 to week 32, depressive symptoms, a labor situation of leave, and a female baby are risk factors in both models. Pregnancy problems for the mother and the baby appear as a protective factor, which is explained by the proportion of mothers with postpartum depression in the observations (see Table 1). On the other hand, age and the number of people that the patient lives with appear as protective factors in both models, but they have no statistical significance in the regression model, whereas psychiatric antecedents are a risk factor without statistical significance. Again, these differences are due to the interactions between variables, as explained before.

Table 4 Results for the best models with the subject feature models (SUBJ) and the subject-environment feature models (SUBENV). We show the G-mean, the accuracy of the model with its confidence interval at the 5% significance level, and its sensitivity and specificity. Varying the threshold of the classifier we obtain a continuous classifier for which the AUC value is shown. The architecture gives the number of input units, hidden units and output units. When pruning a network, some input variables were discarded because their connections towards any hidden unit were eliminated. Thus, these pruned models are simpler than the original ones and may be more interpretable, although they might lose some sensitivity.

Model    Pruning   Architecture   G      Acc (95% CI)        Sen    Spe    AUC
SUBJ     No        16–14–1        0.82   0.81 (0.76, 0.86)   0.84   0.81   0.82
SUBENV   No        31–3–1         0.81   0.84 (0.80, 0.88)   0.78   0.85   0.84
SUBJ     Yes       9–1–1          0.77   0.78 (0.73, 0.83)   0.75   0.78   0.80
SUBENV   Yes       13–2–1         0.80   0.84 (0.80, 0.88)   0.75   0.84   0.84
Table 5 Independent variables selected for the SUBJ pruned model and the SUBENV pruned model for PPD. risk: risk factor; protect: protective factor; pruned: pruned variable. The table shows which variables were significant for the pruned models and for the logistic regression. If a variable is pruned in the neural network, then it is not considered significant. In the case of a logistic regression, a variable is significant if and only if the p-value < 0.05. As expected, every significant variable in logistic regression was also significant in the neural network model.

Variable                              SUBJ Pruned Net   SUBJ Log Reg (p-value)   SUBENV Pruned Net   SUBENV Log Reg (p-value)
Neuroticism (EPQN)                    risk              risk (= 0.004)           risk                risk (= 0.004)
Social support (DUKE)                 risk              risk (< 0.001)           risk                risk (< 0.001)
Depressive symptoms (Initial EPDS)    risk              risk (< 0.001)           risk                risk (< 0.018)
5-HTT-GC, HE                          risk              not significant          pruned              not significant
5-HTT-GC, LE                          protect           protect (= 0.041)        pruned              not significant
Emotional alteration                  risk              not significant          pruned              not significant
Psychiatric antecedents               risk              not significant          risk                not significant
Pregnancy problems                    risk              not significant          pruned              not significant
Life events at 32 week                risk              risk (< 0.001)           risk                risk (< 0.001)
Life events at 8 week                 risk              not significant          risk                not significant
Gender girl                           –                 –                        risk                risk (< 0.007)
Labor leave                           –                 –                        risk                risk (< 0.008)
Labor active                          –                 –                        protect             protect (< 0.026)
Mother-child problems                 –                 –                        protect             protect (= 0.008)
Age                                   –                 –                        protect             not significant
No. of people living together         –                 –                        protect             not significant
In 씰Table 5, the SUBJ model shows that neuroticism, social support, life events and depressive symptoms are the most outstanding features and that they are risk factors in the prediction of PPD. In the SUBENV model these variables are also the main risk factors, while age and the number of people that the patient lives with are both protective factors, although in the regression model they have no statistical significance.
4. Discussion

The main objective of this study was to fit a feed-forward ANN classification model to predict PPD with high sensitivity and specificity during the first 32 weeks after delivery. The predictive model showing the best G was selected, ensuring a balanced sensitivity and specificity, as Table 4 shows. With this model, we achieved an accuracy of around 81%. From our results, SUBENV models did not significantly improve on SUBJ models for prediction. The major concern for the medical staff is how PPD is influenced by the variables. These independent variables have different influences on the output of the classification model, and these influences depend on the connections between nodes. While logistic regression models detect only linear relationships between the independent variables and the dependent variable, the neural network models can also detect non-linear relationships. Thus, the comparison with logistic regression aims to confirm that the neural network model is not inferring wrong linear influences between independent variables and the dependent variable. We expect that if a linear relationship is found to be significant in the logistic regression model, then it should also be considered by the pruned neural network model, whereas non-linear relationships are only going to be detected by the neural network model, since logistic regression cannot detect these relations. If the logistic regression found an independent variable significant but the neural network failed to detect it, that would be evidence of a wrongly trained model. This situation was not found in this work, as Table 5 shows. In future work, some quantitative techniques will be used in order to obtain a numeric measure of the influence of each input feature
and its interactions, following rule extraction methods [39] or numeric methods [40] for ANNs. Therefore, these prevention models would give clinicians a tool to gain knowledge on PPD. A classification model with this good performance, i.e., high accuracy, sensitivity and specificity, may be very useful in clinical settings. In fact, the ability of neural networks to tolerate missing information could be relevant when some of the variables are missing, thus giving high reliability in the clinical field. Since no comparison was established with other machine-learning techniques, it could be interesting to try Bayesian network models, as they can also deal with missing information, find probabilistic dependencies and show good performance [41]. Our models provided better results than the work done by Camdeviren et al. [42] on the Turkish population. Although the number of patients was comparable, our study included more independent variables than Camdeviren's study, where a logistic regression model and a classification tree were compared to predict PPD. Based on logistic regression, they reached an accuracy of 65.4% with a sensitivity of 16% and a specificity of 95%, which means a G of 0.39. With the optimal decision tree, they obtained an accuracy of 71%, a sensitivity of 22% and a specificity of 94%, which gives a G of 0.45. As they explained, there is also a maximal tree that is very complex and overfitted, so the generalization of this tree is very limited. In the best model achieved, neuroticism, life events, social support and depressive symptoms just after delivery were the most important risk factors for PPD. Therefore, women with high levels of neuroticism, depressive symptoms during pregnancy and a high-expressing HTT genotype are the most likely to suffer from PPD. In this subgroup, a careful postpartum follow-up should be considered in order to improve social support and help to cope with life events [43]. In the long term, the final goal is the improvement of the clinical management of patients with possible PPD. In this sense, ANN models have been shown to be valuable tools for providing decision support, thus reducing the workload on clinicians. The practical solution to integrate these pattern recognition developments into clinical routine workflows is the design of clinical decision support systems (CDSSs)
taking into account also clinical guidelines and user preferences [44]. There are relatively few published clinical trials, and they need more rigorous evaluation methodologies, but the general conclusion is that CDSSs can improve practitioners' performance [45, 46]. In conclusion, four models for predicting PPD have been developed using multilayer perceptrons. These models have the ability to predict PPD during the first 32 weeks after delivery with high accuracy. The use of G as a measure for selecting and evaluating the models yields a high, well-balanced sensitivity and specificity. Moreover, pruning methods can lead to simpler models, which can be easier to analyze in order to interpret the influence of each input variable on PPD. Finally, the models achieved should be incorporated, integrated and clinically evaluated in a CDSS [17] to give this knowledge to clinicians and improve the prevention and early detection of PPD.
Acknowledgments
This work was partially funded by the Spanish Ministerio de Sanidad (PIO41635, Vulnerabilidad genético-ambiental a la depresión posparto, 2006–2008) and the Instituto de Salud Carlos III (RETICS Combiomed, RD07/0067/2001). The authors acknowledge the Programa Torres Quevedo from the Ministerio de Educación y Ciencia, co-funded by the European Social Fund (PTQ05–02–03386 and PTQ-08–01–06802).
References
1. Oates MR, Cox JL, Neema S, Asten P, Glangeaud-Freudenthal N, Figueiredo B, et al. Postnatal depression across countries and cultures: a qualitative study. British Journal of Psychiatry 2004; 46 (Suppl): s10–s16. 2. O’Hara MW, Swain AM. Rates and risk of postnatal depression – a meta analysis. International Review of Psychiatry 1996; 8: 37–54. 3. Cooper PJ, Murray L. Prediction, detection and treatment of postnatal depression. Archives of Disease in Childhood 1997; 77: 97–99. 4. Beck CT. Predictors of postpartum depression: an update. Nursing Research 2001; 50: 275–285. 5. Kendler KS, Kuhn J, Prescott CA. The interrelationship of neuroticism, sex and stressful life events in the prediction of episodes of major depression. American Journal of Psychiatry 2004; 161: 631–636. 6. Bloch M, Daly RC, Rubinow DR. Endocrine factors in the etiology of postpartum depression. Comprehensive Psychiatry 2003; 44: 234–246.
7. Treloar SA, Martin NG, Bucholz KK, Madden PAF, Heath AC. Genetic influences on post-natal depressive symptoms: findings from an Australian twin sample. Psychological Medicine 1999; 29: 645–654. 8. Ross LE, Gilbert EM, Evans SE, Romach MK. Mood changes during pregnancy and the postpartum period: development of a biopsychosocial model. Acta Psychiatrica Scandinavica 2004; 109: 457–466. 9. Caspi A, Sugden K, Moffitt TE, Taylor A, Craig IW, Harrington H, et al. Influence of life stress on depression: moderation by a polimorphism in the 5-HTT gene. Science 2003; 301: 386–389. 10. Mulsant BH, Servan-Schreiber E. A connectionist approach to the diagnosis of dementia. In: Proc. 12th Annual Symposium on Computer Applications in Medical Care; 1988. pp 245–249. 11. Tandon R, Adak S, Kaye JA. Neural networks for longitudinal studies in Alzheimers disease. Artificial Intelligence in Medicine 2006; 36: 245–255. 12. Zhu J, Hazarika N, Chung-Tsoi A, Sergejew A. Classification of EEG signals using wavelet coefficients and an ANN. In: Pan Pacific Conference on Brain Electric Topography. Sydney, Australia; 1994. p 27. 13. Jefferson MF, Pendleton N, Lucas CP, Lucas SB, Horan MA. Evolution of artificial neural network architecture: prediction of depression after mania. Methods Inf Med 1998; 37: 220–225. 14. Berdia S, Metz JT. An artificial neural network stimulating performance of normal subjects and schizophrenics on the Wisconsin card sorting test. Artificial Intelligence in Medicine 1998; 13: 123–138. 15. Franchini L, Spagnolo C, Rossini D, Smeraldi E, Bellodi L, Politi E. A neural network approach to the outcome definition on first treatment with sertraline in a psychiatric population. Artificial Intelligence in Medicine 2001; 23: 239–248. 16. Sanjuán J, Martín-Santos R, García-Esteve L, Carot JM, Guillamat R, Gutiérrez-Zotes A, et al. Mood changes after delivery: role of the serotonin transporter gene. British Journal of Psychyatry 2008; 193: 383–388. 17. Vicente J, García-Gómez JM, Vidal C, Martí-Bonmatí L, del Arco A, Robles M. SOC: A distributed decision support architecture for clinical diagnosis. Biological and Medical Data Analysis; 2004. pp 96–104. 18. García-Esteve L, Ascaso L, Ojuel J, Navarro P. Validation of the Edinburgh Postnatal Depression Scale (EPDS) in Spanish mothers. Journal of Affective Disorders 2003; 75: 71–76. 19. Nurnberger JI, Blehar MC, Kaufmann C, YorkCooler C, Simpson S, Harkavy-Friedman J, et al. Diagnostic interview for genetic studies and training. Archives of Genetic Psychiatry 1994; 51: 849–859. 20. Roca M, Martin-Santos R, Saiz J, Obiols J, Serrano MJ, Torrens M, et al. Diagnostic Interview for Genetic Studies (DIGS): Inter-rater and test-retest reliability and validity in a Spanish population. European Psychiatry 2007; 22: 44–48. 21. Eysenck HJ, Eysenck SBG. The Eysenck Personality Inventory. London: University of London Press; 1964. 22. Aluja A, García O, García LF. A psychometric analysis of the revised Eysenck Personality Questionnaire short scale. Personality and Individual Differences 2003; 35: 449–460.
23. Paykel ES. Methodological aspects of life events research. Journal of Psychosomatic Research 1983; 27: 341–352. 24. Zalsman G, Huang YY, Oquendo MA, Burke AK, Hu XZ, Brent DA, et al. Association of a triallelic serotonin transporter gene promoter region (5-HTTLPR) polymorphism with stressful life events and severity of depression. American Journal of Psychiatry 2006; 163: 1588–1593. 25. Bellón JA, Delgado A, Luna JD, Lardelli P. Validity and reliability of the Duke-UNC-11 questionnaire of functional social support. Atención Primaria 1996; 18: 158–163. 26. Hranilovic D, Stefulj J, Schwab S, Borrmann-Hassenbach M, Albus M, Jernej B, et al. Serotonin transporter promoter and intron 2 polymorphisms: relationship between allelic variants and gene expression. Biological Psychiatry 2004; 55: 1090–1094. 27. Rosenblatt F. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 1958; 65 (6): 386–408. 28. Bishop CM. Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press; 1995. 29. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. The MIT Press 1986. pp 318–362. 30. Le Cun Y, Denker JS, Solla A. Optimal brain damage. Advances in Neural Information Processing Systems 1990; 2: 598–605.
31. Duda RO, Hart PE, Stork DG. Pattern Classification. New York, NY: Wiley-Interscience; 2001. 32. Mao J, Jain AK. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks 1995; 6 (2): 296–317. 33. Leray P, Gallinari P. Feature selection with neural networks. Behaviormetrika 1999; 26: 145–166. 34. Hassibi B, Stork DG, Wolf G. Optimal brain surgeon and general network pruning. In: Proceedings of the 1993 IEEE International Conference on Neural Networks. San Francisco, CA; 1993. pp 293–300. 35. Hosmer DW, Lemeshow S. Applied Logistic Regression. Wiley-Interscience; 2000. 36. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. In: Proc. 14th International Conference on Machine Learning. Morgan Kaufmann; 1997. pp 179–186. 37. Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intelligent Data Analysis Journal 2002; 6 (5): 429–449. 38. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters 2006; 27 (8): 861–874. 39. Saad EW, Wunsch DC. Neural network explanation using inversion. Neural Networks 2007; 20 (1): 78–93. 40. Heckerling PS, Gerber BS, Tape TG, Wigton RS. Entering the black box of neural networks. A descriptive study of clinical variables predicting community-acquired pneumonia. Methods Inf Med 2003; 42: 287–296. 41. Sakai S, Kobayashi K, Nakamura J, Toyabe S, Akazawa K. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods Inf Med 2007; 46: 723–726. 42. Camdeviren HA, Yazici AC, Akkus Z, Bugdayci R, Sungur MA. Comparison of logistic regression model and classification tree: an application to postpartum depression data. Expert Systems with Applications 2007; 32: 987–994. 43. Dennis CL. Psychosocial and psychological interventions for prevention of postnatal depression: systematic review. BMJ 2005; 331 (7507): 15. 44. Fieschi M, Dufour JC, Staccini P, Gouvernet J, Bouhaddou O. Medical Decision Support Systems: Old dilemmas and new paradigms? Methods Inf Med 2003; 42: 190–198. 45. Lisboa PJ, Taktak AFG. The use of artificial neural networks in decision support in cancer: a systematic review. Neural Networks 2006; 19 (4): 408–415. 46. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 2005. bmj.38398.500764.8F.
Original Articles
© Schattauer 2009
A Simple Modeling-free Method Provides Accurate Estimates of Sensitivity and Specificity of Longitudinal Disease Biomarkers
F. Subtil1, 2, 3, 4; C. Pouteil-Noble2, 3, 5; S. Toussaint2, 3, 5; E. Villar2, 3, 5; M. Rabilloud1, 2, 3, 4
1Hospices Civils de Lyon, Service de Biostatistiques, Lyon, France; 2Université de Lyon, Lyon, France; 3Université Lyon 1, Villeurbanne, France; 4CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Equipe Biostatistique Santé, Pierre-Bénite, France; 5Hospices Civils de Lyon, Service de Néphrologie-Transplantation, Centre Hospitalier Lyon-Sud, Pierre-Bénite, France
Keywords Sensitivity and specificity, prognosis, early diagnosis, longitudinal study, biological markers
Summary Objective: To assess the time-dependent accuracy of a continuous longitudinal biomarker used as a test for early diagnosis or prognosis. Methods: A method for accuracy assessment is proposed taking into account the marker measurement time and the delay between marker measurement and outcome. It deals with markers having interval-censored measurements and a detection threshold. The threshold crossing times were assessed by a Bayesian method. A numerical study was conducted to test the procedures, which were later applied to PCR measurements for prediction of cytomegalovirus disease after renal transplantation. Results: The Bayesian method corrected the bias induced by interval-censored measurements on sensitivity estimates, with corrections from 0.07 to 0.3. In the application to cytomegalovirus disease, the Bayesian method estimated the area under the ROC curve to be over 75% during the first 20 days after graft and within five days between marker measurement and disease onset. However, the accuracy decreased quickly as that delay increased and late after graft. Conclusions: The proposed Bayesian method is easy to implement for assessing the time-dependent accuracy of a longitudinal biomarker and gives unbiased results under some conditions.

Correspondence to: Fabien Subtil, Hospices Civils de Lyon – Service de Biostatistique, 162 avenue Lacassagne, 69003 Lyon, France. E-mail: [email protected]

Methods Inf Med 2009; 48: 299–305
doi: 10.3414/ME0583
received: July 1, 2008; accepted: December 12, 2008; prepublished: March 31, 2009

1. Introduction

Today, disease diagnosis is made not only on traditional clinical observations, but also on laboratory results; for example, fluorescence polarization, a measure of cellular functionality, is used to make the diagnosis of breast cancer [1]. Methods have been developed to use those results as diagnostic tests and to compare their accuracies [2–4]. Molecular biology has also contributed to the improvement of early diagnosis or prognosis of diseases. Recent research fields, such as genomics or proteomics, have led to the development of numerous biomarkers for early diagnosis or prognosis [5, 6]. During patient follow-up, it has become frequent to collect repeated measurements of a quantitative biomarker, such as the CA19-9 antigen in screening for recurrence of colorectal cancer [7]. The prognostic value of such longitudinal clinical biomarkers has to be carefully assessed and analyzed [8, 9]. For a clinician, a biomarker is useful if it has a good discriminant accuracy and if its test becomes positive early enough to allow an efficient reaction between marker measurement and the disease clinical manifestation. Thus, the progression of a biomarker's accuracy along the delay between marker measurement and disease onset is of major interest. A marker load may also vary along the time elapsed since inclusion of a patient into a study, regardless of the progression toward disease. Consequently, accuracy analyses should take into account both the marker measurement time and the delay between marker measurement and disease onset. When a marker is measured with the disease present, it is conventional to use a ROC curve to summarize the accuracy of continuous or ordinal tests [9–12]. That curve displays the relationship between sensitivity (true-positive rate) and 1-specificity (false-positive rate) across all possible threshold values set for that test. The test accuracy is then measured by the area under the ROC curve (AUC). This area, lying between 0 and 1, may be interpreted as the probability that the diagnostic test result in a diseased subject exceeds that result in a non-diseased one (for a complete review of classical diagnostic methods, see Pepe [3] and Zhou et al. [13]). Recently, several methods have been proposed to assess the time-dependent accuracy of a biomarker when the measurements are repeated before disease onset [14–20]. A first approach consists of modeling semi-parametrically the time-dependent sensitivity and
specificity or the ROC curve itself [16, 17]; the model's validity may be checked with methods proposed by Cai and Zheng [21]. A second approach models survival conditional on the marker values [18–20]. A third approach models the marker distribution conditional on the disease status [14, 15]. In each of the previous models, effects related to marker measurement time and to the delay between marker measurement and outcome are introduced. In their comprehensive and very instructive review on the subject, Pepe et al. [22] recommended sensitivity be assessed on events that occur exactly t days after marker measurement (incident sensitivity) and not over a delay following the measurement (cumulative sensitivity). Also, they recommended specificity be evaluated in subjects with follow-up long enough to be considered as subjects who will not develop the disease (static specificity). Five out of the six above-mentioned methods [14–18, 20] use this definition of time-dependent accuracy. However, those methods require sophisticated models that are not currently available in standard statistical software. Considering those facts, we developed a simple method to assess the time-dependent accuracy of a longitudinal biomarker using a Bayesian approach. In agreement with the recommendations of Pepe et al., that method takes into account interval-censored measurements and, possibly, biomarkers with a detection threshold. The first section of the present article describes the method. Numerical studies were conducted in order to compare the results obtained with and without consideration of the sparse nature of the measurements. The method is also illustrated by an analysis of data stemming from a clinical study where patients were screened by PCR measurements to predict cytomegalovirus (CMV) disease after renal transplantation.

2. Methods

2.1 Time-dependent Accuracy Definition

Heagerty and Zheng [23] have proposed several ways to integrate time into ROC analysis according to how "cases" and "controls" are defined. As recommended by Pepe et al. [22], the incident sensitivity definition was used here [23]: cases correspond to patients who develop the disease exactly t time units after marker measurement. Thus, for a delay of t time units between marker measurement and outcome, sensitivity is estimated with measurements taken exactly t time units before the outcome. Sensitivity is assessed at different delays t to assess its progression along the delay from marker measurement to outcome. In this article, a positive test is defined as a marker value higher than or equal to a certain threshold (though equal or lower values may be elsewhere considered). If Yi(s) denotes a measurement relative to patient i at time s since his inclusion into the study, and Ti the event onset time, the incident sensitivity for a delay t between marker measurement and outcome and for a threshold c may be formalized as:

Sensitivity(c, t) = P[Yi(s) ≥ c | Ti – s = t]

The progression of sensitivity along t reflects the ability of the test for early prediction of the outcome. Controls are defined as subjects who do not develop the disease τ days after inclusion into the study, τ being a fixed delay long enough to consider as controls patients who will probably never develop the disease. Specificity is estimated using measurements in those patients, which leads to static specificity estimates. A possible progression of specificity after inclusion may be taken into account by estimating specificity using, in the controls, the measurements taken at different periods after inclusion. For each subject of the control group, the highest measurement obtained during the period [sj, sj+1] is kept, sj and sj+1 denoting successive times since inclusion. The definition of specificity may be formalized as:

Specificity(c, τ, sj, sj+1) = P[Yi(s) < c for all s in [sj, sj+1] | Ti > τ]

2.2 Time-dependent Accuracy Estimation

Estimating incident sensitivity requires that a marker measurement be taken exactly t days before the onset of the disease in each subject who developed that disease, which is not the case in most studies. A first method, called the crude method, consists of using, for each case, the last value obtained before Ti – t, introducing a bias because the delay between marker measurement and Ti – t might vary widely from one patient to another. Because of measurement sparsity, a marker threshold value is often crossed between two dates; this leads to "interval-censored data" [24]. For example, for each couple of measurements, the crude method supposes that the marker value was Yi at time ti and Yj at time tj, whereas Yj was actually reached and crossed during the interval ]ti; tj]. Biomarkers with a detection threshold raise similar issues: all that can be known is that the biomarker has crossed the detection threshold between two dates. One way to deal with interval-censored measurements is to estimate the exact threshold crossing times using a Bayesian method with non-informative priors and assuming that, for a given threshold, the crossing times of all patients who crossed it follow a Weibull distribution. The Weibull distribution was chosen because it is commonly used to model times to event, in particular failure times, but other positive distributions can be used if appropriate. The moment at which each observed marker value is crossed by each patient can then be estimated. Unlike the crude method, the Bayesian method uses all the information contained in interval-censored data or measurements below a detection threshold. Then, in patients who develop the disease, the most recent threshold value crossed at Ti – t is used as a diagnostic test for ROC analysis. In patients who do not develop the disease, the diagnostic test used is the highest threshold value crossed between sj and sj+1, obtained using the Bayesian method.

3. Simulation Study

3.1 Numerical Studies

Numerical studies were carried out to compare the results obtained with the crude method to those obtained with the Bayesian method. Let us consider 200 subjects who developed a given disease at time Ti, and 100 subjects who did not develop that disease. Marker measurements were considered throughout a follow-up duration that did not exceed 30 days.
High marker values were considered indicative of disease onset. The way data were simulated is described in 씰Appendix 1. The biomarker predictive ability was assessed by the crude and the Bayesian method. Sensitivity was estimated at t = 2, 4, and 6 days before the outcome. Specificity was estimated only during the period [0, 10[ days after inclusion because, in controls, there was no trend for change of biomarker values over time. One hundred simulations were performed. The means obtained for the 100 areas under the ROC curve and for sensitivities at four threshold values (1, 2, 3, and 4) were compared to the theoretical time-dependent area under the ROC curve and sensitivity assessed according to the process of generation of the biomarker values (씰Table 1).

Table 1 Estimated mean AUC values and sensitivities for thresholds 1, 2, 3, and 4, with their respective standard errors, obtained with the Bayesian method and the crude method over 100 simulations, for three delays between marker measurement and disease outcome.

Delay   Method        AUC             Se 1            Se 2            Se 3            Se 4
2       Theoretical   0.985           0.999           0.971           0.787           0.378
        Bayesian      0.868 (0.037)   0.959 (0.027)   0.842 (0.045)   0.617 (0.056)   0.312 (0.050)
        Crude         0.697 (0.029)   0.885 (0.026)   0.610 (0.029)   0.284 (0.038)   0.085 (0.043)
4       Theoretical   0.791           0.871           0.5             0.129           0.012
        Bayesian      0.616 (0.031)   0.838 (0.025)   0.509 (0.038)   0.175 (0.048)   0.026 (0.029)
        Crude         0.458 (0.045)   0.717 (0.030)   0.299 (0.049)   0.060 (0.055)   0.012 (0.041)
6       Theoretical   0.618           0.664           0.233           0.03            0.001
        Bayesian      0.418 (0.048)   0.682 (0.037)   0.253 (0.055)   0.043 (0.051)   0.006 (0.025)
        Crude         0.342 (0.049)   0.578 (0.041)   0.176 (0.051)   0.027 (0.045)   0.007 (0.032)

True AUCs and sensitivities were estimated according to the process of generation of the biomarker values. Se denotes sensitivity.
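The Bayesian step being compared here, fitting a Weibull distribution to interval-censored crossing times with non-informative priors and then imputing the exact crossing times, can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the authors' WinBUGS model from Appendix 2; the toy data, the random-walk Metropolis sampler, and all tuning constants are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def weibull_cdf(t, shape, scale):
    return 1.0 - np.exp(-(t / scale) ** shape)

def log_posterior(log_shape, log_scale, L, R):
    """Interval-censored Weibull log-likelihood with flat priors on the log-parameters."""
    shape, scale = np.exp(log_shape), np.exp(log_scale)
    p = weibull_cdf(R, shape, scale) - weibull_cdf(L, shape, scale)
    if np.any(p <= 0):
        return -np.inf
    return np.sum(np.log(p))

def metropolis(L, R, n_iter=5000, step=0.1):
    """Random-walk Metropolis sampler for the (shape, scale) of the crossing-time distribution."""
    theta = np.array([0.0, np.log(np.mean(R))])      # crude starting point
    lp = log_posterior(theta[0], theta[1], L, R)
    draws = []
    for _ in range(n_iter):
        prop = theta + rng.normal(0, step, size=2)
        lp_prop = log_posterior(prop[0], prop[1], L, R)
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws.append(np.exp(theta))
    return np.array(draws[n_iter // 2:])             # discard burn-in

def impute_crossing_times(draws, L, R):
    """Impute each crossing time from the fitted Weibull truncated to its censoring interval."""
    shape, scale = draws[rng.integers(len(draws))]
    u = rng.uniform(weibull_cdf(L, shape, scale), weibull_cdf(R, shape, scale))
    return scale * (-np.log(1.0 - u)) ** (1.0 / shape)   # inverse Weibull CDF

# Toy interval-censored data: the threshold was crossed between visit days L and R.
L = np.array([3.0, 6.0, 9.0, 3.0, 12.0])
R = np.array([6.0, 9.0, 12.0, 6.0, 15.0])
draws = metropolis(L, R)
times = impute_crossing_times(draws, L, R)
```

Repeating the truncated inverse-CDF step over many posterior draws yields a set of plausible crossing times per patient and thus propagates the censoring uncertainty into the accuracy estimates.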
3.2 Results

Except for the delay of six days and the threshold value 4, the Bayesian method led to higher sensitivities, with differences ranging between 0.02 and 0.33. The standard errors were roughly of the same order of magnitude with the two methods. The comparisons with the theoretical results showed that, except for the delay of six days and the threshold value 4, the crude method clearly underestimated the test sensitivity and that the use of the Bayesian method corrected this underestimation. Besides, except for a delay of two days, the sensitivities obtained with the Bayesian method were close to the theoretical values, with small differences ranging between –0.05 and 0.03. The precision of the threshold crossing time estimates depends partly on the measurement frequency. With measurements taken approximately every three days, there is a lack of information to precisely estimate the latest threshold crossed two days before the event, especially when the biomarker values increase quickly as the onset of disease comes closer in time. This explains the differences between the theoretical and the Bayesian results. A way to increase the precision of the Bayesian estimates is to make more frequent measurements or to increase the number of cases. Both the Bayesian and the crude method underestimated the specificities at low thresholds (data not shown). This was not due to the exact estimation of the threshold crossing times but to the fact that specificity was assessed using the highest value reached in each control during a given period. The longer the period, the higher the bias. Hence, the choice of the period should be made with great caution. The AUC values obtained with the Bayesian method were higher than those obtained with the crude method and partly corrected the underestimation of accuracy with the latter method. The differences between the Bayesian and the theoretical values came from underestimation of sensitivity at a delay of two days, but also and mainly from underestimation of specificity. The Bayesian method led to a better estimation of sensitivity, which is the aim of the present article. Underestimation of specificity came from the empirical assessment of specificity and not from the exact threshold crossing times.
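Before turning to the clinical application, a short sketch of the bookkeeping that turns estimated crossing times into incident sensitivity and static specificity, following the definitions of Section 2; the data structures (a per-patient list of crossing times, one per candidate threshold) are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

def incident_sensitivity(crossing_times, thresholds, T, t, c):
    """Fraction of cases whose marker had crossed threshold c by time T_i - t.

    crossing_times[i][k] is the (estimated) time at which case i crosses thresholds[k],
    or np.inf if that threshold is never crossed."""
    hits = [max([thr for thr, ct in zip(thresholds, crossing_times[i]) if ct <= T[i] - t],
                default=-np.inf)                 # highest threshold crossed by T_i - t
            for i in range(len(T))]
    return np.mean([h >= c for h in hits])

def static_specificity(crossing_times, thresholds, s_j, s_j1, c):
    """Fraction of controls whose highest threshold crossed within [s_j, s_j1[ stays below c."""
    highest = [max([thr for thr, ct in zip(thresholds, crossing_times[i]) if s_j <= ct < s_j1],
                   default=-np.inf)
               for i in range(len(crossing_times))]
    return np.mean([h < c for h in highest])
```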
4. Example: CMV Disease Prediction after Renal Transplantation

4.1 Study Description

The study involved 68 patients who had undergone kidney transplantation between January 1, 1999 and December 31, 2003, at the Centre Hospitalier Lyon Sud (Lyon, France). All were CMV-seropositive before transplantation; 46 received a CMV-positive graft and 22 a CMV-negative one. They were monitored weekly for CMV by quantitative PCR during the eight weeks after transplantation, semi-monthly until the third month, then monthly until the sixth month. Because the probability of developing CMV disease six months after renal transplantation is low, patients who did not present a CMV disease after a six-month follow-up were considered disease-free. CMV infection was defined as isolation of CMV by early or late viral culture. CMV disease was defined as the presence of the above-defined CMV infection plus either: i) an association of two among the following clinical or biological signs: temperature above 38 °C for at least two days, leukopenia (less than 3.5 G/L), thrombocytopenia (less than 150 G/L), abnormalities of liver enzymes (twice or more the reference levels); ii) isolated leukopenia (less than 3 G/L); or iii) tissue injury (invasive disease). The PCR method had a detection threshold of 200 copies/mL; 321 measurements out of 494 fell below this threshold. Those left-
censored measurements were given value 0. Forty-three subjects developed a CMV disease, with transplantation-to-disease quartiles of 21, 25, and 31 days, respectively. The quartiles relative to the number of measurements in those patients were 3, 4, and 5 measurements, respectively. Most patients who developed a CMV disease had an earlier sharp increase in the viral load (씰Fig. 1). The viral load of the 25 subjects who did not develop the disease remained generally low; however, six of them had a slight increase starting from the 20th day, followed by a decrease starting about the 30th day, then a return to the initial level. This may strongly influence the diagnostic test specificity. However, during the first 30 days, the variability between measurements in subjects who did not develop the disease remained very low.

Fig. 1 PCR measurements for cases (solid lines) and controls (dotted lines) versus measurement day after transplantation. x and y scales have been truncated.

Specificity was estimated at four periods after transplantation, p1 to p4: [0; 10[, [10; 20[, [20; 30[, and [20; 30[ days, respectively, with measurements in 25 patients. Sensitivity was estimated at t = 0, 5, and 10 days before the outcome, with measurements in 43 patients. Threshold crossing times were estimated using the Bayesian method. The model was fitted using the WinBUGS software package [25]; the corresponding code is given in 씰Appendix 2. ROC curves were then constructed with those sensitivity and specificity estimates. There was a large gap between thresholds 0 and 200 on the ROC curves, although there was no information on other in-between thresholds. Therefore, only the partial area above threshold 200 was estimated [26]. The obtained values were transformed into values between 0 and 1, as proposed by McClish [27]. The confidence intervals (CI) for the AUC values and the standard errors (SE) for sensitivity and specificity were assessed by bootstrap, based on 1000 samples.

4.2 Results

Fig. 2 ROC curves estimated at three delays between marker measurement and disease onset (t = 0, 5, and 10 days) and during four periods after transplantation for specificity: p1 = [0; 10[, p2 = [10; 20[, p3 = [20; 30[, and p4 = [20; 30[ days.

For a fixed delay between marker measurement and disease onset, the ROC curves corresponding to the first 10 days (p1) and the 10–20 days (p2) after transplantation were very close (씰Fig. 2). For the two later periods, p3 and p4, the ROC curve lay closer to the diagonal the later the period after transplantation. For each period during which specificity was estimated, the ROC curves also moved closer to the diagonal as the delay between marker measurement and
disease onset increased. AUC estimates in 씰Table 2 show that the test accuracy was good during the first two periods after graft and at a 0- and 5-day delay between test and disease onset (t = 0 and t = 5). The AUC was then over 75%, but it decreased quickly as the period and the delay increased. The decrease of the AUC with later periods was linked to a decrease of specificity in those periods; thus, specificity depended on the period after graft. The decrease of the AUC along the delay between marker measurement and disease onset was linked to a decrease of sensitivity. The discriminant ability was not significantly greater than 0.5 in either the third period p3 with t = 10 or the fourth period p4 with t = 5 or t = 10 (the value 0.5 lies within the 95% confidence interval). At the specific threshold of 200, the sensitivity was above 80% for t = 0, but lower than 50% at t = 10 (씰Table 3). This threshold was associated with a good specificity during the first two periods p1 and p2, but that specificity decreased quickly to less than 50% during the fourth period.
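The 95% confidence intervals for the AUC values in Table 2 were obtained by bootstrap with 1000 samples; a generic percentile-bootstrap sketch of that step follows. It resamples subjects and is not the authors' code, which additionally handles the partial AUC above the 200 copies/mL threshold and the McClish rescaling.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC, resampling subjects."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue                      # resample again if a class is missing
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```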
Table 2 Partial AUC values (95% confidence interval) estimated at three delays between marker measurement and disease onset and during four periods for specificity.

Period after graft (days)   Delay between test and disease onset (days)
                            0                       5                       10
[0; 10[                     0.852 (0.783; 0.907)    0.769 (0.694; 0.833)    0.662 (0.591; 0.721)
[10; 20[                    0.845 (0.780; 0.906)    0.759 (0.684; 0.833)    0.647 (0.574; 0.717)
[20; 30[                    0.757 (0.661; 0.844)    0.669 (0.569; 0.761)    0.550 (0.349; 0.642)
[20; 30[                    0.634 (0.509; 0.759)    0.555 (0.356; 0.678)    0.344 (0.157; 0.555)

Table 3 Estimated sensitivities and specificities (standard error) for quantitative PCR, the threshold being 200 copies/mL.

Sensitivity – Delay between test and disease onset (days)
0          0.814 (0.063)
5          0.651 (0.073)
10         0.442 (0.077)
Specificity – Period after transplantation (days)
[0; 10[    0.960 (0.040)
[10; 20[   0.880 (0.066)
[20; 30[   0.680 (0.091)
[20; 30[   0.480 (0.100)

5. Discussion

The Bayesian method to estimate the exact threshold-crossing times described in this article allows estimating the incident sensitivity and static specificity of a longitudinal biomarker. The numerical studies showed that the crude method underestimated sensitivity in the case of interval-censored measurements whereas, under some conditions, the Bayesian method corrected that bias. In the application, quantitative PCR seemed reliable to predict CMV disease within the five days preceding the disease onset and within the first 20 days after transplantation. At longer delays, the test sensitivity decreased quickly with the increasing delay between marker measurement and disease onset, and the test specificity decreased quickly after the 20th day after transplantation. To our knowledge, this is the first study on early diagnosis of CMV disease that took into account the progression of accuracy with both the marker measurement time and the delay between marker measurement and the clinical detection of the disease. This was found crucial and explains the differences that exist in the literature about quantitative PCR accuracy, where the delay or the measurement period changes from one study to another [28–30].
The use of the highest biomarker value from each control during a given period may lead to an underestimation of specificity; this bias is conservative because we are sure that the true biomarker accuracy is not smaller than the one estimated. There is no consensus throughout the literature on the way to estimate specificity empirically with repeated marker measurements. Our choice was partly motivated by Murtaugh [31], who also kept the highest marker value from each control to estimate specificity. He compared these results to those obtained keeping the average marker value from each control, but the differences were slight. Emir et al. [32, 33], then Slate and Turnbull [15], proposed another way to assess static specificity without modeling it. At a specific threshold, the specificity
with each control was estimated by the proportion of negative tests; then the global specificity was defined as the average of all individual specificities, possibly weighted by the number of measurements per subject. The possible bias of this method was not analyzed; the underestimation might be smaller than the one stemming from Murtaugh's method; however, both methods should lead to similar results when estimation periods are short, with few measurements per subject. All those methods could be used after estimation of the threshold-crossing times. A third method would be to model specificity; but then, the bias would depend on the validity of the model assumptions. Certainly, there is still a lot of work to do on the estimation of specificity with repeated measurements over time. One contribution of this article is the assessment of specificity over different periods. This is relevant when specificity progresses along time after inclusion. The exact estimation of the threshold-crossing times relies on the assumption that, for a specific threshold, the crossing times follow a Weibull distribution. This distribution is commonly used to model failure time data; this is the case in parametric regression for interval-censored data [34–37]. Lindsey [35] compared the results obtained from nine different distributions (including the Weibull, the log-normal, and the gamma distributions) and concluded that, except for heavily interval-censored data, the results may change with the distributional assumptions. However, in the above CMV study, the use of a log-normal distribution led to results, and especially ROC curves, that were almost identical to those obtained with a Weibull distribution. Other forms than incident and static have been proposed for sensitivity and specificity
[23]; for example, estimating the cumulative sensitivity using the measurements taken during the t days preceding the outcome and not exactly t days before the outcome. However, cumulative sensitivity estimates depend on the distribution of the time to disease conditional on the marker measurement time and, thus, do not simply reflect biomarker sensitivity. In the concept of dynamic specificity, the controls are the patients who do not develop the disease during the t days following a measurement. However, in our study, the patients developed CMV disease rapidly after transplantation. Among the subjects whose viral load increased during the few days before disease onset, some developed the disease very soon after the t days following a measurement; these would therefore be considered as controls, inducing a high estimate of the false-positive rate and, thus, an underestimation of the real specificity. Thus, the incident sensitivity/static specificity definition of accuracy is, in our opinion, the best way to integrate the concept of time in ROC analysis. As stated by Pepe et al. [22], it should be used in most studies. Compared to previous methods [15–20], the one proposed here is really easy to implement using standard statistical software (the code for the Bayesian computations under WinBUGS is given in 씰Appendix 2). Moreover, there is no need to define and select a model for biomarker progression, sensitivity, specificity, the ROC curve, or the survival conditional on biomarker values; hence, the method can be very quickly adapted to other settings. Despite the need for a complex modeling phase, the method proposed by Cai et al. [17] remains appealing, but it requires large datasets because each biomarker value for which sensitivity or specificity is estimated adds a new parameter to the model; however, biomarker development studies do not always include a high number of patients. Our method does impose a restriction: it requires control follow-ups to be long enough to assume they are real controls, i.e., the method does not so far allow for censoring, but it may be improved to deal with censored data using ideas similar to those proposed by Cai et al. [17]. The next step of our research will be to analyze the effect of the delay between measurements on accuracy estimates when that delay depends on the last measurement value. Within the context of longitudinal
biomarker modeling, Shardell and Miller [38], then Liu et al. [39], have directly addressed this problem. We hope our simple method will help statisticians undertake complete and precise analyses of longitudinal biomarker accuracy, taking into account the marker measurement time and the delay between marker measurement and outcome. In most studies, this is essential.
Acknowledgments
The authors are grateful to Dr J. Iwaz, PhD, scientific advisor, for his helpful comments on the manuscript.
References 1. Blokh D, Zurgil N, Stambler I, Afrimzon E, Shafran Y, Korech E, Sandbank J, Deutsch M. An information-theoretical model for breast cancer detection. Methods Inf Med 2008; 47: 322–327. 2. Benish WA. The use of information graphs to evaluate and compare diagnostic tests. Methods Inf Med 2002; 41: 114–118. 3. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2003. 4. Sakai S, Kobayashi K, Nakamura J, Toyabe S, Akazawa K. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods Inf Med 2007; 46: 723–726. 5. Maojo V, Martin-Sanchez F. Bioinformatics: towards new directions for public health. Methods Inf Med 2004; 43: 208–214. 6. Goebel G, Muller HM, Fiegl H, Widschwendter M. Gene methylation data – a new challenge for bioinformaticians? Methods Inf Med 2005; 44: 516–519. 7. Liska V, Holubec LJ, Treska V, Skalicky T, Sutnar A, Kormunda S, Pesta M, Finek J, Rousarova M, Topolcan O. Dynamics of serum levels of tumour markers and prognosis of recurrence and survival after liver surgery for colorectal liver metastases. Anticancer Res 2007; 27: 2861–2864. 8. Roy HK, Khandekar JD. Biomarkers for the Early Detection of Cancer: An Inflammatory Concept. Arch Intern Med 2007; 167: 1822–1824. 9. Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 2004; 4: 309–314. 10. Hanley JA. Receiver operating characteristics ROC methodology: The state of the art. Crit Rev Diag Imag 1989; 29: 307–335. 11. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39: 561–577. 12. Pepe MS. Receiver operating characteristic methodology. J Am Stat Ass 2000; 95: 308–311. 13. Zhou X-H, McClish DK, Obuchowski NA. Statistical methods in diagnostic medicine. New York: Wiley; 2002.
14. Etzioni R, Pepe M, Longton G, Hu C, Goodman G. Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Med Decis Making 1999; 19: 242–251. 15. Slate EH, Turnbull BW. Statistical models for longitudinal biomarkers of disease onset. Stat Med 2000; 19: 617–637. 16. Zheng Y, Heagerty PJ. Semiparametric estimation of time-dependent ROC curves for longitudinal marker data. Biostatistics 2004; 5: 615–632. 17. Cai T, Pepe MS, Zheng Y, Lumley T, Jenny NS. The sensitivity and specificity of markers for event times. Biostatistics 2006; 7: 182–197. 18. Zheng Y, Heagerty PJ. Prospective accuracy for longitudinal markers. Biometrics 2007; 2: 332–341. 19. Cai T, Cheng S. Robust combination of multiple diagnostic tests for classifying censored event times. Biostatistics 2008; 9: 216–233. 20. Song X, Zhou X-H. A semiparametric approach for the covariate specific ROC curve with survival outcome. Stat Sinica 2008; 18: 947–966. 21. Cai T, Zheng Y. Model checking for ROC regression analysis. Biometrics 2007; 63: 152–163. 22. Pepe MS, Zheng Y, Jin Y, Huang Y, Parikh CR, Levy WC. Evaluating the ROC performance of markers for future events. Lifetime Data Anal 2008; 14: 86–113. 23. Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics 2005; 61: 92–105. 24. Lindsey JC, Ryan LM. Tutorial in biostatistics: methods for interval-censored data. Stat Med 1998; 17: 219–238. 25. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 2000; 10: 325–337. 26. Zhang DD, Zhou X-H, Freeman DH, Freeman JL. A non-parametric method for the comparison of partial areas under ROC curves and its application to large health care data sets. Stat Med 2002; 21: 701–715. 27. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989; 9: 190–195. 28. Naumnik B, Malyszko J, Chyczewski L, Kovalchuk O, Mysliwiec M. Comparison of serology assays and polymerase chain reaction for the monitoring of active cytomegalovirus infection in renal transplant recipients. Transplant Proc 2007; 39: 2748–2750. 29. Mhiri L, Kaabi B, Houimel M, Arrouji Z, Slim A. Comparison of pp65 antigenemia, quantitative PCR and DNA hybrid capture for detection of cytomegalovirus in transplant recipients and AIDS patients. J Virol Methods 2007; 143: 23–28. 30. Madi N, Al-Nakib W, Mustafa AS, Saeed T, Pacsa A, Nampoory MR. Detection and monitoring of cytomegalovirus infection in renal transplant patients by quantitative real-time PCR. Med Princ Pract 2007; 16: 268–273. 31. Murtaugh PA. ROC curves with multiple marker measurements. Biometrics 1995; 51: 1514–1522. 32. Emir B, Wieand S, Su JQ, Cha S. Analysis of repeated markers used to predict progression of cancer. Stat Med 1998; 17: 2563–2578. 33. Emir B, Wieand S, Jung S-H, Ying Z. Comparison of diagnostic markers with repeated measurements: a non-parametric ROC curve approach. Stat Med 2000; 19: 511–523.
34. Odell PM, Anderson KM, D’Agostino RB. Maximum likelihood estimation for interval-censored data using a Weibull-based accelerated failure time model. Biometrics 1992; 48: 951–959. 35. Lindsey JK. A study of interval censoring in parametric regression models. Lifetime Data Anal 1998; 4: 329–354. 36. Collett D. Modelling Survival Data in Medical Research. London: Chapman and Hall; 2003. 37. Sparling YH, Younes N, Lachin JM, Bautista OM. Parametric survival models for interval-censored data with time-dependent covariates. Biostatistics 2006; 7: 599–614. 38. Shardell M, Miller RR. Weighted estimating equations for longitudinal studies with death and non-monotone missing time-dependent covariates and outcomes. Stat Med 2008; 27: 1008–1025. 39. Liu L, Huang X, O’Quigley J. Analysis of Longitudinal Data in the Presence of Informative Observational Times and a Dependent Terminal Event, with Application to Medical Cost Data. Biometrics 2008; 64: 950–958.
Appendix 1
1. Generation of the Simulated Data

1.1 Notation
i = subject index; k = kth marker measurement; sik = time of the kth measurement for the ith subject; Δik = delay between the kth measurement and the diagnosis time for the ith subject.

1.2 Sampling Times (sik)
Patients should have a biomarker measurement every three days for 30 days after inclusion into the study; in practice, however, the measurement is often delayed. Generate:
sik = 3k + εik, k = 0, ..., 9, with
εik ~ uniform(1, 2.95) if k = 0, and εik ~ uniform(0, 2.95) if k > 0.

1.3 Time of Diagnosis
The time of diagnosis was generated as follows:
Ti ~ uniform(15, 20) with probability 0.4,
Ti ~ uniform(20, 30) with probability 0.6.

1.4 Biomarker Values
For controls: throughout each simulation, controls have their own biomarker value normally distributed with mean 1 and variance 0.25; for each measurement, an error is added that follows a normal distribution with mean 0 and variance 0.49.
For cases: biomarker values are generated as for controls up to eight days before diagnosis; for later measurements, an extra term is added:
exp(2 – (0.5 + δi) Δik),
where δi corresponds to the patient-specific biomarker increase with time between marker measurement and diagnosis; it follows a normal distribution with mean 0 and variance 0.0025. Measurements taken after the time of diagnosis are removed.

2. Calculation of the Theoretical AUC Values
When biomarkers follow normal distributions in the diseased and non-diseased populations (respectively N(μD, σD²) and N(μD̄, σD̄²)), Pepe et al. [3] showed that the AUC of the ROC curve is given by
AUC = Φ(a / √(1 + b²)),
where a = (μD – μD̄)/σD, b = σD̄/σD, and Φ denotes the standard normal cumulative distribution function. According to the process of generation of biomarker values, during each period, measurements in control subjects follow a normal distribution with mean 1 and variance 0.25 + 0.49. In cases, for a delay Δ between the marker measurement and the diagnosis time, the biomarker values follow a normal distribution with mean
1 + exp(0.5(4 – Δ))
and variance
exp(4 – Δ) × Var(exp(–δ × Δ)),
where δ follows a normal distribution with mean 0 and variance 0.0025. For small delays Δ, the variance may be approximated using the delta method; for our applications, the variance was estimated using 10^7 random values drawn from a normal distribution with mean 0 and variance 0.0025. These results allow us to calculate the theoretical AUC for each period and each delay between marker measurement and the onset of disease.
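To make the generation scheme above concrete, the following is a minimal Python sketch (our illustration, not the authors' code; the function and variable names are ours) that generates the measurement times and marker values of one subject as described in Sections 1.1–1.4 of this appendix.

# Minimal illustrative sketch (not the authors' code): one subject simulated
# as described in Sections 1.1-1.4 of this appendix.
import numpy as np

rng = np.random.default_rng(1)

def simulate_subject(case: bool):
    """Return (measurement times, biomarker values) for one simulated subject."""
    # 1.2 Sampling times: a measurement intended every 3 days over 30 days,
    # each delayed by a uniform amount.
    eps = np.concatenate(([rng.uniform(1, 2.95)], rng.uniform(0, 2.95, size=9)))
    s = 3 * np.arange(10) + eps

    # 1.4 Biomarker values: subject-specific level N(1, 0.25) plus a
    # measurement error N(0, 0.49) added to every measurement.
    level = rng.normal(1.0, np.sqrt(0.25))
    y = level + rng.normal(0.0, np.sqrt(0.49), size=s.size)

    if not case:
        return s, y

    # 1.3 Time of diagnosis: mixture of two uniform distributions.
    T = rng.uniform(15, 20) if rng.random() < 0.4 else rng.uniform(20, 30)

    # 1.4 Cases: within the last eight days before diagnosis an extra term
    # exp(2 - (0.5 + delta_i) * Delta_ik) is added; delta_i ~ N(0, 0.0025).
    delta_i = rng.normal(0.0, np.sqrt(0.0025))
    gap = T - s                                  # delay to diagnosis (Delta_ik)
    rise = np.where((gap >= 0) & (gap <= 8),
                    np.exp(2 - (0.5 + delta_i) * gap), 0.0)
    y = y + rise

    keep = s <= T                                # drop measurements after diagnosis
    return s[keep], y[keep]

# Example use: times, values = simulate_subject(case=True)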
Appendix 2
The WinBUGS code for estimating the exact threshold-crossing time (see the paragraph "ROC curve analysis"):

model {
  for (i in 1:N) {   # N is the number of threshold crossings
    # left[i]: date of the last PCR measurement whose result was below the threshold
    # right[i]: date of the first PCR measurement whose result was at or above the threshold
    crossing_time[i] ~ dweib(r, mue)I(left[i], right[i])
  }
  r ~ dgamma(1.0E-3, 1.0E-3)
  mue <- exp(mu)
  mu ~ dnorm(0, 0.000001)
}
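For readers without WinBUGS, a rough non-Bayesian analogue is sketched below (our illustration, not part of the paper): maximum-likelihood estimation of the same interval-censored Weibull model with SciPy, using the BUGS parameterisation S(t) = exp(−mu·t^r); the example intervals at the end are hypothetical.

# Rough non-Bayesian analogue of the WinBUGS model above (our sketch, not part
# of the paper): maximum-likelihood fit of a Weibull distribution to
# interval-censored threshold-crossing times, BUGS parameterisation
# S(t) = exp(-mu * t**r).
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, left, right):
    # Each crossing time is only known to lie in (left, right]; its likelihood
    # contribution is S(left) - S(right).
    log_r, log_mu = params                       # log scale keeps r and mu positive
    r, mu = np.exp(log_r), np.exp(log_mu)
    contrib = np.exp(-mu * left**r) - np.exp(-mu * right**r)
    return -np.sum(np.log(np.clip(contrib, 1e-300, None)))

def fit_crossing_times(left, right):
    left = np.asarray(left, dtype=float)         # last PCR below the threshold
    right = np.asarray(right, dtype=float)       # first PCR at or above it
    res = minimize(neg_log_lik, x0=[0.0, 0.0], args=(left, right),
                   method="Nelder-Mead")
    r_hat, mu_hat = np.exp(res.x)
    return r_hat, mu_hat

# Example with hypothetical intervals (days): fit_crossing_times([3, 6, 9], [6, 9, 12])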
© Schattauer 2009
Original Articles
On Graphically Checking Goodness-of-fit of Binary Logistic Regression Models G. Gillmann1, 2; C. E. Minder2, 3 1Swiss Federal Statistical Office, Neuchâtel, Switzerland; 2Department of Social and Preventive Medicine, University of Bern, Bern, Switzerland; 3Horten Zentrum, University of Zurich, Zurich, Switzerland
Keywords Nonlinear models, residuals
Summary Objectives: This paper is concerned with checking goodness-of-fit of binary logistic regression models. For the practitioners of data analysis, the broad classes of procedures for checking goodness-of-fit available in the literature are described. The challenges of model checking in the context of binary logistic regression are reviewed. As a viable solution, a simple graphical procedure for checking goodness-of-fit is proposed. Methods: The graphical procedure proposed relies on pieces of information available from any logistic analysis; the focus is on combining and presenting these in an informative way. Results: The information gained using this approach is presented with three examples. In the discussion, the proposed method is put into context and compared with other graphical procedures for checking goodness-of-fit of binary logistic models available in the literature. Conclusion: A simple graphical method can significantly improve the understanding of any logistic regression analysis and help to prevent faulty conclusions.
Methods Inf Med 2009; 48: 306–310 doi: 10.3414/ME0571 received: May 15, 2008 accepted: September 20, 2008 prepublished: March 31, 2009 Correspondence to: Gerhard Gillmann Swiss Federal Statistical Office Espace de l’Europe 10 2010 Neuchâtel Switzerland E-mail: [email protected]
1. Introduction
Studies of the medical literature find frequent and enduring misuse of statistical methods, including in connection with modeling (see e.g. [1]). Along with other omissions, checking goodness-of-fit is severely under-reported and probably under-used. Given the reliance on model-based inference and conclusions, this is an unsatisfactory state of affairs. In this paper we consider checking goodness-of-fit for binary logistic regression models. We chose this class of models because they are widely used. Standard statistical wisdom says that once a statistical model has been chosen and fitted, it needs to be checked against the data it is applied to [2–5]. There are two general approaches to checking the goodness-of-fit of statistical models: formal tests of goodness-of-fit and graphical analyses of residuals. Several formal tests of goodness-of-fit for binary logistic models have been proposed [6–8]. See Le Cessie and van Houwelingen [9] for a description of the associated problems. These authors also propose a new test designed to avoid the problems associated with the previously suggested tests. The power of several goodness-of-fit tests for logistic regression is examined in [10]. Although some of the tests have theoretical weaknesses [9], favorable findings for detecting moderate nonlinearity with samples exceeding n = 100 were reported. Results were less favorable for detecting interactions and incorrectly specified link functions. From the point of view of good statistical practice, the use of one of the tests of goodness-of-fit recommended in [10] should become routine. This is not onerous on the data analyst, as, for example, a version of the Hosmer-Lemeshow test is implemented in many standard packages for logistic regression. Because, given the fitted values, binary residuals are not informative, graphical approaches based on residuals are not very promising in the case of binary logistic regression (see [11, 12] and the Discussion). In this paper, we focus on an alternative graphical approach based on model-based and model-free expectations originally proposed by Copas in [13]. This approach complements any formal test and provides information regarding the nature of any deviation indicated by the test. In addition, the graphical procedure proposed presents at a glance other useful information on the quality of the model that is otherwise not easily available. In Section 2 we present this graphical procedure. In Section 3, we present three examples to show the extra information that may be gained by applying the proposed method. We have used this approach routinely in recent years and have found it to be helpful. Copas and Marshall [14] gave a thorough and insightful account of the process of logistic modeling in a complex situation, in which they discuss the role of similar graphical procedures for checking goodness-of-fit. In Section 4, we review other proposals for graphically checking goodness-of-fit found in the literature. Modifications of the proposed procedure and areas for future investigation are discussed as well.
2. A Simple Graphical Method
The following procedure can be implemented easily using standard statistical software such as R, SAS or STATA. It works as follows:
1. Fit the model and obtain the estimated probabilities pfit,1, ..., pfit,n corresponding to the observations 1, ..., n.
2. Let k = min(10, √n) and m = [n/k], with [.] designating the integer part.
3. Sort the observations according to the size of their estimated probabilities pfit,i into k bins, each bin containing m (or m + 1) observations. The counts m + 1 arise because the remaining n – k × m fitted values must also be assigned to a bin.
4. For each bin l, calculate the average pf,l of the estimated probabilities pfit,i in the bin.
5. For each bin l, determine the fraction po,l of 1’s among the response values. The po,l are a set of model-free estimates of the average probability of success of an observation in bin l.
6. Plot po,l vs. pf,l, l = 1, ..., k, on a square grid with x- and y-axes each covering the interval [0, 1].
Without doubt, many applied statisticians have used methods similar to the one proposed here (see Discussion). We were, however, not able to find a corresponding reference presenting and investigating the method in some detail. We have applied the procedure presented here routinely in our data analyses of biomedical and epidemiological studies. It has proven useful, especially in the context of multiple logistic regressions with a large number of realized values of the linear predictor; it may prove uninformative in cases with only a small number of such values. As a rule of thumb, at least 100 observations should be available; the smaller the values of k and m in step 2 above, the less powerful the method. The three examples in the next section illustrate that the resulting graph conveys valuable information on the model fit that is not easily accessible otherwise.
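As an illustration only (not the authors' code; it assumes NumPy and Matplotlib are available, and the variable names are ours), steps 1–6 might be written as follows, taking the 0/1 responses and the fitted probabilities from any logistic fit as input.

# Illustrative sketch of steps 1-6 above.
import numpy as np
import matplotlib.pyplot as plt

def binned_fit_plot(y, p_fit):
    """y: 0/1 responses; p_fit: fitted probabilities from a logistic model."""
    y = np.asarray(y, dtype=float)
    p_fit = np.asarray(p_fit, dtype=float)
    n = y.size
    k = int(min(10, np.sqrt(n)))               # step 2: number of bins
    order = np.argsort(p_fit)                  # step 3: sort by fitted probability
    bins = np.array_split(order, k)            # bins of m or m + 1 observations
    p_f = np.array([p_fit[b].mean() for b in bins])   # step 4: model-based bin averages
    p_o = np.array([y[b].mean() for b in bins])       # step 5: model-free bin averages

    plt.figure(figsize=(4, 4))                 # step 6: square plot on [0, 1]
    plt.plot(p_f, p_o, "o")
    plt.plot([0, 1], [0, 1], "--", linewidth=1)    # 45-degree line, orientation only
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.xlabel("bin averages of fitted probabilities")
    plt.ylabel("bin averages of observed responses")
    return p_f, p_o

The dashed diagonal is added only as a visual reference; it is not part of the recipe above.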
3. Three Examples
The examples in this section are all taken from a geriatric intervention study with several years of follow-up [15]. The intervention was designed to reduce dependence on help and nursing home admissions among the elderly. Its effectiveness was judged by, among other things, dependence in basic activities of daily life (BADL). A person is deemed BADL-dependent if she requires assistance in performing one or more of the following activities of daily life: bathing, dressing, eating, transferring from bed to chair, and moving around inside her own residence. Another such measure is dependence in instrumental activities of daily life (IADL). A person is deemed IADL-dependent if he requires assistance with one or more of the following activities of daily life: cooking, handling finances, handling medications, shopping, using public or private transportation, and using the telephone. The strictest, and from a health care finance point of view most interesting, such measure is NHA, the permanent admission to a nursing home. NHA has the drawback of being heavily influenced by factors independent of the functionality of the person considered, such as the presence of a partner at home, the availability of home nursing care, etc. We proceed to illustrate the use of the graphical method proposed above by three examples, one for BADL, one for IADL and one for NHA. Each example is chosen to illustrate some of the information one can gain by applying the proposed method.

3.1 Two Examples of Logistic Regression Showing no Important Discrepancies between Model and Data
Figure 1 shows the relation of observed and predicted probability of being BADL-independent after three years of follow-up. The graph relates to an analysis of the baseline factors associated with BADL-independence three years after beginning the intervention: intervention (yes/no), age (years), risk stratum (low/high) and BADL-independence at baseline (yes/no). Stratum is a binary indicator of the risk of future dependence on help determined at baseline. The graph provides the following information not readily available from a model fit alone.
● It shows that the model achieves good predictive power: its range of predicted probabilities averaged over deciles of BADL-independence pf,l extends from 9.5% to 95%. The corresponding bin average fractions po,l range from 9.8% to 99%, showing good agreement.
● The relation between decile-averaged observed probabilities po,l and model-based decile averages pf,l is approximately linear.
● Together, the first two points show that the bin average of the probability of becoming BADL-independent can be predicted well at baseline by the model.
● The model is well supported by data in the domain with probability of BADL-independence above 50%. However, there is a gap between the first and second decile, from about 10% to about 50%. This is due to the large effect of baseline BADL-dependence on 3-year BADL-dependence: a person who is BADL-dependent at baseline is likely to be BADL-dependent three years later. This large gap also may mean a large bias in the position of the first two points. These points may be far away from the model curve, as the coordinates of each are the result of averaging over a large range.

Fig. 1 A model with good predictive power, but with gaps in the support for the probability of independence in the basic activities of daily life (BADL-independence). X-axis: the values of pf,l, l = 1, ..., k, are bin averages of predicted values; y-axis: the values of po,l, l = 1, ..., k, are bin averages of BADL. For details, see Sections 2 and 3.1.

The model appears to give a satisfactory representation of the study data and should permit reliable conclusions for the larger probabilities. The graph indicates that in future studies one might consider replacing the predictor “baseline BADL-independence” by an indicator permitting a finer gradation of the minor impairments of functionality to improve the model’s predictive power in the lower probability segment of the population.
Figure 2 shows a similar graph for a model of IADL-dependence at three years. This model does not exhibit such large gaps between the fitted probabilities and appears well supported by data in the range between about 11% and 95% IADL-dependence. There is also less potential for serious bias in this case than in the first example.

Fig. 2 A satisfactory model for the probability of dependence in the instrumental activities of daily life (IADL-dependence) with good predictive power and no gaps in its support. X-axis: the values of pf,l, l = 1, ..., k, are bin averages of predicted values; y-axis: the values of po,l, l = 1, ..., k, are bin averages of IADL. For details, see Sections 2 and 3.1.

3.2 An Example of a Less than Satisfactory Logistic Regression
Figure 3 shows an analysis of the response “ever admitted to nursing home” (NHA) at the 3-year follow-up for the persons from the low-risk stratum at baseline. In this case, a naively interpreted logistic regression analysis indicates that the intervention results in a significant improvement. However, this conclusion may be premature due to the bad fit of the model. First, the range of the decile averages of fitted probabilities of nursing home admission varied between 0% and 9.1% only (observed probabilities in the deciles from 0% to 14.3%). Moreover, the graph shows substantial nonlinearity, suggesting that important variables were missing. Using the information conveyed by the graph, one is led to conclude that, despite a significant logistic regression, NHA at three years cannot be predicted well at baseline from the factors available.

Fig. 3 A less than satisfactory model for the probability of nursing home admission (NHA) with very limited support (0% to 9.1% only). X-axis: the values of pf,l, l = 1, ..., k, are bin averages of predicted values; y-axis: the values of po,l, l = 1, ..., k, are bin averages of NHA. Low predictive power and nonlinearities indicate a bad fit.
4. Discussion
4.1 The Proposed Approach
In this paper, a simple but not widely used graphical method of assessing the goodness-of-fit of a logistic regression model is presented. The essence of the procedure goes back to Copas [13]. The method is related to the Hosmer-Lemeshow test [6]. In our experience, graphical methods such as the one presented are indispensable complements to routine model fits and formal tests. As illustrated by the examples, the graphical approach presents at a glance important additional information about the properties and quality of the fitted model which otherwise is
onerous to obtain. Through the proposed graphical representation, shortcomings and the predictive value of the model can be assessed immediately. The scheme presented in Section 2 is only one of many possible graphical representations of the model fit. Its advantages are that it is simple, easily implemented and easily communicated. In marked contrast to the proposals based on nonparametric smoothing discussed below, the graph proposed here illustrates well the data support of the model: the spacing of the points immediately conveys an idea of the regions in which there is little data to support (and test) the model. The graph also shows immediately how good the predictive quality of the model is. It is in this area where we have in fact found frequent use of similar graphs (see, for example, the excellent paper by Copas and Marshall [14]). Similar approaches are used to assess clinical predictive models (see, for example, [16–18]). However, we have not found any paper giving a detailed description of the method and the interpretation of the resulting graph. Drawbacks of the proposed method include its arbitrariness as regards the number and choice of bins. It gives a discrete image of the continuous response curve and hence is coarse and biased. In theory, the method also suffers from a drawback similar to that of the Hosmer-Lemeshow test [9]. Given the detailed information on the predictive power of the model obtained through our approach, we consider this to be of little practical relevance. Our extended experience with the method, as well as the usage of similar methods in the literature, leads us to conclude that the proposal made here is a reasonable compromise. As a rule of thumb, about 10 bins are needed to limit bias and to permit plotting the course of the model prediction with sufficient detail. On the other hand, the number of observations per bin should not be too small, in order to preserve some discriminatory power. The proposed scheme may of course be modified and improved for specific situations, e.g., by varying the number and size of bins according to the configuration of independent variables or other specific needs. What is presented in Section 2 is a simple omnibus proposal, which in our experience has worked reasonably well in many cases.
Copas [13, 14] used a data-driven nonparametric smoother such as LOWESS [19] to produce the model-free estimates of the probabilities. His approach has also been applied in many other papers. Use of a nonparametric smoother reduces bias and eliminates the arbitrariness in the choice of bin number, size and position, and so reduces the arbitrariness of the procedure. It has, however, the disadvantage of not exhibiting the extent of data support, which in the proposed method is conveyed by the spacing of the points. Further work could aim at combining the advantages and eliminating the drawbacks of the bin-based and the nonparametric smoother-based approaches. The following Subsections 4.2 to 4.4 discuss these difficulties and give a short outline of the development of graphical methods for assessing the model quality of logistic regression models.
4.2 Difficulties with Residual-based Methods in Logistic Regression
Let p be the estimated probability that an observation Y equals 1, possibly depending on some independent variables. Due to the discrete nature of the binary responses, residuals can only take the two values –p and 1 – p, corresponding to Y = 0 or Y = 1. Plotting these residuals vs. p produces a graph with two distinct bands of residuals over the span of p, i.e., graphs of residuals from logistic regression always show structure, even if the model is correct. This is in marked contrast to ordinary regression with a continuous response variable, where the presence of structure in the residual plot indicates model inadequacy. Therefore, residual plots from logistic regression models are difficult to interpret and hence of limited use. On a more subtle level, when interpreting graphs of residuals, we intuitively rely on approximate independence between residuals and fitted values, a feature not generally available in nonlinear models such as the logistic. Furthermore, approaches based on residuals are not very promising in the case of binary logistic regression, as the residuals contain only limited information beyond the estimated expected values pi [11, 12]. These difficulties appear to be the principal reason why checking goodness-of-fit
based on residuals is hardly done with logistic regression. For the same reason, most current methods for graphically checking goodness-of-fit in logistic regression rely on some kind of smoothing. For further comments on this topic, see also the introductions to the papers by Copas [13], Fowlkes [20], Cook and Weisberg [21] and section 2.3 of Pardoe and Cook [22].
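As a toy illustration of the two residual bands described above (our sketch, using simulated data; it is not taken from any of the cited papers, and for simplicity the true probabilities stand in for fitted values):

# Toy illustration (ours) of the two bands of binary-regression residuals.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-x))              # probabilities from a correctly specified logistic model
y = rng.binomial(1, p)
resid = y - p                          # each residual is either -p or 1 - p
plt.plot(p, resid, ".", markersize=3)
plt.xlabel("probability p")
plt.ylabel("residual y - p")
# Two distinct bands appear even though the model is exactly correct.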
4.3 Copas Plots and Extensions
Various authors have used plots based on Copas’s suggestion of 1983 [13]. Most frequently, the plots were applied to compare individual predictors with smoothed values of the observed probabilities. Theoretical papers applying such methods include, among others, Landwehr, Pregibon and Shoemaker [23], Spiegelhalter [16], and Fowlkes [20]. Cook and Weisberg proposed extensions to the Copas plot, plotting not only against p, but also against other linear combinations of covariables [21]. They term the resulting graphs marginal model plots. A judiciously chosen set of such graphs may prove useful in extended examinations of a model. Pardoe and Cook extended this work using a Bayesian approach to assess model uncertainty in marginal plots [22]. This paper also provides extensive and detailed justification of why marginal plots (and hence the Copas plot) are to be preferred to residual plots in logistic regression. The tool kit assembled by Cook, Weisberg and Pardoe provides a Bayesian alternative to the simulation and bootstrapping suggested by Landwehr, Pregibon and Shoemaker in [23]. The excellent paper by Copas and Marshall is a must-read with regard to applications [14]. It contains a lucid description of the process of developing a logistic model and the role of the Copas plot in this process. Papers by Harrell et al. and Steyerberg et al. use the Copas plot in the development of prediction rules [17, 18]. The paper by Harrell et al. is intended as a tutorial and contains a variety of other model assessment methods besides the Copas plot.
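A rough sketch of the smoother-based variant of the plot discussed in Sections 4.1 and 4.3 is given below (our illustration, not code from any of the cited papers; it assumes statsmodels and Matplotlib are installed, and the smoothing fraction frac is an arbitrary choice, not a value taken from the literature).

# Sketch of a smoother-based (LOWESS) variant of the model-free estimate.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def smoothed_fit_plot(y, p_fit, frac=0.4):
    y = np.asarray(y, dtype=float)
    p_fit = np.asarray(p_fit, dtype=float)
    smoothed = lowess(y, p_fit, frac=frac)     # (sorted p_fit, smoothed y) pairs
    plt.figure(figsize=(4, 4))
    plt.plot(smoothed[:, 0], smoothed[:, 1], "-", label="LOWESS of observed vs. fitted")
    plt.plot([0, 1], [0, 1], "--", linewidth=1, label="perfect agreement")
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.xlabel("fitted probability")
    plt.ylabel("smoothed observed probability")
    plt.legend()

Unlike the binned plot, this version does not show where the data actually support the model, which is the trade-off discussed in Section 4.1.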
4.4 Other Graphical Methods for Model Assessment
In 1968, Cox and Snell proposed a general definition of residuals, extending normal linear model theory to nonlinear models [24]. They worked out the binomial case with n exceeding 5, obtaining good results. The introduction of their paper gives a lucid description of the problem area. Pregibon proposed various diagnostic plots for the assessment of logistic regression models [25]. Although the residual-based methods proposed in that paper were subsequently questioned and are now no longer much used, the paper was seminal for further methodological developments. The plots showing the effects on parameter estimates and confidence intervals of deleting each observation in turn appear to be still useful. Apart from Copas, Landwehr, Pregibon and Shoemaker were the first to use smoothing to improve the properties of residuals and derived statistics [13, 23]. They present various ways of assessing local discrepancies between data and model, such as the local mean deviance plot and partial residual plots (to which they do not yet apply smoothing). As a global assessment tool, they propose Copas’s plot, supplemented with (computer-intensive) simulated confidence bands. Fowlkes mentions as a drawback of Landwehr-style partial residuals the pronounced structure these residuals exhibit, a feature due to their binomial nature [20]. He proceeds to develop smoothed residuals designed to remove this undesirable structure, and local test statistics based on smoothed residuals. For fairly large samples of n = 400 observations, Fowlkes’s methods appear to work well.
Our literature review suggests that recent work elaborates the approach pioneered by Copas, a variant of which is suggested in the present paper as an omnibus graphical model assessment tool.
Acknowledgments We thank our colleagues Ch. Schindler and A. Stuck as well as two reviewers for contributing to this paper by their support, insights, comments and suggestions as well as the permission to use data.
References 1. Strasak AM, et al. The Use of Statistics in Medical Research: A Comparison of The New England Journal of Medicine and Nature Medicine. The American Statistician 2007; 61: 47–55. 2. Davison AC. Statistical models. Cambridge: Cambridge University Press; 2003. Section 8.6. 3. Draper N, Smith H. Applied Regression Analysis. New York: Wiley & Sons; 1998. 4. Weisberg S. Applied Linear Regression. New York: Wiley & Sons; 1985. 5. Daniel C, Wood F, Gorman JW. Fitting Equations to Data: Computer Analysis of Multifactor Data. New York: Wiley & Sons; 1980. 6. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: Wiley & Sons; 1989. 7. Brown CC. On a goodness-of-fit test for the logistic model based on score statistics. Communications in Statistics 1982; 11 (10): 1087–1105. 8. Stukel TA. Generalized logistic models. Journal of the American Statistical Association 1988; 83: 426–431. 9. Le Cessie S, van Houwelingen JC. A Goodness-of-fit Test for Binary Regression Models Based on Smoothing Methods. Biometrics 1991; 47: 1267–1282. 10. Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model. Statistics in Medicine 1997; 16: 965–980. 11. Fienberg SE, Gong GD. Comment following the paper of Landwehr, Pregibon and Shoemaker. Journal of the American Statistical Association 1984; 79: 72–77. 12. Copas JB. Binary Regression Models for Contaminated Data (with discussion). Journal of the Royal Statistical Society B 1988; 50: 225–265. 13. Copas JB. Plotting p against x. Journal of the Royal Statistical Society C (Applied Statistics) 1983; 32: 25–31. 14. Copas JB, Marshall P. The offender group reconviction scale: A statistical reconviction score for use by probation officers. Journal of the Royal Statistical Society, Series C (Applied Statistics) 1998; 47: 159–171. 15. Stuck AE, Minder CE, Peter-Wuest I, Gillmann G, et al. A Randomized Trial of In-Home Visits for Disability Prevention in Community-Dwelling Older People at Low and High Risk for Nursing Home Admission. Archives of Internal Medicine 2000; 160: 977–986. 16. Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 1986; 5: 421–433. 17. Harrell FE, et al. Tutorial in Biostatistics. Development of a clinical prediction model for an ordinal outcome. Statistics in Medicine 1998; 17: 909–944. 18. Steyerberg EW, Vergouwe Y, Keizer HJ, Habbema JDF. Residual mass histology in testicular cancer: development and validation of a clinical prediction rule. Statistics in Medicine 2001; 20: 3847–3859. 19. Cleveland WS. Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 1979; 74: 829–836. 20. Fowlkes E. Some Diagnostics for Binary Logistic Regression via Smoothing Methods. Biometrika 1987; 74: 503–515. 21. Cook RD, Weisberg S. Graphics for Assessing the Adequacy of Regression Models. Journal of the American Statistical Association 1997; 92: 490–499. 22. Pardoe I, Cook RD. A Graphical Method for Assessing the Fit of a Logistic Regression Model. The American Statistician 2002; 56: 263–272. 23. Landwehr JM, Pregibon D, Shoemaker AC. Graphical Methods for Assessing Logistic Regression Models. Journal of the American Statistical Association 1984; 79: 61–71. 24. Cox DR, Snell EJ. A General Definition of Residuals. Journal of the Royal Statistical Society B 1968; 30: 248–275. 25. Pregibon D. Logistic Regression Diagnostics. The Annals of Statistics 1981; 9 (4): 705–724.
News from the International Medical Informatics Association
It is with deep regret and sadness that I share with you the news that Steven Huesing, IMIA Executive Director, passed away on Sunday, April 12, 2009. Steven Huesing was an outstanding person and professional. As Executive Director of the International Medical Informatics Association, he has for many years provided significant and global contributions to the progress of our field. It is through his tireless work that IMIA has developed into the leading international association that it is today. Since the start of his career, in the 1960s, he has been a pioneer and ambassador for the advancement of computers and information technology in healthcare. Among the many recognitions of his contributions, he was honoured for his exceptional work with the prestigious Canadian Health Informatics Award for Lifetime Achievement. Steven has also been described as “one of Canada’s true eHealth pioneers”, serving as Founding President (1975–78) and Executive Director (1980–99) of COACH, Canada’s Health Informatics Association. He was also Editor of the COACH history book, was a co-founder of CHITTA (now the healthcare division of the Information Technology Association of Canada, ITAC) and was Editor and Publisher of Healthcare Information Management & Communications Canada (HIM&CC). He worked in the health industry and informatics from 1964, holding senior executive, CFO, and CIO positions in healthcare facilities, government and voluntary organizations. Among many other achievements during more than 40 years of involvement in IMIA, COACH and other organisations and activities in health informatics within Canada and internationally, Steven established the COACH Founding President’s Award in 1983 to recognize and motivate outstanding health informatics students at the University of Victoria. Steven was actively involved in developing health informatics curricula with several universities, colleges and associations; in 1999, COACH established the Steven Huesing Scholarship for students in health informatics or related programmes at Canadian post-secondary institutions. Many of you who worked with Steven knew that, during the past few months, his health status had been deteriorating. Nevertheless, I and many others wanted to share his optimism that his health would soon become better again; Steven wrote in his e-mail of March 12 on his presence at the next IMIA meetings: “As it stands I will likely not be in Dublin [in April], but I will be in Hiroshima! [in November]”. With Steven we lose a valuable and leading colleague, and more than this, we lose an outstanding person and friend. My thoughts, as well as those of the IMIA Board and all the IMIA community, are with his family in these hard and sad times. This sad news reached us just prior to this issue of ‘Methods’ going to press, so it is too early to say how IMIA plans to commemorate his life and work, but we will inform everyone in due course. IMIA expresses its thanks to all the many National Societies, past presidents, and individuals who have already expressed their condolences on Steven’s passing. A tribute and appreciation of Steven’s life and work will appear as a special paper in the IMIA Yearbook of Medical Informatics 2009. Dr. Reinhold Haux, President, International Medical Informatics Association

Steven A. Huesing, IMIA Executive Director

Remembering Steven Huesing
Steven Huesing’s family would appreciate donations being made in his memory to the University Hospital Foundation for Kidney Transplant Genomics Research, 8440 112 Street, Edmonton, AB T6G TB7, Canada, or to the COACH Steven Huesing Scholarship. Scholarship donations can be made online through the COACH website (http://www.coachorg.com) or by mail to COACH, 301–250 Consumers Road, Toronto, ON M2J 4V6, Canada. The scholarship, established in 1999, commemorates Steven’s spirit, dedication and innovation, for students in health informatics (HI) or related programs. Cards and notes of sympathy and condolence can be sent to The Huesing Family, 37–2508 Hanna Crescent, Edmonton, AB T6M 1B4, Canada.

Informatics Professor
Bill Hersh, Professor and Chair at the Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon, USA and Chair of the IMIA Health and Medical Informatics Education Working Group, has recently set up a blog to share and explore “thoughts on various topics related to biomedical and health informatics”. It can be accessed at http://informaticsprofessor.blogspot.com/
Important MedInfo2010 Dates to Remember
MedInfo Scientific Mentor Scheme – July 1, 2009
This is a new and important innovation for MedInfo 2010. It provides the opportunity for early career researchers or non-native English speakers to submit their papers for early review by members of an international panel of health informatics experts, who will provide feedback prior to the final MedInfo submission date. Participants in the scheme will then have the opportunity to revise their paper in light of the feedback, with revised papers then proceeding through the normal scientific review process.

Deadline for MedInfo 2010 Paper Submissions – September 30, 2009
This is the deadline for all submissions; details of how to submit papers, as well as other types of submission including posters, panels, workshops and tutorials, which embrace the conference theme and which relate to one of the conference topic areas, are available on the MedInfo2010 website at http://www.medinfo2010.org/abstract.

MedInfo2010 – September 13–16, 2010
The beautiful city of Cape Town, South Africa, will host the 13th World Congress on Medical and Health Informatics, the first African MedInfo.

AMIA Launches Global Fellowship Program with $1.18 Million Grant
The American Medical Informatics Association (AMIA) has been awarded a $1.18 million grant from the Bill & Melinda Gates Foundation. Covering the period November 2008 to April 2010, the grant will support the development of a blueprint for training informatics leaders and for promoting health informatics education and training with a focus on low-resource settings. AMIA intends to lead a team of experts to develop scalable approaches to e-health education through the Global Fellowship Program (GFP). Working in collaboration with other partners, AMIA will address the need for a global informatics workforce and scholarly network by educating and training a new generation of leaders in low-resource nations, and by linking them and their institutions to partner institutions affiliated with AMIA and others in the network. AMIA anticipates that, through this work and other related activities, valuable lessons can also be learned in training and education for capacity building and managing high-quality, low-cost health care in the less-developed economies. Barbara Brown (e-mail: [email protected]), AMIA’s GFP Project Director, will be working to lay the groundwork for this project, and welcomes comment and input from colleagues doing similar work through fellowship programs and institutional partnerships. Courtesy: Don Detmer, President and CEO, AMIA

Prof. Graham Wright and Walter Sisulu University
Graham Wright, Chair of the British Computer Society Health Informatics Forum (BCSHIF), Director of the Centre for Health Informatics Research and Development (CHIRAD), and UK representative to IMIA, has been appointed Research Professor of Health Sciences at the Walter Sisulu University (WSU – http://www.wsu.ac.za), Mthatha, South Africa, commencing June 1, 2009. WSU has appointed research champions in all four of its faculties to raise its research profile. The faculty has schools of nursing and medicine, and over 8,000 full-time students. Graham will be responsible for the development of research activity within the Faculty of Health Sciences and for starting a Department of Health Informatics. Over the past few years, Graham, together with CHIRAD members Helen Betts, John Bryant, and Peter Murray, has been teaching an MSc in Health Informatics at WSU, and helping set up a Centre of Excellence in Health Informatics. Graham will continue that development, as part of which CHIRAD South Africa has been established as a not-for-profit company, and in conjunction with Brenda Faye, Sedick Issacs and other colleagues hopes to begin collaborative projects with like-minded researchers around the world. Anyone who has a wish to, in Graham’s words, “help change the world” and is interested in collaborative activities, doctoral supervision, or teaching is invited to contact him.
IMIA’s Explorations in Using Web 2.0
In the previous issue of these pages, we mentioned the IMIA Web 2.0 Exploratory Taskforce. As part of related activities, we have been quietly exploring the use of Web 2.0 tools, with a view to incorporating appropriate applications within IMIA’s daily activity and interaction with its members and the wider health and biomedical informatics communities. There is now an IMIA group on LinkedIn (http://www.linkedin.com); this has over 150 members, many of whom are new to interest and involvement in IMIA, and see benefit in using this and other social networking technologies to interact with like-minded colleagues. AMIA, the BCS Health Informatics Forum and CHIRAD also have LinkedIn groups. An IMIA group on Facebook (http://www.facebook.com), established thanks to Gunther Eysenbach, has over 160 members, and AMIA and APAMI are among other health informatics organisations that have Facebook groups. NI2009, the forthcoming 10th International Congress on Nursing Informatics (http://www.ni2009.org), and Medicine 2.0’09, the 2nd International Congress on Social Networking and Web 2.0 in Medicine and Health (http://www.medicine20congress.com/), have established event pages on Facebook. Twitter feeds (http://www.twitter.com) have also recently been set up for IMIA (@IMIAtweets), NI2009 (@ni2009) and MedInfo2010 (@medinfo, @medinfo2010), although these are currently still in the early stages of development and we will be exploring how best to use them. We welcome thoughts from anyone using these facilities as to how beneficial they find them, and ideas on how IMIA can best make use of such new and emerging technologies and modes of interaction.

Open Health Natural Language Processing (NLP) Consortium
Biomedical informatics researchers at Mayo Clinic and IBM have launched a website for the newly founded Open Health Natural Language Processing (NLP) Consortium, which aims to promote the open source UIMA (Unstructured Information Management applications) framework and SDK (software development kit) as the basis for biomedical NLP systems. The goal is to establish an open source consortium to promote past and current development efforts, including participation in information extraction from electronic medical records. The consortium seeks to facilitate and encourage new annotator and pipeline development, exchange insights, collaborate on novel biomedical natural language processing systems and develop gold-standard corpora for development and testing. Mayo Clinic and IBM have released their clinical NLP technologies into the public domain. The site (http://www.ohnlp.org) will allow the approximately 2,000 researchers and developers working on clinical language systems worldwide to contribute code and further develop the systems. Courtesy: AMIA e-News weekly

Sustainable Collaborations in Health Care Open Source Software
A workshop held on April 3, 2009 at the Med-e-Tel conference in Luxembourg was designed to seek discussion and involvement in collaborative attempts to build sustainable co-operation in the development, use and promotion of free/libre and open source software in healthcare (FLOSS-HC). Participants agreed that, while many health and health informatics open source groups exist, their collaboration and impact have not been as great as they could have been. There is a need, they agreed, for a FLOSS-HC inventory where people can find what already exists, and which should have rating options for existing software projects (similar to the Enterprise Open Source Directory, http://www.eosdirectory.com, but specific to FLOSS-HC). It was also noted that there is not enough communication among existing FLOSS-HC projects, and that the future development of FLOSS-HC products would benefit from opportunities to provide means of communication and collaboration. As a first step, this can be done by a simple wiki system especially for FLOSS-HC. The CHOS-WG website (http://chos-wg.eu) has been established by Etienne Saliez to provide an initial framework for discussions and developments. Participants also agreed to work towards a FLOSS-HC event at Medinfo 2010, as well as encouraging related submissions to the scientific programme. The discussions built on those at the EFMI Special Topic Conference on open source in healthcare held in London in September 2008 (http://www.chirad.info/efmi_stc), and at the OSEHC 2009 (Open Source in European Health Care) event held in Portugal in January 2009. A workshop at MIE2009 (http://www.mie2009.org) in Sarajevo, Bosnia and Herzegovina in August 2009 will also seek input on the nature of the collaborations needed and encourage further participation. Courtesy: Thomas Karopka, Co-chair, EFMI Libre/Free and Open Source Health Informatics Working Group
TIGER Moves into Phase III: Implementation
The Technology Informatics Guiding Education Reform (TIGER) initiative is working to help the USA realise its 10-year goal of electronic health records for all its citizens, primarily through enabling practising nurses and nursing students to fully engage in the unfolding digital electronic era in healthcare. TIGER, which started as a grassroots initiative, now involves over 70 professional nursing organizations, vendors, and governmental entities. The purpose of the initiative is to identify information/knowledge management best practices and effective technology capabilities for nurses. TIGER’s goal is to create and disseminate local and global action plans that can be duplicated within nursing and other multidisciplinary healthcare training and workplace settings. Since 2007, TIGER has successfully brought together hundreds of volunteers in nine collaborative teams to address the areas listed above. The Alliance for Nursing Informatics (ANI), as the enabling organization for TIGER, has developed a communication plan to distribute the TIGER Call to Action Reports. We encourage the organizations that have been involved in the TIGER initiative to each develop their own communication plans. The first two phases have reached conclusion, i.e. phase I engaged stakeholders to create a common vision of electronic health record-enabled nursing practice, while phase II facilitated collaboration among participating organizations to achieve the vision. The TIGER Phase II Executive Summary can be accessed at http://www.tigersummit.com. In 2009, TIGER moves forward, in phase III, to integrate the full set of recommendations from the reports of the nine collaborative teams into the nursing community along with colleagues from disciplines across the continuum of care. This will include:
● developing a U.S. nursing workforce capable of using electronic health records to improve the delivery of healthcare;
● engaging more nurses in the development of a Nationwide Health Information Technology infrastructure;
● accelerating adoption of smart, standard-based, interoperable, patient-centred technology that will make healthcare delivery safer, more timely, accessible, and efficient;
● developing action in crucial areas, including standards and interoperability, informatics competencies, education and faculty development, usability and clinical application design, a “Virtual Demonstration Center”, consumer empowerment and personal health records.
The TIGER project is being led by the Alliance for Nursing Informatics (ANI – http://www.allianceni.org), which is co-sponsored by AMIA and HIMSS, represents more than 5,000 nurses and brings together 26 distinct nursing informatics groups in the United States. ANI crosses academia, practice, industry, and nursing speciality boundaries and works in collaboration with the nearly three million nurses in practice today. Courtesy: Marion Ball, IMIA Past President and Honorary Fellow; Member, TIGER Advisory Council

IMIA Board and General Assembly Meeting, November 2009
The IMIA Board and General Assembly will meet at the International Conference Center, Peace Memorial Park, Hiroshima, Japan in November 2009 as part of the Collaborative Meetings on Health Informatics week, CoMHI in Hiroshima 2009 (http://home.hiroshima-u.ac.jp/~humind1/comhi2009/). In addition to the IMIA Board meeting (November 24, 2009) and the IMIA General Assembly meeting (November 25, 2009), a host of other events will take place. The APAMI 2009 (Asia Pacific Association for Medical Informatics 2009) meeting, with the theme “What are medical records for?”, will take place on November 22–24, 2009; the 29th JCMI (Joint Conference on Medical Informatics) occurs on November 22–25, 2009, with the theme “Trailblazer as the keystone of global health informatics with optimum contributions in security, quality and satisfaction to the World”; and the IMIA Working Group on Security in Health Information Systems (SiHIS) will meet on November 21–24, 2009 for a conference titled “Trustworthiness of health information ... Issues in security and system management for patient safety”. It promises to be a packed and exciting week, and an ideal opportunity to interact with health informatics colleagues from around the world.
Contact details: For IMIA and IMIA News: Dr. Peter J. Murray Associate Executive Director E-mail: [email protected] http://www.imia.org