CHALLENGES AND OPPORTUNITIES OF HEALTHGRIDS
Studies in Health Technology and Informatics

This book series was started in 1990 to promote research conducted under the auspices of the EC programmes Advanced Informatics in Medicine (AIM) and Biomedical and Health Research (BHR, bioengineering branch). A driving aspect of international health informatics is that telecommunication technology, rehabilitative technology, intelligent home technology and many other components are moving together to form one integrated world of information and communication media. The complete series has been accepted in Medline. Volumes from 2005 onwards are available online.

Series Editors: Dr. J.P. Christensen, Prof. G. de Moor, Prof. A. Famili, Prof. A. Hasman, Prof. L. Hunter, Dr. I. Iakovidis, Dr. Z. Kolitsi, Mr. O. Le Dour, Dr. A. Lymberis, Prof. P.F. Niederer, Prof. A. Pedotti, Prof. O. Rienhoff, Prof. F.H. Roger France, Dr. N. Rossing, Prof. N. Saranummi, Dr. E.R. Siegel, Dr. P. Wilson, Prof. E.J.S. Hovenga, Prof. M.A. Musen and Prof. J. Mantas
Volume 120

Recently published in this series:

Vol. 119. J.D. Westwood, R.S. Haluck, H.M. Hoffman, G.T. Mogel, R. Phillips, R.A. Robb and K.G. Vosburgh (Eds.), Medicine Meets Virtual Reality 14 – Accelerating Change in Healthcare: Next Medical Toolkit
Vol. 118. R.G. Bushko (Ed.), Future of Intelligent and Extelligent Health Environment
Vol. 117. C.D. Nugent, P.J. McCullagh, E.T. McAdams and A. Lymberis (Eds.), Personalised Health Management Systems – The Integration of Innovative Sensing, Textile, Information and Communication Technologies
Vol. 116. R. Engelbrecht, A. Geissbuhler, C. Lovis and G. Mihalas (Eds.), Connecting Medical Informatics and Bio-Informatics – Proceedings of MIE2005
Vol. 115. N. Saranummi, D. Piggott, D.G. Katehakis, M. Tsiknakis and K. Bernstein (Eds.), Regional Health Economies and ICT Services
Vol. 114. L. Bos, S. Laxminarayan and A. Marsh (Eds.), Medical and Care Compunetics 2
Vol. 113. J.S. Suri, C. Yuan, D.L. Wilson and S. Laxminarayan (Eds.), Plaque Imaging: Pixel to Molecular Level
Vol. 112. T. Solomonides, R. McClatchey, V. Breton, Y. Legré and S. Nørager (Eds.), From Grid to Healthgrid
Vol. 111. J.D. Westwood, R.S. Haluck, H.M. Hoffman, G.T. Mogel, R. Phillips, R.A. Robb and K.G. Vosburgh (Eds.), Medicine Meets Virtual Reality 13
Vol. 110. F.H. Roger France, E. De Clercq, G. De Moor and J. van der Lei (Eds.), Health Continuum and Data Exchange in Belgium and in the Netherlands – Proceedings of Medical Informatics Congress (MIC 2004) & 5th Belgian e-Health Conference
ISSN 0926-9630
Challenges and Opportunities of HealthGrids
Proceedings of Healthgrid 2006
Edited by
Vicente Hernández and
Ignacio Blanquer

With Tony Solomonides, Vincent Breton and Yannick Legré
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC
© 2006 The authors. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 1-58603-617-3
Library of Congress Control Number: 2006925644

Publisher
IOS Press, Nieuwe Hemweg 6B, 1013 BG Amsterdam, Netherlands
fax: +31 20 687 0019; e-mail: order@iospress.nl

Distributor in the UK and Ireland
Gazelle Books Services Ltd., White Cross Mills, Hightown, Lancaster LA1 4XS, United Kingdom
fax: +44 1524 63232; e-mail: sales@gazellebooks.co.uk

Distributor in the USA and Canada
IOS Press, Inc., 4502 Rachael Manor Drive, Fairfax, VA 22032, USA
fax: +1 703 323 3668; e-mail: iosbooks@iospress.com
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS
Introduction

HealthGrid 2006 (http://valencia2006.healthgrid.org) is the fourth edition of this open forum for the integration of Grid technologies and their applications in the biomedical, medical and biological domains, paving the way towards an international research area in HealthGrid. The main objective of the HealthGrid conference and of the HealthGrid Association is the exchange and discussion of ideas, technologies, solutions and requirements of interest to the Grid and life-sciences communities, in order to foster the integration of Grids into health. Grid middleware and Grid application developers, biomedical and health informatics users, and security and policy makers are encouraged to participate in a set of multidisciplinary sessions with a common concern for applications to health.

HealthGrid conferences have been organized on an annual basis. The first conference, held in 2003 in Lyon (http://lyon2003.healthgrid.org), reflected the need to involve all actors – physicians, scientists and technologists – who might play a part in the application of Grid technology to health, whether in health care or in biomedical research. The second conference, held in Clermont-Ferrand in January 2004 (http://clermont2004.healthgrid.org), reported research and work in progress from a large number of projects. The third conference, held in Oxford (http://oxford2005.healthgrid.org), concentrated on results and on deployment strategies in healthcare. Finally, this edition aims at consolidating the collaboration among biologists, healthcare professionals and Grid technology experts.

The conference includes a number of high-profile keynote presentations complemented by a set of high-quality refereed papers. The number of contributions has increased with respect to previous editions, reaching 44 submissions of papers and demos, with principal authors coming from 14 countries (ordered by number of contributions: France, United Kingdom, Spain, Italy, Germany, Greece, The Netherlands, Belgium, Czech Republic, Cuba, Japan, Romania, Russia and Taiwan). Considering the affiliations of all authors, the number of contributing countries rises to 18, adding Switzerland, Austria, Turkey and the USA.

The contributions of this edition fall under five main topics: Medical Imaging on the Grid; Ethical, Legal and Privacy Issues on HealthGrids; Bioinformatics on the Grid; Knowledge Discovery on HealthGrids; and Medical Assessment and HealthGrid Applications. The maturity of the HealthGrid discipline is clearly reflected in these subjects. Most contributions relate to two main application areas (Medical Imaging and Bioinformatics), confirming the analysis of the HealthGrid White Paper published last year, which identified them as the two most promising areas for HealthGrids. Along with these two areas, the assessment of the results of HealthGrid applications, addressed by several contributions, also denotes the maturity of the field. Finally, the other two areas (Knowledge Discovery and Ethical, Legal and Privacy Issues) focus on basic technologies which are highly relevant for HealthGrids.

In Medical Imaging, the contributions cover the problems of medical image processing and of virtual distributed storage. In this topic there are contributions
focusing on the structuring of medical information through semantic classifications, as in the case of the NeuroBase project presented by Barillot et al. and of the TRENCADIS software architecture presented by Blanquer et al. The problem of encryption and data sharing is an important topic addressed in contributions such as the Medical Data Manager (Montagnat et al.) and other contributions related to privacy. In the area of medical image processing, several papers describe experiences in providing services for neuroimaging. The work of Olabarriaga et al. covers image processing services for fMRI (functional Magnetic Resonance Imaging); Bagnasco et al. describe the application of the Grid to the early diagnosis of Alzheimer's disease by assisting diagnosis on PET/SPECT through Statistical Parametric Mapping; and Bucur et al. address the highly computational problem of fibre tracking. In the area of modelling processes related to medical images, Bellet et al. propose a web interface for the simulation of MRI devices, and Blanquer et al. propose a Grid implementation of processing services for the co-registration of medical images to support a quantitative diagnosis of liver cancer. On this precise topic of image co-registration, Montagnat et al. propose a mechanism to evaluate the quality of co-registration methods using the Grid, in a methodology called "Bronze Standard". Finally, the problem of interactive use of Grids for medical image processing is tackled in the work of Germain-Renaud et al.

In the area of Ethical, Legal and Privacy Issues on HealthGrids, on the one hand, contributions focus on ethical and legal issues, such as the problem of medical consent (Herveg et al.) and the organisation of Virtual Organisations for clinical trials in epidemiology (Sinnott et al.). On the other hand, different technical solutions for privacy enhancement are presented. In the work of Torres et al., a solution for sharing an encrypted and distributed storage of medical images is presented. A similar approach is used by Blanchet et al., who propose a mechanism for encrypting genetic information. Other approaches for sharing and linking distributed repositories of epidemiological data are presented by Ainsworth et al. and Tashiro et al.

The area of Bioinformatics is a very active one in HealthGrids. The increase in size and complexity of genomic databases and protein modelling is opening the door to new Grid applications. Results in large-scale in-silico docking for malaria are presented by Jacq et al., and a grid-enabled protein structure prediction system named Rokky-G is presented in the work of Masuda et al. Another important activity in this topic is the integration of bioinformatics information, where the complexity of browsing data is considered by Schroeder et al. in the frame of the Sealife project. Another approach, based on data mediation, is presented by Colonna et al. for the discovery of predisposition genes. The integration of OGSA-DAI technologies for distributed biochemical data is proposed by Tverdokhlebov et al. The development of genomic processing services and their interfacing to the Grid is presented in the porting of the GPS@ portal (Blanchet et al.) and in the work of Segrelles et al., in which an MPIBlast processing Grid service is developed and integrated in a gene annotation tool (Blast2GO). The early results of the BIOINFOGRID project are presented in the work of Milanesi et al. More consolidated results on bioprofiling are presented in the work of Sun et al.
in the frame of the BIOPATTERN project. Finally, an application of HealthGrid to SARS is described in the work of Hung et al.

In the area of Knowledge Discovery on HealthGrids, contributions focus on the semantic integration of medical information. The work of Boniface et al., in the frame of the ARTEMIS project, focuses on healthcare data, whereas the work of Koutkias et al.
focuses on the semantic integration of bioinformatics data. Semantic integration is the key to knowledge discovery in large databases, to which techniques such as data mining are applied. Tsiknakis et al. propose the use of these techniques for the study of cancer in the ACGT Integrated Project, and McClatchey et al. apply them to the integration of paediatric information in the frame of the Health-e-Child project.

The area of Medical Assessment and HealthGrid Applications covers, on the one hand, medical results of the application of Grid technologies to health and, on the other, applications related to biomedical simulation and clinical environments. The application of Grids to radiotherapy is a classic topic, owing to the maturity of High Energy Physics, revealing new applications of Monte Carlo simulation to Intensity-Modulated Radiation Therapy (Gómez et al.) and interfaces to well-known environments such as GATE (Thiam et al.). Other applications of P2P and Grid technologies show their potential for emergency management (Harrison et al.) and collaborative environments (Kuba et al.). Finally, contributions also focus on the needs of hospital management systems for Grids (Graschew et al.), the success stories of the e-DiaMoND and NeuroGrid projects (Ure et al.) and the exploitation of successful projects on Medical Imaging and Grids, such as the MammoGrid project (del Frate et al.).
ACKNOWLEDGEMENTS

The editors would like to express their gratitude to the Programme Committee and the reviewers; each paper was read by at least two reviewers, including the editors. The editors also wish to thank the staff of the HealthGrid Association, especially Yannick Legré, for the remarkable work invested in these conference proceedings and in the organisation of the conference. Opinions expressed in these proceedings are those of individual authors and editors, and not necessarily those of their institutions.
Healthgrid 2006 Programme Committee

Vicente Hernández – Universidad Politécnica de Valencia, Spain
Ignacio Blanquer – Universidad Politécnica de Valencia, Spain
Vincent Breton – Centre National de la Recherche Scientifique, France
Jose Maria Carazo – Centro Nacional de Biotecnología, Spain
Andres Santos – Universidad Politécnica de Madrid, Spain
Antonio Sousa – Instituto de Engenharia Electrónica e Telemática de Aveiro, Portugal
Armando Padilha – Faculdade de Engenharia da Universidade do Porto, Portugal
Emmanuel Ifeachor – University of Plymouth, United Kingdom
Ferran Sanz – Universitat Pompeu Fabra, Spain
Alfonso Jaramillo – École Polytechnique, France
Fabrizio Gagliardi – Microsoft, Switzerland
Carlos Martinez – Generalitat Valenciana, Spain
Johan Montagnat – Institut National de Recherche en Informatique et Automatique, France
Tony Solomonides – University of the West of England, United Kingdom
Richard McClatchey – University of the West of England, United Kingdom
Martin Hofmann – Fraunhofer Institut für Algorithmen und Wissenschaftliches Rechnen SCAI, Germany
Howard Bilofsky – University of Pennsylvania, USA
Petra Wilson – CISCO, Belgium
Simon Robinson – Empirica GmbH, Germany
Paulo Bisch – Universidade Federal do Rio de Janeiro, Brazil
Luis Núñez de Villavicencio – Universidad de los Andes, Venezuela
Chun-Hsi Huang – University of Connecticut, USA
Mary Kratz – University of Michigan, USA
Contents

Introduction  v
Healthgrid 2006 Programme Committee  viii

Part I. Medical Imaging on the Grid

Federating Distributed and Heterogeneous Information Sources in Neuroimaging: The NeuroBase Project
C. Barillot, H. Benali, M. Dojat, A. Gaignard, B. Gibaud, S. Kinkingnéhun, J.-P. Matsumoto, M. Pélégrini-Issac, E. Simon and L. Temal  3

Bridging Clinical Information Systems and Grid Middleware: A Medical Data Manager
Johan Montagnat, Daniel Jouvenot, Christophe Pera, Ákos Frohner, Peter Kunszt, Birger Koblitz, Nuno Santos and Cal Loomis  14

Grid Scheduling for Interactive Analysis
Cécile Germain-Renaud, Romain Texier, Angel Osorio and Charles Loomis  25

Magnetic Resonance Imaging (MRI) Simulation on EGEE Grid Architecture: A Web Portal Design
F. Bellet, I. Nistoreanu, C. Pera and H. Benoit-Cattin  34

Towards a Virtual Laboratory for fMRI Data Management and Analysis
Silvia D. Olabarriaga, Aart J. Nederveen, Jeroen G. Snel and Robert G. Belleman  43

Service-Oriented Architecture for Grid-Enabling Medical Applications
Anca Bucur, René Kootstra, Jasper van Leeuwen and Henk Obbink  55

Early Diagnosis of Alzheimer’s Disease Using a Grid Implementation of Statistical Parametric Mapping Analysis
S. Bagnasco, F. Beltrame, B. Canesi, I. Castiglioni, P. Cerello, S.C. Cheran, M.C. Gilardi, E. Lopez Torres, E. Molinari, A. Schenone and L. Torterolo  69

Using the Grid to Analyze the Pharmacokinetic Modelling After Contrast Administration in Dynamic MRI
Ignacio Blanquer, Vicente Hernández, Daniel Monleón, José Carbonell, David Moratal, Bernardo Celda, Montse Robles and Luis Martí-Bonmatí  82

Medical Image Registration Algorithms Assessment: Bronze Standard Application Enactment on Grids Using the MOTEUR Workflow Engine
Tristan Glatard, Johan Montagnat and Xavier Pennec  93
Part II. Ethical, Legal and Privacy Issues on HealthGrids

The Ban on Processing Medical Data in European Law: Consent and Alternative Solutions to Legitimate Processing of Medical Data in HealthGrid
Jean Herveg  107

Development of Grid Frameworks for Clinical Trials and Epidemiological Studies
Richard Sinnott, Anthony Stell and Oluwafemi Ajayi  117

Privacy Protection in HealthGrid: Distributing Encryption Management over the VO
Erik Torres, Carlos de Alfonso, Ignacio Blanquer and Vicente Hernández  131

Secured Distributed Service to Manage Biological Data on EGEE Grid
Christophe Blanchet, Rémi Mollon and Gilbert Deléage  142

Part III. Bioinformatics on the Grid

Demonstration of In Silico Docking at a Large Scale on Grid Infrastructure
Nicolas Jacq, Jean Salzemann, Yannick Legré, Matthieu Reichstadt, Florence Jacq, Marc Zimmermann, Astrid Maaß, Mahendrakar Sridhar, Kasam Vinod-Kusam, Horst Schwichtenberg, Martin Hofmann and Vincent Breton  155

A Gridified Protein Structure Prediction System “Rokky-G” and Its Implementation Issues
Shingo Masuda, Minoru Ikebe, Kazutoshi Fujikawa and Hideki Sunahara  158

Sealife: A Semantic Grid Browser for the Life Sciences Applied to the Study of Infectious Diseases
Michael Schroeder, Albert Burger, Patty Kostkova, Robert Stevens, Bianca Habermann and Rose Dieng-Kuntz  167

Advancing of Russian ChemBioGrid by Bringing Data Management Tools into Collaborative Environment
Alexey Zhuchkov, Nikolay Tverdokhlebov and Alexander Kravchenko  179

GPS@ Bioinformatics Portal: From Network to EGEE Grid
Christophe Blanchet, Vincent Lefort, Christophe Combet and Gilbert Deléage  187

Blast2GO Goes Grid: Developing a Grid-Enabled Prototype for Functional Genomics Analysis
G. Aparicio, S. Götz, A. Conesa, D. Segrelles, I. Blanquer, J.M. García, V. Hernández, M. Robles and M. Talon  194

Bioprofiling over Grid for eHealthcare
L. Sun, P. Hu, C. Goh, B. Hamadicharef, E. Ifeachor, I. Barbounakis, M. Zervakis, N. Nurminen, A. Varri, R. Fontanelli, S. Di Bona, D. Guerri, S. La Manna, K. Cerbioni, E. Palanca and A. Starita  205

SARS Grid—An AG-Based Disease Management and Collaborative Platform
Shu-Hui Hung, Tsung-Chieh Hung and Jer-Nan Juang  217
Part IV. Knowledge Discovery on HealthGrids

A Secure Semantic Interoperability Infrastructure for Inter-Enterprise Sharing of Electronic Healthcare Records
Mike Boniface, E. Rowland Watkins, Ahmed Saleh, Asuman Dogac and Marco Eichelberg  225

Constructing a Semantically Enriched Biomedical Service Space: A Paradigm with Bioinformatics Resources
Vassilis Koutkias, Andigoni Malousi, Ioanna Chouvarda and Nicos Maglaveras  236

Building a European Biomedical Grid on Cancer: The ACGT Integrated Project
M. Tsiknakis, D. Kafetzopoulos, G. Potamias, A. Analyti, K. Marias and A. Manganas  247

Health-e-Child: An Integrated Biomedical Platform for Grid-Based Paediatric Applications
Joerg Freund, Dorin Comaniciu, Yannis Ioannis, Peiya Liu, Richard McClatchey, Edwin Morley-Fletcher, Xavier Pennec, Giacomo Pongiglione and Xiang (Sean) Zhou  259

Part V. Medical Assessment and HealthGrid Applications

Grid Empowered Sharing of Medical Expertise
Martin Kuba, Ondřej Krajíček, Petr Lesný, Jan Vejvalka and Tomáš Holeček  273

Mobile Peer-to-Grid Architecture for Paramedical Emergency Operations
Andrew Harrison, Ian Kelley, Emil Mieilica, Adina Riposan and Ian Taylor  283

Virtual Hospital and Digital Medicine – Why Is the GRID Needed?
Georgi Graschew, Theo A. Roelofs, Stefan Rakowsky, Peter M. Schlag, Paul Heinzlreiter, Dieter Kranzlmüller and Jens Volkert  295

Final Results and Exploitation Plans for MammoGrid
Chiara del Frate, Jose Galvez, Tamas Hauer, David Manset, Richard McClatchey, Mohammed Odeh, Dmitry Rogulin, Tony Solomonides and Ruth Warren  305

Part VI. Posters and Short Contributions

Proposing a Roadmap for HealthGrids
Vincent Breton, Ignacio Blanquer, Vicente Hernández, Yannick Legré and Tony Solomonides  319

Remote Radiotherapy Planning: The eIMRT Project
Andrés Gómez, Carlos Fernández Sánchez, José Carlos Mouriño Gallego, Francisco J. González Castaño, Daniel Rodríguez-Silva, Javier Pena García, Faustino Gómez Rodríguez, Diego González Castaño and Miguel Pombar Cameán  330
Designing for e-Health: Recurring Scenarios in Developing Grid-Based Medical Imaging Systems
John Geddes, Clare Mackay, Sharon Lloyd, Andrew Simpson, David Power, Douglas Russell, Marina Jirotka, Mila Katzarova, Martin Rossor, Nick Fox, Jonathon Fletcher, Derek Hill, Kate McLeish, Yu Chen, Joseph V. Hajnal, Stephen Lawrie, Dominic Job, Andrew McIntosh, Joanna Wardlaw, Peter Sandercock, Jeb Palmer, Dave Perry, Rob Procter, Jenny Ure, Mark Hartswood, Roger Slack, Alex Voss, Kate Ho, Philip Bath, Wim Clarke and Graham Watson  336

Design and Implementation of Security in a Data Collection System for Epidemiology
John Ainsworth, Robert Harper, Ismael Juma and Iain Buchan  348

Architecture of Authorization Mechanism for Medical Data Sharing on the Grid
Takahito Tashiro, Susumu Date, Shingo Takeda, Ichiro Hasegawa and Shinji Shimojo  358

Database Integration for Predisposition Genes Discovery
François-Marie Colonna, Yacine Sam and Omar Boucelma  368

High Performance GRID Based Implementation for Genomics and Protein Analysis
L. Milanesi and I. Merelli  374

TRENCADIS – A WSRF Grid MiddleWare for Managing DICOM Structured Reporting Objects
Ignacio Blanquer, Vicente Hernández and Damià Segrelles  381

GATE Simulation for Medical Physics with Genius Web Portal
C.O. Thiam, L. Maigne, V. Breton, D. Donnarieix, R. Barbera and A. Falzone  392

Biomedical Applications in EELA
Miguel Cardenas, Vicente Hernández, Rafael Mayo, Ignacio Blanquer, Javier Perez-Griffo, Raul Isea, Luis Nuñez, Henry Ricardo Mora and Manuel Fernández  397

Outlook for Grid Service Technologies Within the @neurIST eHealth Environment
A. Arbona, S. Benkner, J. Fingberg, A.F. Frangi, M. Hofmann, D.R. Hose, G. Lonsdale, D. Ruefenacht and M. Viceconti  401

Author Index  405
Part I. Medical Imaging on the Grid
Federating Distributed and Heterogeneous Information Sources in Neuroimaging: The NeuroBase Project

C. Barillot a, H. Benali b, M. Dojat c, A. Gaignard a, B. Gibaud a, S. Kinkingnéhun b, J.-P. Matsumoto d, M. Pélégrini-Issac b, E. Simon d and L. Temal a

a Visages U746, INSERM-INRIA-CNRS-Univ. Rennes 1, IRISA, Rennes, France
b IFR 49, CHR La Pitié Salpêtrière / CEA-SHFJ, Paris, Orsay, France
c Unité INSERM U594, Grenoble, France
d Business Objects/Médience, Levallois-Perret, France
Abstract. The NeuroBase project aims at studying the requirements for federating, through the Internet, information sources in neuroimaging. These sources are distributed across different experimental sites, hospitals or research centers in cognitive neurosciences, and contain heterogeneous data and image processing programs. More precisely, this project consists in the creation of a shared ontology, suitable for supporting various neuroimaging applications, and of a computer architecture for accessing and sharing relevant distributed information. We briefly describe the semantic model and report in more detail on the architecture we chose, based on a mediator/wrapper approach. To give a flavor of the future deployment of our architecture, we describe a demonstrator that implements the comparison of distributed image processing tools applied to distributed neuroimaging data.

Keywords. Medical image databases, mediation systems, mediator/wrappers, neuroimaging, Semantic Web, medical ontology
1. Introduction

One objective of neuroscientists is the construction of functional cerebral maps under normal and pathological conditions. Research is currently performed to find correlations between anatomical structures, essentially sulci and gyri, where neuronal activation takes place, and cerebral functions, as assessed by recordings obtained by means of various neuroimaging modalities, such as PET (Positron Emission Tomography), fMRI (functional Magnetic Resonance Imaging), EEG (ElectroEncephaloGraphy) and MEG (MagnetoEncephaloGraphy). The formation of such correlation maps requires the development of sophisticated image processing techniques, such as segmentation and modeling of anatomical structures, registration and multi-modality fusion, and specific methods for longitudinal data analysis.

Two of the major concerns of researchers and clinicians involved in neuroimaging experiments are, on the one hand, to manage internally the huge quantity of produced data (around 1 Gb per subject) and, on the other hand, to be able to confront their experience and the programs they develop with those existing in other centers or, moreover, with those described in publications. Furthermore, and this is particularly true for medium-size centers (with limited staff capabilities), or even small ones (as is mostly the case in
clinical centers), researchers or clinicians have great difficulty setting up large-scale experiments, mainly due to the lack of manpower and of capacity for recruiting subjects. Besides, the statistical validity of the results is sometimes insufficient (the rate of "false negatives" is probably not negligible). For all these reasons, we believe that pooling experimental results, through a network of collaborating centers, will widen the scientific achievement of the conducted experimental studies. Through distributed neuroimaging databases, the search for similar results, the search for images containing singularities or transverse searches via data mining techniques could highlight possible regularities. Moreover, this will also broaden the panel of people involved in neuroimaging studies, while protecting the excellence of the supplied work.

In this context, NeuroBase is a cooperative project aimed at establishing the conditions allowing, through the Internet, the federation of distributed information sources in neuroimaging, these sources being located in various centers of experimentation, clinical departments in neurology, or research centers in cognitive neurosciences. This requires that users can diffuse, exchange or reach neuroimaging information with appropriate access means, in order to retrieve information almost as easily as if it were stored locally.

1.1. Background

Due to the explosion of data generated by the neurosciences community, the imperative need for innovative techniques for data and knowledge sharing and reuse appeared early in the 90's [1,2]. This led to the start of the ambitious North American "Human Brain Mapping" project. An objective recently added to this project is the development of data analysis and data processing software to operate on various data repository systems for data mining and knowledge discovery purposes. In parallel, the development of web applications has stimulated the interest of researchers in distributed databases and information sharing. Four research topics are particularly relevant for our project:

1. Digital and probabilistic atlases of the brain. To gather and share neuroimaging information in a common referential space, various research efforts have addressed the construction of digital atlases: based on the labeling of post-mortem brains to quantify the individual anatomical variability of cortical regions [3], for the anatomy and brain functions of rats [4] or of the primate visual system [5], or to associate symbolic data and graphical data about the nervous system [6]. Some atlases are developed to support the interpretation of functional data [7], image processing instantiation in a specific context [8] or training [9]. For probabilistic atlases, some 300 MRI brain scans plus post-mortem data of 30 subjects have been mixed in a common referential by the International Consortium for Brain Mapping [10]. Several image processing tools have been added to allow segmentation and mapping of brain images to this brain reference.

2. Conception of image processing tools. The BRAID (BRAin Image Database, http://braid.uphs.upenn.edu/websbia/braid/) project at Hopkins University is relevant here. It explores the anatomy-function relationship based on activation-response experiments and deficit-lesion analysis. The proposed system integrates mechanisms for complex queries, combining selection with multiple criteria, image quantification, and statistical tests to calculate correlations between deficits and lesions. Group studies rely on matching all brains to a target (reference) by means of linear or non-linear (deformable elastic model) matching methods, each with its own pros and cons. Several participants of this project have well-known experience in the conception of such robust image processing tools.

3. Multi-center databases. Several laboratories belonging to the University of Illinois participate in the constitution of a commonly shared database devoted to recordings of neuronal patterns. This work, oriented to animal recordings, is close to our project. The database is used, for instance, to find temporal series specific to neuron populations under various stimulus conditions. A common data model has been developed to organize the experimental data. An atlas is available to enter, search and analyze heterogeneous data in a common referential. Ontology sharing and data schema updating facilities have also been explored in the context of cooperative federated databases [11].

4. Infrastructures for sharing data and processing tools. Several projects such as IXI [12] or Mammogrid [13] explore how grid technology can be applied to the field of medical image analysis by using large collections of computer resources to facilitate and scale processing across sites. The architectures proposed allow image processing algorithms to be exposed as Grid services, with the ability to compose these services as complex workflows executed across distributed resources. The notion of pipelines for the sequencing of image processing algorithms is also present in the LONI [14] or BrainVISA [15] frameworks.
2. The NeuroBase Approach

Instead of gathering all data in a central database [16], NeuroBase promotes a federated system for the management of distributed and heterogeneous sources of information. The goal of the system is to allow the sharing of two types of information: on the one hand, neuroimaging data, typically results from neuroimaging experiments; on the other hand, data processing programs, typically image processing programs or statistical tools, to be applied to the data available in the distributed system. Data can be stored in relational databases or just in local files (wrappers will find their own way to the information). Image processing programs are modeled by the use of dataflows. A dataflow specifies the inputs and the parameters required to complete a given processing method, and the outputs of this procedure.

One of the most important aspects of this project is to identify the main concepts shared by the different information centers in order to define a common semantic model every site can subscribe to (see Figure 1). From this baseline, each site participating in the federated system can map its own concepts, data, image processing programs and ontology to this semantic referential [11]. For this purpose, we rely on a mediator/wrapper approach [17], in which the integration of both (i) anatomical and functional images and related data (e.g. experimental protocols, subjects, pathology) and (ii) image processing programs that can be applied to the images (e.g. segmentation, registration, statistical analysis, ...) can be expressed.
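To make the dataflow notion concrete, a minimal sketch follows (in Python, with names of our own choosing; this is not a format defined by NeuroBase). It describes a processing program by its declared inputs, tunable parameters and produced outputs; the fsl/Bet tool and the Dataset concept reappear later in the paper:

    from dataclasses import dataclass, field

    @dataclass
    class DataFlow:
        """A processing program modeled as a dataflow: declared inputs,
        tunable parameters and produced outputs (names are illustrative)."""
        name: str
        inputs: dict = field(default_factory=dict)      # input name -> ontology concept
        parameters: dict = field(default_factory=dict)  # parameter name -> default value
        outputs: dict = field(default_factory=dict)     # output name -> ontology concept

    # A brain extraction tool described as a dataflow
    bet = DataFlow(
        name="fsl/Bet",
        inputs={"anatomical": "Dataset"},
        parameters={"fractional_threshold": 0.5},
        outputs={"brain": "Dataset", "mask": "Dataset"},
    )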
Figure 1: The NeuroBase system architecture for managing distributed information sources in neuroimaging. Mediation services are used to map and retrieve local information stored in heterogeneous and distributed databases following user queries expressed using concepts from the shared ontological model.
2.1. The Mediator/Wrapper Approach

Mediators are systems for the mediation of information, introduced to allow the virtual integration of heterogeneous distributed information sources in cooperative federated database systems. Mediators differ from standard database management systems in several respects. Firstly, they do not supply mechanisms for updating the sources of information: they only support queries to information sources, in order to preserve their autonomy and the fact that they are locally managed. Secondly, to reinforce interoperability and to be highly adaptive to the data structures encountered in databases, mediators support various data models, from standard structured data, such as relational, object or multi-dimensional models, to semi-structured models, such as XML. The architecture of mediators is also different, based on a "mediator/wrapper" concept [17], in which a mediator offers a central view of all sources of information while the associated wrappers, dedicated to each source, hide their heterogeneity. Using the corresponding wrappers, a mediator redefines the user query into source-dependent queries, then recomposes the various responses and formats the final response to the user. The redefinition of the query into sub-queries is optimized by means of a cost-based model to obtain the most efficient execution plan. This architecture clearly specifies the respective roles of the mediator, which processes the user queries, and of the wrappers, which translate the sub-queries into the relevant format for the associated source of information. The pragmatic interest of such an architecture is that the work linked to the introduction of a new source of information is reduced to the creation of the corresponding wrapper. Several mediators have
already been developed (for instance DISCO [18] and Mocha [19]). Since 1998, one of the project's participants has been developing a new generation of mediators, called Le Select (http://www-caravel.inria.fr/~leselect/), which allows one to share distributed, heterogeneous and autonomous data and programs via a high-level query language. Le Select is the cornerstone of the NeuroBase approach.
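A minimal sketch of the mediator/wrapper pattern described above (all class and method names are ours and do not reflect Le Select's actual interfaces): each wrapper translates sub-queries into its source's native format, while the mediator decomposes the user query and recomposes the answers.

    class Wrapper:
        """Hides the heterogeneity of one information source."""
        def execute(self, sub_query):
            raise NotImplementedError

    class FileWrapper(Wrapper):
        """Source stored as plain files in a directory hierarchy."""
        def execute(self, sub_query):
            native = f"scan file hierarchy for: {sub_query}"    # source-specific translation
            return [{"source": "C1", "plan": native}]

    class SqlWrapper(Wrapper):
        """Source stored in a relational database."""
        def execute(self, sub_query):
            native = f"SELECT * FROM images WHERE {sub_query}"  # source-specific translation
            return [{"source": "C2", "plan": native}]

    class Mediator:
        """Central view: splits the query, queries each wrapper, recomposes answers."""
        def __init__(self, wrappers):
            self.wrappers = wrappers

        def query(self, user_query):
            answers = []
            for wrapper in self.wrappers:   # a real mediator optimises an execution plan here
                answers.extend(wrapper.execute(user_query))
            return answers                  # recomposed and formatted for the user

    print(Mediator([FileWrapper(), SqlWrapper()]).query("modality = 'MRI'"))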
Figure 2: Excerpt of the NeuroBase ontology. Some concepts that appear in the text are shown in italic.
2.2. The Semantic Model

This semantic model, or ontology, has to be defined by a collaborative community, which requires quite a lot of work since there exists no fully defined common ontology from which we could derive our semantic referential. We have to build it in a domain which is complex and not well defined. Some existing works can provide valuable input, such as the fMRI Data Center ontology (http://www.fmridc.org/f/fmridc/aboutus/index.html) and medical thesauri such as the "NeuroNames" terminology [20]. Part of our efforts was devoted to the design of this ontology (see Figure 2). Briefly, the ontology is made of concepts that represent the relevant entities and their associated properties, and supplies the search criteria likely to support user queries, such as a Subject or a GroupOfSubjects with or without a specific PersistentPathologyAssessment involved in a Study. Corresponding Datasets of Anatomical and Functional images are described, with their AcquisitionProtocol, DataProcessing methods and InterpretationOfDatasetComponent (e.g. labels corresponding to anatomical entities, meshes, probabilistic information, ...). Concepts have been introduced to cover at least the specific applications addressed by the NeuroBase contributors (epilepsy, visual cortex exploration and Alzheimer's disease).
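As a rough illustration (the attributes below are our guesses; Figure 2 shows the actual excerpt of the ontology), the concepts cited above could be rendered as typed records:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class PersistentPathologyAssessment:
        label: str                                   # e.g. "epilepsy"

    @dataclass
    class Subject:
        identifier: str
        pathology: Optional[PersistentPathologyAssessment] = None

    @dataclass
    class AcquisitionProtocol:
        modality: str                                # e.g. "fMRI", "PET"

    @dataclass
    class Dataset:
        kind: str                                    # "Anatomical" or "Functional"
        protocol: AcquisitionProtocol
        interpretation: List[str] = field(default_factory=list)  # labels, mesh, ...

    @dataclass
    class Study:
        subjects: List[Subject] = field(default_factory=list)
        datasets: List[Dataset] = field(default_factory=list)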
3. The NeuroBase Demonstrator

In order to evaluate our architecture, we have recently built a demonstrator based on existing modules like Le Select, BrainVISA/Anatomist (http://brainvisa.info), MRIcro (http://www.psychology.nottingham.ac.uk/staff/cr1/mricro.html), FSL (http://www.fmrib.ox.ac.uk/fsl/) and Vistal (http://www.irisa.fr/visages/software-fra.html). This can be extended to modules largely used in the neurosciences community, such as the SPM software (http://www.fil.ion.ucl.ac.uk/spm/).

3.1. The Test-Bed Application

The purpose of the test-bed application is to demonstrate that the NeuroBase architecture can support, via the Internet, the testing and comparison of image processing modules, in order for instance to select the most robust one. These modules are distributed over several centers and applied to distributed data. Presently, two test centers are involved. A center, C1, located in Grenoble (FR), has developed an image processing chain for the delineation of visual cortical areas, including cortex segmentation and unfolding. Image data are acquired on a 3T Bruker scanner in the context of cognitive experiments for visual cortex exploration. They are stored using the Analyze format. A second center, C2, located in Rennes (FR), has developed image processing tools for restoration (denoising and debiasing) and segmentation. Data are mainly acquired in the context of epilepsy on a 1.5T GE scanner and stored using the GIS format.

Figure 3 illustrates the application. First, C2 queries for an anatomical image available at C1, which is locally restored (anisotropic filtering) – i.e. at C2 – after the required format transformation. Then, C2 launches a specific tool for brain extraction. The Bet/FSL algorithm is executed at C1 on the input (the restored image) and provides the corresponding outputs: a brain image and a brain mask (binary image). After format conversion, C2 fires the tissue segmentation locally. C2 then launches a similar image processing tool available at C1 (MA_segmentation). Execution is performed at C1. The two segmented images are then compared at C2 using the required tool (difference). Results are displayed at C2 or at C1 with the local 3D viewer. The same dataflow can be executed on data either from C1 (as in the example) or from C2. The final user does not need to know where the data are stored and where the methods are executed.
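The sequence of operations can be summarised by the following sketch (our own naming; in the demonstrator these steps are driven through Le Select queries rather than a Python API):

    def convert(image, fmt):
        """Stub for the Analyze <-> GIS format transformations mentioned above."""
        return image

    def compare_segmentations(grid):
        # C2 queries an anatomical image published at C1
        image = grid.fetch("anatomical image", from_site="C1")

        # restoration (anisotropic filtering) runs locally at C2
        restored = grid.run("restore", convert(image, "GIS"), site="C2")

        # brain extraction (Bet/FSL) is executed at C1 and returns brain + mask
        brain, mask = grid.run("fsl/Bet", restored, site="C1")

        # the two segmentations: Vistal locally at C2, MA_segmentation at C1
        seg_c2 = grid.run("Vistal/segmentation", convert(brain, "GIS"), site="C2")
        seg_c1 = grid.run("MA_segmentation", brain, site="C1")

        # no synchronization: the difference is computed at C2 once both are available
        return grid.run("difference", (seg_c2, seg_c1), site="C2")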
Figure 3: The test-bed application: two research centers, C1 and C2, physically separated, share data and processing tools via the Internet in order to compare two segmentation methods (Vistal in C2 and MA_segmentation in C1) on an anatomical image previously restored at C2. The segmentation processes are executed separately at each center. No synchronization process is implemented: the difference is calculated when the data are available. Concepts present in our ontology that correspond to inputs and outputs of our image processing tools are shown in italic.
3.2. The Architecture

The overall architecture is shown in Figure 4. The Le Select middleware is installed at each center. It is a generic server that includes data and image processing wrappers. Wrappers are site specific. Shared image processing tools are executed based on each local software library environment. A local 2D/3D viewer (here Anatomist and MRIcro) can be used. All distributed queries are performed via a common application developed in a Tomcat servlet server environment, accessible through a standard web browser.

3.3. The Working Principles

Shared anatomical images are stored in local data repositories (for C1 in a local file hierarchy, and for C2 in a PostgreSQL database). To make the various queries using the services available on our distributed system, wrappers were designed to map the local data organization in C1 and C2 to the semantic referential. Figure 5 highlights the main mappings between the local file hierarchy in C1 and some concepts.
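As an illustration of such a data wrapper (the directory layout below is invented; the Vol_Bin1 and Vol_Bin2 column names reappear in the query example that follows), a wrapper for C1 could walk the local file hierarchy and publish each image as a row of a shared AllDataset relation:

    import os

    def publish_file_hierarchy(root):
        """Sketch: maps a local file hierarchy onto rows of the AllDataset relation."""
        rows = []
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith(".img"):        # Analyze format: .img data + .hdr header
                    continue
                subject = os.path.relpath(dirpath, root).split(os.sep)[0]
                rows.append({
                    "ID": os.path.join(os.path.relpath(dirpath, root), name),
                    "Subject": subject,              # concept mapped from the hierarchy
                    "Vol_Bin1": os.path.join(dirpath, name),                # image data
                    "Vol_Bin2": os.path.join(dirpath, name[:-4] + ".hdr"),  # header
                })
        return rows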
Similarly, wrappers were introduced to execute programs on the published data. As the relational format is used by Le Select, program wrappers take relational data as input and produce relational data as output. In the following example, the skull stripping program is executed on images referred to as Dataset in the ontology. This command is executed by Tomcat (annotations after "←" are comments):

    job execute //$host/fsl/Bet                        ← execution of the Bet program; host is set to the C1 hostname
    input a is
      "select Vol_Bin1 as img, Vol_Bin2 as hdr " +     ← SQL query to retrieve the input files
      "from /ontology/AllDataset " +                   ← in the AllDataset table of the ontology
      "where ID = '$DatasetID'"                        ← identifier for all Dataset entities
Figure 4: (Top) The NeuroBase demonstrator architecture deployment between two distant centers C1 and C2 (WD: data wrapper; WP: image processing program wrapper). (Bottom) The current network implementation of the system between four different partners (C1 to C4), each behind its own firewall and reached through secure ports over the Internet, with clients connecting to the Tomcat/Apache web application (port 8080) through https with a password; this underlines the generality of the proposed approach.
Figure 5: Mapping the BALC concept hierarchy, i.e. the local database in center C1, to the semantic referential.
4. Discussion

Our preliminary demonstrator shows that the principles and the technology we propose can be used in the context of neuroimaging data. Clearly, it should be extended. Presently, only a small part of the ontology is mapped to local databases. The call to processing tools is hard-coded: the selection of inputs and the tuning of parameters are still limited. The application developed in the Tomcat environment should be extended to allow the selection, through a standard web browser, of the available processing tools. Finally, the outputs of the image processing tools available at each site are not yet reintroduced into the corresponding file hierarchy or PostgreSQL database. Such extensions are under development.

Neuroimaging is a relatively new scientific discipline in rapid evolution. Many concepts currently in use did not exist a few years ago. In this moving area, where no consensus has been reached on several concepts, the definition of a centralized database for sharing data and processing methods seems rather complex and, if successful, requires strong manpower for its maintenance. Moreover, there is a legitimate desire for autonomy that contradicts the centralized approach. In fact, information sources exist in different centers but have generally been set up for purely local needs and are accessible to only a very small user community. In this context, the NeuroBase architecture we propose, based on a mediator/wrapper approach, seems attractive. It can be used to manage the evolution, or even the arrival, of new information sources by just updating wrappers or creating new ones (this roughly amounts to changing or adding views of the semantic referential). In our approach, the semantic referential is central. It should be flexible enough to accept the introduction of new concepts while remaining consistent. The AI community, from knowledge engineering to the semantic grid, has developed strong expertise in this field via the construction of controlled vocabularies and thesauri, which will provide valuable hints. The extensive use and evolution of our demonstrator will allow us to confront it with different real situations.

5. References

[1] Mazziotta, J.C., Toga, A.W., Evans, A.C., Fox, P. and Lancaster, J.L.: A Probabilistic Atlas of the Human Brain: Theory and Rationale for its Development. NeuroImage 2 (1995) 89-101.
[2] Roland, P.E., Zilles, K.: Brain atlases – a new research tool. Trends in Neurosciences 17 (1994) 458-467.
[3] Graf von Keyserlingk, D., Niemann, K. and Wasel, J.: A quantitative approach to spatial variation of human cerebral sulci. Acta Anatomica 131 (1988) 127-131.
[4] Toga, W.: A Three-Dimensional Atlas of Structure/Function Relationships. Journal of Chemical Neuroanatomy 4 (1991) 313-318.
[5] Van Essen, D.C., Drury, H.A., Joshi, S. and Miller, M.I.: Functional and structural mapping of human cerebral cortex: solutions are in the surfaces. Proc Natl Acad Sci U S A 95 (1998) 788-795.
[6] Bloom, F.E.: The multidimensional database and neuroinformatics requirements for molecular and cellular neuroscience. NeuroImage 4 (1996) S12.
[7] Seitz, R.J., Bohm, C., Greitz, T., Roland, P.E. and Erikson, L.: Accuracy and precision of the computerized brain atlas programme for the localization and quantification in positron emission tomography. J. Cereb. Blood Flow Metab. 10 (1990) 443-457.
[8] Lehmann, E.D., Hawkes, D.J., Hill, D.L., Bird, C.F., Robinson, G.P., Colchester, A.C. and Maisey, M.N.: Computer-aided interpretation of SPECT images of the brain using an MRI-derived 3D neuro-anatomical atlas. Medical Informatics 16 (1991) 151-166.
[9] Höhne, K.H., Bomans, M., Riemer, M., Schubert, R., Tiede, U. and Lierse, W.: A volume-based anatomical atlas. IEEE Comp. Graphics and Applications 12 (1992) 72-78.
[10] Tiede, U., Schiemann, T. and Höhne, K.H.: Visualizing the Visible Human. IEEE Computer Graphics and Applications 16 (1996) 7-9.
[11] Kahng, J. and McLeod, D.: Dynamic Classificational Ontologies: Mediation of Information Sharing in Cooperative Federated Database Systems. In: Papazoglou, M.P., Schlageter, G. (eds): Cooperative Information Systems: Trends and Directions. Academic Press, London, U.K. (1998) 179-203.
[12] Rowland, A., Burns, M., Hartkens, T., Hajnal, J., et al.: Information eXtraction from Images (IXI): Image Processing Workflows Using a Grid-Enabled Image Database. In: Dojat, M., Gibaud, B. (eds): Proceedings of the DiDaMIC'04 Workshop, MICCAI conference (St Malo, 26-29 Sept. 2004), Rennes (2004) 55-64.
[13] Brady, M.: Grid-based Federated Databases of Mammograms: Mammogrid and eDiamond Experiences. In: Dojat, M., Gibaud, B. (eds): Proceedings of the DiDaMIC'04 Workshop, MICCAI conference (St Malo, 26-29 Sept. 2004), Rennes (2004) 84.
[14] Rex, D.E., Ma, J.Q. and Toga, A.W.: The LONI Pipeline Processing Environment. NeuroImage 19 (2003) 1033-1048.
[15] Cointepas, Y., Mangin, J., Garnero, L., Poline, J., et al.: BrainVISA: Software Platform for Visualization and Analysis of Multi-modality Brain Data. In: Proceedings of the 7th Human Brain Mapping Conference, Brighton (UK) (2001) S98.
[16] Koslow, S.H.: Should the neuroscience community make a paradigm shift to sharing primary data? Nature Neuroscience 3 (2000) 863-865.
[17] Wiederhold, G. and Genesereth, M.: The Conceptual Basis for Mediation Services. IEEE Expert 12 (1997) 38-47.
[18] Tomasic, A., et al.: The Distributed Information Search Component (Disco) and the World Wide Web. In: Proc. of the ACM SIGMOD, Tucson, Arizona, May (1997) 546-548.
[19] Rodriguez-Martinez, M., Roussopoulos, N.: MOCHA: A Self-extensible Middleware Substrate for Distributed Data Sources. In: Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Houston, May 2000.
[20] Bowden, D.M., Martin, R.F.: NeuroNames Brain Hierarchy. NeuroImage 2 (1995) 63-83.
Bridging clinical information systems and grid middleware: a Medical Data Manager

Johan Montagnat a, Daniel Jouvenot b, Christophe Pera c, Ákos Frohner d, Peter Kunszt d, Birger Koblitz d, Nuno Santos d and Cal Loomis b

a CNRS, I3S laboratory
b CNRS, LAL laboratory
c CNRS, CREATIS laboratory
d CERN

Abstract. This paper describes the effort to deploy a Medical Data Management service on top of the EGEE grid infrastructure. The most widely accepted medical image standard, DICOM, was developed for clinical practice and is implemented in most medical image acquisition and analysis devices. The EGEE middleware uses the SRM standard for handling grid files. Our prototype exposes an SRM-compliant interface to the grid middleware, transforming SRM requests on the fly into DICOM transactions. The prototype ensures user identification, strict file access control and data protection through the use of the relevant grid services. This Medical Data Manager eases the access to the medical databases needed by many medical data analysis applications deployed today. It offers a high-level data management service, compatible with clinical practice, which encourages the migration of medical applications towards grid infrastructures. A limited-scale testbed has been deployed as a proof of concept of this new service. The service is expected to be put into production with the next EGEE middleware generation.
1. Medical data management in hospitals and grid data management

The medical community routinely uses clinical images and associated medical data for diagnosis, intervention planning and therapy follow-up. Medical imagers are producing an increasing number of digital images for which computerized archiving, processing and analysis are needed [8,12]. Indeed, image networks have become a critical component of daily clinical practice over the years. With their emergence, the need for standardized medical data formats and exchange procedures has grown [2]. For this reason, the Digital Image and COmmunication in Medicine (DICOM) standard [6] was adopted by a large consortium of medical device vendors. Picture Archiving and Communication Systems (PACS) [10], manipulating DICOM images and often other medical data in proprietary formats, are proposed by medical device vendors for managing clinical data. PACS are often proprietary, weakly standardized solutions. PACS may be more or less connected to the Hospital Information System (HIS), holding administrative information
about patients, and to Radiological Information Systems (RIS), holding additional information for the radiological departments.

The DICOM standard, PACS, RIS and HIS have been developed with clinical needs in mind. They ease the daily care of patients and medical administrative procedures. However, their usage in other areas is very limited. The interface with computing infrastructures, for instance, is almost completely lacking. In addition, current PACS hardly address medical data management needs beyond clinical centers' administrative boundaries, while patient medical folders are often spread over the many medical sites that have been involved in the patient's healthcare. Many medical image acquisition devices also conform only weakly to the DICOM standard, thus hardly hiding the heterogeneity of these systems.

In the last decades, with the growing availability of digital medical data, many medical data processing and analysis algorithms were developed, enabling computerized medical applications for the benefit of patients and healthcare practitioners. Although sharing the same data sources, the medical image analysis community has different requirements for medical systems than the healthcare community. Many algorithms are developed for processing and producing image files. A common procedure for accessing all medical data sources is needed.

Given the enormous amount of medical data produced inside hospitals and the cost of medical data computing (especially image analysis algorithms), grids have proved to be very useful infrastructures for a large variety of medical applications [11]. Grids provide computing resources and workload systems that ease application code deployment and usage. Moreover, grids provide distributed data management services that are well suited for handling medical data geographically spread throughout various medical centers [5,7,4,9,3]. However, existing grid middlewares often only deal with data files and do not provide higher-level services for manipulating medical data. Medical data often have to be manually transferred and transformed from hospital sources to grid storage before being processed and analyzed. Such manual interventions are tedious and often limit the systematic use of grid infrastructures. In some cases, they may even prevent the use of grids, e.g. when the amount of data to transfer is too large. As a consequence, the first key to the success of the systematic deployment of medical image processing algorithms is to provide a data manager that:

• provides access to medical data sources for computing without interfering with clinical practice;
• ensures transparency, so that accessing medical data does not require any specific user intervention;
• ensures a high level of data protection to respect patients' privacy.

The Medical Data Manager (MDM) service described in this paper was designed to fulfill these constraints. It was developed with the support of the EGEE (Enabling Grids for E-sciencE, http://www.eu-egee.org) European IST project. The remainder of this paper describes the technical requirements to be addressed for such a service and details the service design.
2. Clinical usage of medical data

The DICOM standard introduced earlier encompasses, among other things, an image format and an image communication protocol. A DICOM image usually contains one slice (a 2D image) acquired using any medical imaging modality (MRI, CT-scan, PET, SPECT, ultrasound, X-ray... [1]). A DICOM image may contain a multi-slice data set, but this is rarely encountered. A DICOM image contains both the image data itself and a set of additional information (or metadata) related to the image, the patient, the acquisition parameters and the radiology department. DICOM metadata are stored in fields. Each field is identified by a unique tag defined in the DICOM standard. A given field may be present or absent depending on the imager that produced the image. The standard is open, and image device manufacturers tend to use their own fields for various pieces of information. A couple of fields (such as image size) are mandatory, but experience has proved that surprises should be expected when analyzing a DICOM image. The image itself is usually stored as raw data. Most imaging devices produce one intensity value per image pixel, coded in a 12-bit format. Other formats may be encountered, such as 16-bit data or lossless JPEG.

2.1. DICOM protocol, storage, and security

Most (reasonably modern) medical image acquisition devices are DICOM clients. DICOM servers are computers with on-disk and/or tape back-ends able to store and retrieve DICOM images. The DICOM protocol defines the communication between DICOM servers and clients. There is no standardization of DICOM storage: DICOM servers implement their own data storage policy, and one should not see DICOM data sets as a set of files. As stated above, a single DICOM image usually contains only one image slice. In practice, during a medical examination (a DICOM study), a radiologist acquires several 2D and 3D images, representing up to hundreds or thousands of slices. A study is divided into one or several series, and each series is composed of a set of slices (which can be stacked to assemble a volume when they belong to the same 3D image). Note that there is often no notion of a 3D image encoded in the DICOM format: a series may contain a set of slices composing several 3D images. The way a DICOM server stores these data sets on disk is irrelevant, just as the way a database stores its tables is usually not known to its users: the medical user is never exposed to the DICOM storage and does not need to know whether different files are used for each DICOM slice, series, study, etc. Metadata are included in DICOM image headers, making them difficult to manipulate. A DICOM server will often extract these metadata and store them in a database to ease data searches.

The DICOM security model is rather weak. DICOM files are stored and transported unencrypted. Files contain patient data. The DICOM server security model operates on a per-application basis: all users having access to some DICOM client application can access the information that the server returns to this specific application. DICOM servers use random file names without any connection to the patient information, and a proprietary data storage policy.
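For illustration, the fields of a single slice can be inspected with the pydicom library (a tool used here only as an example, not part of the MDM, which speaks the DICOM network protocol; the file name is hypothetical):

    import pydicom

    ds = pydicom.dcmread("slice.dcm")            # one 2D slice in the usual case

    print(ds.PatientName)                        # field with tag (0010,0010)
    print(ds[0x0008, 0x0060].value)              # Modality, addressed by its raw tag
    print(ds.Rows, ds.Columns, ds.BitsStored)    # pixels are often stored on 12 bits

    pixels = ds.pixel_array                      # raw intensity values as an array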
To cope with these data protection limitations, security is often implemented in hospitals by isolating the image network from the outside world.
2.2. Access to medical images
Each image acquisition device is a potential DICOM-compliant medical image source. In a radiological department, one or several DICOM servers can be set up to centralize the data acquired on a site. Medical data are therefore naturally distributed over the different acquisition sites. In clinical practice, physicians do not access image files directly. They identify data by associated metadata such as patient name, acquisition date, radiologist name, etc. The data are transferred mainly for visualization purposes: the physician quickly scans the slice stack in the DICOM study and focuses on the slices he or she is interested in.
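To make this concrete, the sketch below shows what such a metadata-based query looks like at the DICOM protocol level (a C-FIND request). It uses the pynetdicom library purely for illustration; the host name, port and AE title are hypothetical and are not part of the system described in this paper.

```python
# Sketch: a query by metadata (patient name, date range) rather than by
# file name, as physicians do in practice. pynetdicom is an illustrative
# choice; server address and AE titles are made up.
from pydicom.dataset import Dataset
from pynetdicom import AE
from pynetdicom.sop_class import StudyRootQueryRetrieveInformationModelFind

ae = AE(ae_title="WORKSTATION")
ae.add_requested_context(StudyRootQueryRetrieveInformationModelFind)

query = Dataset()
query.QueryRetrieveLevel = "STUDY"
query.PatientName = "DOE^JOHN"          # identify data by patient name...
query.StudyDate = "20060101-20060301"   # ...and acquisition date range

assoc = ae.associate("dicom-server.example.org", 104)
if assoc.is_established:
    for status, identifier in assoc.send_c_find(
            query, StudyRootQueryRetrieveInformationModelFind):
        if status and status.Status in (0xFF00, 0xFF01):   # pending = match
            print(identifier.StudyInstanceUID)
    assoc.release()
```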
3. Medical image analysis
In the medical image analysis community, the needs are quite different. One often needs to identify images through metadata too, although the searches are not necessarily for nominative data and more often relate to the acquisition type or body region. 3D images are exported to disk files for post-processing and ease of use. Various 3D medical image formats may be used to stack different DICOM slices into a single image volume (the most common being the Analyze file format).
3.1. Enforcing medical data privacy and security
All medical data should be considered sensitive in order to preserve patient privacy. Nominative medical data are of course the most critical, and therefore no binding between nominative data and images should be possible for non-accredited users. In clinical practice, this result is often obtained by isolating the image network. Only physicians participating in a patient's healthcare should have access to that patient's data. On a grid, the distribution of data makes security a very sensitive problem. To ensure patient privacy, the header of all DICOM images sent by a DICOM server should be wiped out, at least partially, to ensure anonymity. All images stored outside the source center should be encrypted, to ensure that non-accredited users cannot read the image content.
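The two protections just described, header blanking and encryption before off-site storage, can be sketched as follows. pydicom and the cryptography package are illustrative choices only, not the tools used by the MDM itself, and the field list is a minimal example.

```python
# Sketch: blank identifying DICOM header fields, then encrypt the file
# before it leaves the source centre (illustration only).
import pydicom
from cryptography.fernet import Fernet

IDENTIFYING_FIELDS = ["PatientName", "PatientID", "PatientBirthDate",
                      "ReferringPhysicianName", "InstitutionName"]

def anonymize(path_in, path_out):
    ds = pydicom.dcmread(path_in)
    for field in IDENTIFYING_FIELDS:        # wipe out nominative header data
        if field in ds:
            delattr(ds, field)
    ds.save_as(path_out)

def encrypt(path, key):                     # encrypt before off-site storage
    with open(path, "rb") as f:
        return Fernet(key).encrypt(f.read())

key = Fernet.generate_key()                 # would live in a key store
anonymize("slice.dcm", "slice_anon.dcm")    # hypothetical file names
blob = encrypt("slice_anon.dcm", key)
```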
4. Medical Data Management Service
4.1. EGEE grid middleware
The EGEE project is currently deploying the LCG2 middleware (the LHC Computing Grid middleware, http://lcg-web.cern.ch) on its production infrastructure. LCG2 is based on GLOBUS2, Condor, and the other services
developed in the European DataGrid project (http://www.edg.org). A new-generation middleware, gLite (http://www.glite.org), is under testing and should be deployed in Spring 2006. Our Medical Data Manager (MDM) service is based on gLite. The gLite middleware provides workload management services for submitting computing tasks to the grid infrastructure, and data management services for managing distributed files. The data management is based on a set of Storage Elements, which are storage resources distributed over the various sites participating in the infrastructure (currently more than 180 sites, distributed all over Europe and beyond). All storage elements expose the same interface for interacting with the other middleware services: the Storage Resource Manager (SRM) interface, which is standardized in the context of the Global Grid Forum (http://www.ggf.org). The SRM handles local data at the file level. It offers an interface to create, fetch, pin, or destroy files, among other things. It does not implement data transfer by itself; additional services such as GridFTP or gLiteIO coexist on storage elements to provide transfer capabilities. In addition to storage resources, the gLite data management system includes a File Catalog (Fireman) offering a unique entry point for files distributed on all grid storage elements. Each file is uniquely identified through a Global Unique IDentifier (GUID). The file catalog contains tables associating each GUID with file locations. For efficiency and fault tolerance reasons, files may be replicated on different sites; thus, each GUID may be associated with several locations. To ease manipulation by users, human-readable Logical File Names (LFN) can be associated with each file (each GUID).
4.2. Medical Data Management service design
The Medical Data Management service architecture is diagrammed in figure 1. On the left is represented a clinical site: the various imagers in a hospital push the images they produce to a DICOM server. Inside the hospital, clinicians can access the DICOM server content through DICOM clients. In the center of figure 1, the MDM internal logic is represented. On the right side, the grid services interfacing with the MDM are shown. All middleware services requiring access to data storage do so through SRM requests sent to storage elements. To remain compatible with the rest of the grid infrastructure, our MDM service is based on an SRM-DICOM interface. The SRM-DICOM core receives SRM requests and transforms them into DICOM transactions addressed to the medical servers. Thus, medical data servers can be shared between clinicians (using the classical DICOM interface inside hospitals) and image analysis scientists (using the SRM-DICOM interface to access the same databases) without interfering with clinical practice. An internal scratch space is used to transform DICOM data into files that are accessible through data transfer services (GridFTP or gLiteIO).
[Figure 1. Overview of the Medical Data Manager. Hospital side: imagers push DICOM data to a DICOM server, which clinicians query through DICOM clients. MDM side: a read-only SRM-DICOM interface with header blanking, encryption, an abstraction layer and a scratch space, backed by a Hydra key store, an AMGA metadata manager and an optional secondary read-write SRM. Grid side: any grid service reaches the data through the MDM client library, the gLiteI/O server and the Fireman file catalog (GUID/LFN/key relations).]
A metadata manager is also used to extract DICOM header information and ease data search. The AMGA service (the ARDA metadata catalog, http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/) [13] is used to ensure secure storage of these very sensitive data. The AMGA server holds a relation between each DICOM slice and the image metadata. This specialized SRM does not provide a classical read/write interface to a storage element. A classical R/W storage element can symmetrically receive grid files to be stored, or deliver archived files to the grid on request. In the MDM, the SRM interface only accepts registration requests coming internally from the hospital. To avoid interfering with the clinical data, external grid files are not permitted to be registered on the MDM storage space: only get requests are authorized from the grid side. If classical grid storage is desired (with write capability), a classical secondary SRM can be installed on the same host. For data encryption needs, a secured encryption key catalog is also used. It is named the Hydra catalog, as it uses a split-key storage strategy to improve security and fault tolerance [15,14]. An abstraction layer, currently being prototyped and tested, is also depicted on the diagram. Its role is to offer a higher-level abstraction for accessing 3D images, by associating all DICOM slices corresponding to a single volume. Indeed, most medical image processing applications do not manipulate 2D images independently but rather consider complete volumes. The abstraction layer associates a single GUID with each volume. On a request for the volume associated with this GUID, all corresponding slices are transferred from the DICOM server and assembled into a single volume in scratch space.
[Figure 2. Triggered actions at image creation. A DICOM push from the imager to the DICOM server triggers the following chain: analyse the DICOM header (build the SURL), register the SURL in the Fireman file catalog, write the metadata to the AMGA metadata server, then generate an encryption key and store it in the Hydra key store.]
4.3. Internal service interaction patterns
To fulfill its role, the MDM service needs to be notified when files are produced by the imagers and stored into the DICOM server. This notification triggers the file registration procedure depicted in figure 2. The DICOM data triggering the operation is first stored into the hospital DICOM server as usual. The DICOM header is then analyzed to extract image-identifying information. This DICOM ID is used to build a Storage URL (SURL), as used by the grid File Catalog to locate files. The SURL is registered into the File Catalog and a GUID is associated with this data on the grid side. The other metadata extracted from the DICOM header are stored into the AMGA metadata server. Finally, the encryption keys that are associated with the file, and that will be used for data retrieval, are stored into the Hydra distributed database. Once DICOM data sets have been registered into the MDM, the server is able to deliver requested data to the grid as depicted in figure 3. A client library is used for this purpose. To cover all application use cases, the MDM client library provides APIs for requesting files based on their grid identifier (GUID) or on the metadata attached to the file. In the case of a metadata request, a database query is first made to the AMGA server and the list of GUIDs of images matching the query is returned. The SRM-DICOM server can then deliver images requested through their GUID. SRM get requests are translated into DICOM get queries. Data extracted from the DICOM server are first written to an internal scratch space. Their format is transformed into a simple 3D image file format (a human-readable header including image size and encoding, followed by the raw image data). In this transformation, the DICOM header, containing patient-identifying information, is discarded to preserve anonymity. The files are also encrypted before being sent out, to ensure that no sensitive information is ever transferred or stored on the grid in a readable format. Files are then transferred through the gLiteIO service and returned to the client in encrypted form. The file is only decrypted in the memory of the client host, provided that the client is authorized to access the file encryption keys.
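The registration sequence of figure 2 can be written down in pseudo-code. The catalog, metadata and keystore objects, their methods, and the SURL naming scheme are hypothetical stand-ins for the Fireman, AMGA and Hydra interfaces, not the actual MDM API.

```python
# Pseudo-code sketch of the registration trigger (figure 2); all service
# clients below are hypothetical stand-ins.
import os
import pydicom

def generate_key():
    return os.urandom(32)        # placeholder for Hydra's split-key scheme

def on_dicom_push(path, catalog, metadata, keystore):
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    # build a Storage URL (SURL) from the slice's DICOM identifiers
    surl = ("srm://mdm.hospital.example.org/dicom/"
            f"{ds.StudyInstanceUID}/{ds.SeriesInstanceUID}/{ds.SOPInstanceUID}")
    guid = catalog.register(surl)                       # file catalog entry
    metadata.insert(guid, {"PatientID": ds.PatientID,   # AMGA metadata
                           "Modality": ds.Modality,
                           "StudyDate": ds.StudyDate})
    keystore.store(guid, generate_key())                # Hydra key storage
    return guid
```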
[Figure 3. Accessing DICOM images. The MDM client library queries the AMGA metadata server and obtains matching GUIDs; a gLiteI/O request on a GUID resolves the SURL through the Fireman file catalog and is prepared by the SRM-DICOM core, which performs a DICOM GET, converts and encrypts the file in scratch space with a key from the Hydra key store, and writes the encrypted file; the client retrieves it through file I/O, gets the key, and decrypts the file.]
4.4. MDM client
On the client side, three levels of interface are available to access and manipulate the data held by the MDM. Since the MDM is seen from the middleware as any storage resource exposing a standard SRM interface, the standard data management client interface can be used to access images, provided that their GUID is known; the files retrieved through this standard interface are encrypted. The second interface is an extra middleware layer which encompasses access to both the encryption key and the SRM; images can thus be fetched and decrypted locally. The third and last level of interface is the fully MDM-aware client library represented in figure 3. It provides access to encrypted files and in-memory decryption of the data on the application side, plus access to the metadata through the AMGA client interface.
5. Discussion
5.1. Data security
The security model of the MDM relies on several services: (i) file access control, (ii) file anonymization, (iii) file encryption, and (iv) secured access to metadata. The user is coherently identified through a single X509 certificate, and all services involved in security use the same identification procedure. File access control is enforced by the gLiteIO service, which accepts Access Control Lists (ACLs) for fine-grained access control. The Hydra key store and the AMGA metadata services also accept ACLs. To read an image's content, a user needs to be authorized to access both the file and the encryption key. The access rights to the sensitive metadata associated with the files are administered independently. Thus, it is possible to grant access to an encrypted file only (e.g. for replicating a file without accessing its content), to the file content (e.g. for processing the
data without revealing the patient identity), or to the full file metadata (e.g. for medical usage). Through ACLs, it is possible to implement complex use cases, granting access rights (for listing, reading, or writing) to patients, physicians, healthcare practitioners, or researchers needing to process medical data, independently from each other.
5.2. Medical metadata schema
A minimal metadata schema is defined in the MDM service for all stored images. It provides basic information on the patient owning the image, the image properties, acquisition parameters, etc. Two main indexes are used: a patient ID for all nominative information associated with patients, and the image GUID for all information associated with images. The patient ID is a unique but irreversible field (such as an MD5 sum of the patient name field). Four main relational tables are used:
• The Patient table, indexed on the patient ID, contains the most sensitive identifying data (patient name, sex, date of birth, etc).
• The Image table, indexed on the image GUID, contains technical information about the image (size, encoding, etc). It establishes a relation with the patient ID.
• The Medical table, indexed on the image GUID, contains additional information on the acquisition (image modality, acquisition place and date, radiologists, etc).
• The DICOM table, indexed on the image GUID, contains the image DICOM identifiers used for querying the DICOM server.
To remain extensible, an additional Protocol table associates image GUIDs with medical protocol names. Through AMGA, the user can create as many medical protocols as needed, containing specific information related to some particular acquisition (e.g. a temporal protocol for cardiac acquisitions). AMGA also enables per-table access control, allowing access to the most sensitive data (e.g. the Patient table) to be restricted to a minimum number of users.
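The irreversible patient ID mentioned above is simply a one-way digest of the patient name field, usable as a join key without revealing the name. A minimal sketch follows; plain MD5 is shown because it is what the schema description mentions, though a keyed or salted hash would resist dictionary attacks better.

```python
# Sketch: an irreversible pseudonymous patient ID (MD5 of the name field).
import hashlib

def patient_id(patient_name: str) -> str:
    return hashlib.md5(patient_name.encode("utf-8")).hexdigest()

pid = patient_id("DOE^JOHN")
# pid indexes the Patient table; the Image table relates image GUIDs to it
print(pid)
```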
6. Testbed
The Medical Data Manager has been deployed on several sites for testing purposes. Three sites currently hold data in three DICOM servers installed at I3S (Sophia Antipolis, France), LAL (Orsay, France) and CREATIS (Lyon, France). In addition to the DICOM servers, these sites have installed the core MDM services: an SRM-DICOM server and its associated database back-end, a gLiteIO service, a GridFTP service, and all dependencies in the gLite middleware. Clients have been deployed on all three sites. To complete the installation, an AMGA catalog has also been set up at CREATIS (Lyon) for holding all sites' metadata, and a Hydra key store is deployed at CERN (Geneva, Switzerland) for keeping file encryption keys.
Given the number of services involved, the installation and configuration procedure is currently complex. It is being streamlined to ease the extension of the testbed, since the MDM service should eventually be deployed in hospitals, where little support is available for the informatics infrastructure. The deployed testbed has been used to demonstrate the viability of the service by registering and retrieving DICOM files across sites. For testing purposes, DICOM data registrations are triggered by hand. Registered files could be retrieved and used from EGEE grid nodes transparently, using the standard EGEE data management interface. The next important milestone will be to experiment with the system in connection with hospitals, by registering real clinical data on the fly as it is acquired by the hospital imagers. This step involves entering a more complex clinical protocol with strong guarantees on data privacy protection. At this point, security cannot be neglected at any level.
7. Conclusion and future work
The Medical Data Manager service presented in this paper is an important milestone for enabling medical image processing applications on a grid infrastructure. Its main strengths are:
• To access medical databases without interfering with clinical practice. Data are kept on clinical sites and transparently transferred to the grid only when needed.
• To expose standard interfaces to other grid services. The MDM is fully integrated in the gLite middleware.
• To ensure a high level of security to preserve patients' privacy.
The MDM prototype was successfully deployed and tested in a controlled computing environment. The next step will see it interfaced to medical imagers inside hospitals. This will require simplifying the installation and configuration procedures as much as possible. The core MDM development is not finished yet, and additional functionalities will be included to enrich the service. In particular, the abstraction layer depicted in figure 1 will soon be available. Applications will then be able to retrieve 3D volume files rather than single slices. In addition, metadata are expected to be distributed over the different clinical sites where data are acquired, rather than centralized as is the case in our testbed. This configuration will be more acceptable to the clinical world, which needs to keep control over hospital data. It will require deploying several AMGA servers on different sites and exposing a centralized query service able to retrieve data from these different servers.
Acknowledgments
We are grateful to the EGEE European IST project for providing resources and support for the development of this service.
References
[1] R. Acharya, R. Wasserman, J. Sevens, and C. Hinojosa. Biomedical Imaging Modalities: a Tutorial. Computerized Medical Imaging and Graphics, 19(1):3–25, 1995.
[2] K.P. Andriole, R.L. Morin, R.L. Arenson, J.A. Carrino, B.J. Erickson, S.C. Horii, D.W. Piraino, B.I. Reiner, J.A. Seibert, and E. Siegel. Addressing the Coming Radiology Crisis: The Society for Computer Applications in Radiology SCAR Transforming the Radiological Interpretation Process (TRIP) initiative. Journal of Digital Imaging, 17(4):235–243, December 2004.
[3] C. Barillot, R. Valabregue, J.P. Matsumoto, F. Aubry, H. Benali, Y. Cointepas, O. Dameron, M. Dojat, E. Duchesnay, B. Gibaud, S. Kinkingnéhun, D. Papadopoulos, M. Pélégrini-Issac, and E. Simon. NeuroBase: Management of Distributed and Heterogeneous Information Sources in Neuroimaging. In Distributed Databases and Processing in Medical Image Computing workshop (DiDaMIC'04), Saint-Malo, France, September 2004.
[4] I. Blanquer Espert, V. Hernández García, and J.D. Segrelles Quilis. Creating Virtual Storages and Searching DICOM Medical Images through a GRID Middleware based in OGSA. Journal of Clinical Monitoring and Computing, 19(4-5):295–305, October 2005.
[5] D. Budgen, M. Turner, I. Kotsiopoulos, F. Zhu, K. Bennett, P. Brereton, J. Keane, P. Layzell, M. Russell, and M. Rigby. Managing healthcare information: the role of the broker. In Healthgrid'05, Oxford, UK, April 2005.
[6] DICOM: Digital Imaging and COmmunications in Medicine. http://medical.nema.org/.
[7] M.H. Ellisman, C. Baru, J.S. Grethe, A. Gupta, M. James, B. Ludaescher, M.E. Martone, P.M. Papadopoulos, S.T. Peltier, A. Rajasekar, S. Santini, and I.N. Zaslavsky. Biomedical Informatics Research Network: An Overview. In Healthgrid'05, Oxford, UK, April 2005.
[8] C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, I.E. Magnin, J. Montagnat, J.-M. Moureaux, A. Osorio, X. Pennec, and R. Texier. Grid-enabling medical image analysis. Journal of Clinical Monitoring and Computing, 19(4-5):339–349, October 2005.
[9] S. Hastings, S. Oster, S. Langella, T.M. Kurc, T. Pan, U.V. Catalyurek, and J.H. Saltz. A Grid-based image archival and analysis system. Journal of the American Medical Informatics Association, 12:286–295, January 2005.
[10] H.K. Huang. PACS: Picture Archiving and Communication Systems in Biomedical Imaging. 1996.
[11] J. Montagnat, F. Bellet, H. Benoit-Cattin, V. Breton, L. Brunie, H. Duque, Y. Legré, I.E. Magnin, L. Maigne, S. Miguet, J.-M. Pierson, L. Seitz, and T. Tweed. Medical images simulation, storage, and processing on the European DataGrid testbed. Journal of Grid Computing, 2(4):387–400, December 2004.
[12] J. Montagnat, V. Breton, and I.E. Magnin. Using grid technologies to face medical image analysis challenges. In Biogrid'03, proceedings of the IEEE CCGrid03, Tokyo, Japan, May 2003.
[13] N. Santos and B. Koblitz. Metadata services on the grid. In Advanced Computing and Analysis Techniques, Berlin, Germany, May 2005.
[14] L. Seitz, J.M. Pierson, and L. Brunie. Key management for encrypted data storage in distributed systems. In IEEE Security in Storage Workshop (SISW), Washington DC, USA, October 2003.
[15] L. Seitz, J.M. Pierson, and L. Brunie. Encrypted storage of medical data on a grid. Methods of Information in Medicine, 44(2), 2005.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Grid Scheduling for Interactive Analysis
Cécile Germain-Renaud a,1, Romain Texier a, Angel Osorio b and Charles Loomis c
a Laboratoire de Recherche en Informatique
b Laboratoire d'informatique et de mécanique pour les sciences de l'ingénieur
c Laboratoire de l'Accélérateur Linéaire
Abstract. Grids are facing the challenge of moving from batch systems to interactive computing. In the 1970s, standalone computer systems met this challenge, and this was the starting point of pervasive computing. Meeting this challenge will allow grids to be the infrastructure for ambient intelligence and ubiquitous computing. This paper shows that EGEE, the world's largest grid, does not yet provide the services required for interactive computing, but that it is amenable to this evolution through relatively modest middleware changes. A case study on medical image analysis exemplifies the particular needs of ultra-short jobs.
Keywords. Medical Image Analysis, Grid Middleware, Scheduling
1 Correspondence to: Cécile Germain-Renaud, LRI, Université Paris-Sud; E-mail: cecile.germain@lri.fr
1. Introduction
In the 1970s, the transition from batch systems to interactive computing was the enabling tool for the widespread diffusion of advances in IC technology. Grids are facing the same challenge. The exponential coefficients in network performance [7] enable the virtualization and pooling of processors and storage. In the field of biomedical applications, widespread diffusion of grid technology might require seamless integration of grid power into everyday use. In the more specific area of medical image processing, algorithms often involve a visual evaluation or exploration of the results. In some cases (e.g. rigid registration of multimodal images of the same patient), algorithms are sufficiently automatic to be executed remotely without interaction, with the results sent to the user for visualization. In other cases, such as inter-subject registration, it may be necessary to use the anatomical knowledge of the user to better define the expected result (anatomical correspondences between cortical areas in the brain are loosely defined). In such a case, the interaction may be limited to an alternation of independent remote computations and user correction requests, but a soft real-time interaction would be much more interesting. A last class of image processing algorithms, such as pre-operative planning, deeply involves the user and requires at least soft real-time to be really useful. However, the need for fast turnaround time on the grid is not limited to medical image analysis, but encompasses all situations involving a display-action loop, ranging from test-and-debug processes and the exploration of databases, to computational steering through virtual/augmented reality interfaces, as well as portal access to grid resources, or complex
and partially local workflows. A critical system requirement is thus the need to move grids from exclusive batch-oriented processing to general-purpose processing, including interactive tasks. Section 2 of this paper provides experimental evidence of the reality of this need, from the analysis of the activity of a segment of the EGEE grid heavily used by its biomedical community, the biomed VO. The next question is then a strategy to support interactive jobs on a grid. Virtual machines provide a powerful new layer of abstraction in distributed computing environments [5,2]. The freedom of scheduling and even migrating an entire OS and associated computations considerably eases the coexistence of deadline-bound short jobs and long-running batch jobs. However, a production grid is a prerequisite for potential biomedical and clinical users. One of the goals of the AGIR project [3] is to interact with production grids in order to define and implement the new grid services required by medical image analysis, with the EGEE grid as an important target. Section 3 of this paper presents some advances towards this goal in the area of grid scheduling. The EGEE execution model is not based on such virtual machines, so the scheduling issues must be addressed through the standard middleware components: the broker and the local schedulers. We demonstrate that QoS and fast turnaround time can be supported by a production grid.
2. EGEE usage
The current use of EGEE makes a strong case for specific support for short jobs. Through the analysis of the Logging and Bookkeeping (LB) log of a broker, we can propose quantitative data to support this claim. The broker logged is grid09.lal.in2p3.fr, running successive versions of LCG; the trace covers one year (October 2004 to October 2005), with 66 distinct users and more than 90000 successful jobs, all production. This trace provides both the job's intrinsic execution time t (evaluated as the timestamp of event 10/LRMS minus the timestamp of event 8/LRMS), and the makespan m, that is, the time from submission to completion (evaluated as the timestamp of event 10/LogMonitor minus the timestamp of event 17/UI). The intrinsic execution time might be overestimated if the sites where the job runs accept concurrent execution. Fig. 1 shows the histogram of intrinsic execution times. The striking fact is the very large number of extremely short jobs. We call Short Deadline Jobs (SDJ) those where t < 10 minutes, and Medium Jobs (MJ) those with t between ten minutes and one hour. SDJ consume nearly 20% of the total execution time, in the same range as MJ (17%). Fig. 2 plots the overhead ratio or as a function of the execution time t. The overhead ratio is formally defined as (m − t)/t. Its interpretation is the ratio of the overhead o = m − t (the time spent "in the system") to the actual execution time t. The components of the overhead are twofold. Queuing time, which depends on the jobs submitted by other grid users: this is a scheduling policy issue. Middleware penalty: the various delays incurred along a job's lifecycle because of the job management system, i.e. the cost of traversing the middleware protocol stack. Here, the issue is the efficiency of the middleware implementation.
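The two quantities extracted from the broker log can be computed directly from the event timestamps, as sketched below. The event names follow the text; the timestamp values in the example are made up.

```python
# Sketch: intrinsic execution time t, makespan m, overhead ratio (m - t)/t.
def job_metrics(ts):
    t = ts["10/LRMS"] - ts["8/LRMS"]           # intrinsic execution time
    m = ts["10/LogMonitor"] - ts["17/UI"]      # submission to completion
    return t, m, (m - t) / t                   # overhead ratio

t, m, ratio = job_metrics({"17/UI": 0, "8/LRMS": 700,
                           "10/LRMS": 820, "10/LogMonitor": 900})
kind = "SDJ" if t < 600 else ("MJ" if t < 3600 else "long")
print(kind, t, m, round(ratio, 2))   # here t=120 s: an SDJ with ratio 6.5
```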
[Figure 1. Distribution of execution times: histogram of the number of jobs (logarithmic scale) against execution time in seconds.]
Figure 2. The overhead ratio as a function of execution time - Execution time in seconds, overhead ratio dimensionless (see text for explanation)
The two components are orthogonal: even with a perfect middleware, if, for instance, jobs were served on a first-come-first-served basis, a job would be queued (and thus have to wait) until all its predecessors have been served (note that the EGEE basic scheduling scheme is more complicated). Thus, limiting the delays created by these two components must be addressed separately, as shown in the next section. However, the first information provided by fig. 2 is that, for SDJ, the overhead is often many orders of magnitude larger than t. This is absolutely dissuasive for grid-enabling SDJ. For MJ, the overhead is of the same order of magnitude as t. Thus, the EGEE service for SDJ is seriously insufficient. One could argue that bundling many SDJ into one MJ could lower the overhead. However, interactivity would not be reached, because the results would also come in a bundle: for graphical interactivity, the results must obviously be pipelined with visualization, and in the test-debug-correct cycle, there might not be very many jobs to run.
With respect to grid management, an interactivity situation translates into a QoS requirement: just as video rendering or music playing requires special scheduling on a personal computer, and video streaming requires differentiated network services, servicing SDJ requires a specific grid guarantee, namely a small bound on the makespan, which is usually known as a deadline in the framework of QoS.
3. Scheduling for interactivity
3.1. A Scheduling Policy for SDJ
Deadline scheduling usually relies on the concept of breaking the allocation of resources into quanta: time slices for a processor, or packet slots for network routing. For job scheduling, the problem is a priori much more difficult, because jobs are not partitionable: except for checkpointable jobs, a job that has started running cannot be suspended and restarted later. Condor [6] has pioneered migration-based environments, which provide such a feature transparently, but deploying constrained suspension in EGEE would be much too invasive with respect to the existing middleware. Thus, SDJ should not be queued at all, which seems incompatible with the most basic mechanism of grid scheduling policies. The EGEE scheduling policy is largely decentralized: all queues are located on the sites, and the actual time scheduling is enacted by the local schedulers. Most often, these schedulers do not allow time-sharing (except for monitoring). The key to servicing SDJ is to allow controlled time-sharing, which transparently extends the kernel's multiplexing to jobs, through a combination of processor virtualization and permanent slot reservation. The SDJ scheduling system has two components.
• A local component, composed of dedicated queues and a configuration of the local scheduler. Technical details for MAUI can be found at [11]. It ensures that:
∗ SDJ execute immediately if resources are available.
∗ The delay incurred by batch jobs has a fixed multiplicative bound.
∗ The policy is work-conserving, implying that resource usage is not degraded, e.g. by idling processors.
∗ The policies governing resource sharing (VOs, EGEE and non-EGEE users, ...) are not impacted.
• A global component, composed of job typing and a mapping policy at the broker level. While it is easy to ensure that SDJ are directed to resources accepting SDJ, LCG and gLite do not provide the means to prevent non-SDJ jobs from using the SDJ queues, and this requires a minor modification of the EGEE Workload Management System.
For the local component, the first question is to prove correctness. Extensive experiments have been conducted on the EGEE cluster at LAL. Fig. 3 (a) shows a case where three kinds of jobs are allowed to run concurrently: batch, SDJ, and dteam. On a dual-processor machine, only two of each kind actually run, which ensures a bounded delay. Fig. 3 (b) gives the overall site view; the intended limitation of SDJ-dedicated resources (10 running jobs maximum) is achieved.
[Figure 3. Local scheduling: (a) on one dual-processor machine, the number of running jobs of each kind (batch, SDJ, dteam) over time against the per-kind limits; (b) on a site, the overall number of running jobs against the site-wide limits.]
For the global component, the long-term technical solution would be a modification of the Glue Schema. This schema is the information model currently used by EGEE, Open Science Grid, and many other grid projects. In this schema, the target of a job is a Computing Element (CE), which so far is mainly a site queue. Thus, a new CE attribute (e.g. QueueAttribute) should be created with the following function: publishing that a queue accepts SDJ, and only SDJ. However, the operational use of the Glue schema as a common ground for interoperability between international grids makes its evolution a long process (even if this need can be expected to be satisfied in the medium term, because the requirement for this category of attribute matches others of the same type, for instance for MPI jobs). Thus, a short-term solution has been set up: on one hand, a boolean attribute (SDJ) is created in the JDL; on the other hand, CEs dedicated to SDJ must have a name suffixed by ".sdj"; the user interface translates the boolean attribute into the JDL requirement RegExp("*sdj$",other.GlueCEUniqueID); finally, the WMS interprets the lack of this requirement as a prescription not to direct a job to the sdj-suffixed CEs. These features will be integrated in gLite 3.2. It must be noted that no explicit user reservation is required: seamless integration also means that explicit advance reservation is no more applicable than it would be for accessing a personal computer or a video-on-demand service. In the most frequent case, SDJ will run under the best-effort Linux scheduling policy (SCHED_OTHER); however, if hard real-time constraints must be met, this scheme is fully compatible with preemption (SCHED_FIFO or SCHED_RR policies). In any case, the limits on resource usage (e.g. as enforced by Maui) implement access control; thus a job might be rejected. The WMS notifies the application of a rejection, and the application can decide on the most adequate reaction, for instance submission as a normal job or switching to local computation.
3.2. User-level scheduling
Considering the grid middleware penalty for submission, scheduling and mapping of jobs, it cannot reasonably be hoped to reach the order of a second, which would be needed for ultra-small jobs such as those considered in the next section. With the most recent and tuned EGEE middleware (gLite 3.0), the middleware penalty remains on the order of minutes. In the gPTM3D project [4], we have shown that an additional layer of user-level scheduling provides a solution which is fully compatible with the EGEE organization of sharing.
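Returning to the short-term convention of section 3.1, the translation performed by the user interface can be sketched as follows. The Requirements expression is the one quoted in the text; the other JDL fields and the function itself are schematic illustrations, not the actual gLite implementation.

```python
# Sketch: turning a user-level SDJ flag into the JDL requirement that
# steers a job towards ".sdj"-suffixed Computing Elements.
def build_jdl(executable, short_deadline_job):
    lines = [f'Executable = "{executable}";']
    if short_deadline_job:
        lines.append('Requirements = RegExp("*sdj$",other.GlueCEUniqueID);')
    # without the requirement, the WMS keeps the job away from sdj-CEs
    return "\n".join(lines)

print(build_jdl("reconstruct.sh", short_deadline_job=True))
```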
Figure 4. gPTM3D architecture
4. gPTM3D
4.1. Interactive Volume Reconstruction
PTM3D [9] is a fully featured DICOM image analyzer developed at LIMSI. PTM3D transfers, archives and visualizes DICOM-encoded data; besides moving independently along the usual three axes, the user is able to view the cross-section of the DICOM image along an arbitrary plane and to move it. PTM3D provides computer-aided generation of three-dimensional representations from CT, MRI, PET-scan, or 3D echography data. A reconstructed volume (organ, tumor) is displayed inside the 3D view. The reconstruction also provides the volume measurement required for therapeutic decisions. The system currently runs on standard PCs and is used online in radiology centres. The clinical motivation for grid-enabled volume reconstruction is described in [4]. The first step in grid-enabling PTM3D (gPTM3D) is to speed up compute-intensive tasks, such as the volume reconstruction of the whole body used in percutaneous nephrolithotomy planning. The volume reconstruction module has been coupled with EGEE with the following results:
• the overall response time is compatible with user requirements (less than 2 minutes), while the sequential time on a 3 GHz, 2 MB memory PC is typically 20 minutes;
• the local interaction scheme (stop, restart, improve the segmentation) remains strictly unmodified.
This first step has implemented fine-grain parallelism and data-flow execution on top of a large-scale, file-oriented grid system. The architecture based on Application-Level Scheduler/Worker agents shown in fig. 4 is fully functional on EGEE. The Interaction Bridge (IB) acts as a proxy between the PTM3D workstation, which is not EGEE-enabled, and the EGEE world. When opening an interactive session, the PTM3D workstation connects to the IB; in turn, the IB launches a scheduler and a set of workers on an EGEE node, through fully standard requests to an EGEE User Interface; a stream is established between the scheduler and the PTM3D front-end through the IB. When
the actual volume reconstruction is required, the scheduler receives contours; the Scheduler/Worker agents follow a pull model, each worker computing one slice of the reconstructed volume at a time and sending it back to the scheduler, which forwards it to the IB, from where it finally reaches the front-end. The next step will be to implement a scheme where the IB and the scheduler cooperate to respectively define and enforce a soft real-time schedule. User-level scheduling has been proposed in many other contexts, and a case for it has been made in the AppLeS [1] project. In a production grid framework, the DIRAC [10] project has proposed a permanent grid overlay where scheduling agents pull work from a central dispatching component. Our work differs from DIRAC in two respects: first, the scheduling and execution agents are launched just like any EGEE job, and are thus subject to all regulations related to sharing: typically, they are SDJ, and will thus be aborted if they exceed the limits of this type of job. Moreover, they work in connected mode, more like glogin-based applications [8]. (A worker-side sketch of this pull model is given at the end of this section.)
4.2. Grid-enabling Image Exploration
In the previous section, the grid was used only as a provider of computing power, while the data were located on the front-end. Sharing data is a well-known need for algorithmic research, but this is true for clinical research as well. We have started the process of extending PTM3D to accept remote data access. The integration of gPTM3D with the Medical Data Management scheme presented in another paper is the final goal. However, at the present time, we consider a more restrictive scheme, which uses the internal format of PTM3D images, where the slices are bundled in a 3D file (bdi and bdg formats). In this context, the main issue is access latency. The ongoing work targets adaptation to the user's activity, mainly through interactive selection of the image resolution and the region of interest.
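As promised above, here is a minimal sketch of the worker side of the pull model of section 4.1: each worker repeatedly asks for one contour, computes the corresponding slice, and streams the result back. The transport here is a plain in-process queue, an assumption made for self-containment; the real agents communicate over an EGEE-launched stream.

```python
# Sketch: the Scheduler/Worker pull model (in-process queues stand in for
# the real scheduler stream).
import queue

def worker(tasks, results, reconstruct):
    while True:
        slice_id, contour = tasks.get()      # pull one slice of work
        if slice_id is None:                 # sentinel: session closed
            break
        results.put((slice_id, reconstruct(contour)))   # send slice back

tasks, results = queue.Queue(), queue.Queue()
for i, c in enumerate(["c0", "c1", "c2"]):
    tasks.put((i, c))
tasks.put((None, None))
worker(tasks, results, reconstruct=lambda c: f"volume-slice({c})")
while not results.empty():
    print(results.get())
```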
5. An architecture for grid interactivity
The scheme described in the previous sections virtualizes the resources at the coarse grain of batch versus short deadline jobs. An open issue is scheduling across SDJ. Consider for instance a portal where many users ask for a continuous stream of SDJ executions. This situation can be modelled with the so-called (period, slice) model used in soft real-time scheduling, where a fraction (slice) of each period of time should be allocated to each user in order to keep them happy. To be coherent with a software architecture based on VOs, the global regulation of SDJ should be left to the implementation of sharing policies (ultimately implemented by site schedulers). However, it is the responsibility of the provider of a particular service to arbitrate between its users. The Interaction Bridge described in the previous section is the adequate location for this arbitration. Figure 5 describes the resulting architecture.
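One plausible arbitration rule for the Interaction Bridge under the (period, slice) model is a standard utilization-based admission test: accept a new user stream only while the summed utilization fits the capacity reserved for SDJ. This is an illustration of the model, not the paper's own algorithm.

```python
# Sketch: (period, slice) admission test for SDJ streams.
def admit(streams, new_stream, reserved_capacity):
    """streams: list of (slice, period) pairs; capacity in processors."""
    usage = sum(s / p for s, p in streams + [new_stream])
    return usage <= reserved_capacity

streams = [(1.0, 5.0), (2.0, 10.0)]       # two users: 0.2 + 0.2 processors
print(admit(streams, (3.0, 10.0), reserved_capacity=0.5))  # 0.7 > 0.5: False
```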
Acknowledgements This work was partially funded by ACI Masses de Données AGIR. We thank Fabrizio Pacini of EGEE JRA1 for his help with the Glue schema specification.
[Figure 5. A two-level scheduling architecture. Interaction Bridges and User Interfaces feed a broker (task prioritization, matchmaking, JSS) that dispatches to node/cluster schedulers and CEs; permanent reservation on virtual processors supports SDJ, transparent when unused.]
References
[1] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS parameter sweep template: user-level middleware for the grid. In Procs. 2000 ACM/IEEE Conference on Supercomputing (CDROM), 2000.
[2] A. Bavier et al. Operating Systems Support for Planetary-Scale Network Services. In Procs. 1st Symp. on Networked System Design and Implementation (NSDI '04), 2004.
[3] C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, J. Montagnat, J.-M. Moureaux, A. Osorio, X. Pennec, and R. Texier. Grid-enabling medical image analysis. Journal of Clinical Monitoring and Computing, 19(4-5):339–349, 2005. Extended version of the BioGrid 2005 paper.
[4] C. Germain, R. Texier, and A. Osorio. Exploration of Medical Images on the Grid. Methods of Information in Medicine, 44(2):227–232, 2005.
[5] B. Lin and P. A. Dinda. VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, 2005.
[6] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor: A hunter of idle workstations. In 8th International Conference on Distributed Computing Systems, pages 104–111. IEEE Computer Society Press, 1988.
[7] L. G. Roberts. Beyond Moore's law: Internet growth trends. Computer, 33(1):117–119, 2000.
[8] H. Rosmanith and D. Kranzlmüller. glogin - A Multifunctional, Interactive Tunnel into the Grid. In Procs. 5th IEEE/ACM Int. Workshop on Grid Computing (GRID'04), 2004.
[9] V. Servois, A. Osorio, J. Atif et al. A new PC-based software for prostatic 3D segmentation and volume measurement. Application to permanent prostate brachytherapy (PPB) evaluation using CT and MR image fusion. InfoRAD 2002 - RSNA'02, 2002.
[10] A. Tsaregorodtsev, V. Garonne, and I. Stokes-Rees. DIRAC: A Scalable Lightweight Architecture for High Throughput Computing. In Procs. 5th IEEE/ACM Int. Workshop on Grid Computing (GRID'04), 2004.
[11] SDJ WG wiki site. http://egee-na4.ct.infn.it/wiki/index.php/ShortJobs.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Magnetic Resonance Imaging (MRI) Simulation on EGEE Grid Architecture: A Web Portal Design
F. Bellet, I. Nistoreanu, C. Pera and H. Benoit-Cattin
CREATIS, UMR CNRS #5515, U 630 Inserm, Université Claude Bernard Lyon, INSA Lyon, Bât. B. Pascal, 69621 Villeurbanne, France
Abstract. In this paper, we present a web portal that enables the simulation of MRI images on the grid. Such simulations are done using the SIMRI MRI simulator, which is implemented on the grid using MPI and the LCG2 middleware. MRI simulations are mainly used to study MRI sequences and to validate image processing algorithms. As MRI simulation is computationally very expensive, grid technologies appear to be a real added value for the MRI simulation task. Nevertheless, grid access should be simplified to enable final users to run MRI simulations. That is why we developed this specific web portal, which proposes a user-friendly interface for MRI simulation on the grid. The web portal is designed using a three-layer client/server architecture. Its main component is the process layer, which manages the simulation jobs. This part is mainly based on a Java thread that scans a database of simulation jobs. The thread submits the new jobs to the grid and updates the status of the running jobs. When a job is terminated, the thread sends the simulated image to the user. Through a client web interface, the user can submit new simulation jobs, get a detailed status of the running jobs, and obtain the history of all the terminated jobs as well as their status and corresponding simulated images.
Keywords. Web portal, Grid application, MRI simulation
1. Introduction
The simulation of Magnetic Resonance Imaging (MRI) is an important counterpart to MRI acquisitions [1]. Simulation is naturally suited to acquiring a theoretical understanding of the complex MR technology. It is used as an educational tool in medical and technical environments [2]. By offering an analysis independent of the multiple parameters involved in the MR technology, MRI simulation permits the investigation of artifact causes and effects [3]. Simulation may also help in the development and optimization of MR sequences [4]. Finally, an MRI simulator provides an interesting assessment tool for image processing techniques [5], since it generates realistic 3D images from perfectly known virtual medical objects. The SIMRI simulator is a recent advanced 3D MRI simulator [1] that integrates in a unique simulator most of the simulation features that are offered by different
simulators. It takes into account the main static field value (Figure 1) and enables realistic simulations of the chemical shift artifact, including off-resonance phenomena. It also simulates the artifacts linked to static field inhomogeneity, like those induced by susceptibility variation within an object (Figure 2). It is implemented in the C language and distributed under the CECILL public license. The MRI sequence programming is done using high-level C functions with a simple programming interface. To manage large simulations, the magnetization kernel is implemented in a parallelized way that enables simulation on PC grid architectures [6] using a standard Message Passing Interface (MPI) API.
Figure 1. 256x256 simulated brain image at 1.5 T with SIMRI: Spin Echo sequence (TE=25 ms, TR=500 ms, BW=25.6 kHz).
As simulation of the MR physics is computationally very expensive, a parallel implementation is mandatory to achieve performance compatible with the target applications [4]. As an example, it takes 12 hours to simulate a 512² image on a recent PC. This time has to be multiplied by 16 for a 1024² image. In 3D, simulation of a 512³ volume would require 100 years of CPU time! Thanks to the linearity property of the main computation task, the simulation job can be distributed easily, with reduced communication between nodes during simulation [6] and consequently good scalability. As a consequence, the computation time is reduced in proportion to the number of available computation nodes. In this context, by offering virtually unlimited computing power, grid technologies appear to be a real added value for the MRI simulation task [7]. Nevertheless, grid access should be simplified to enable the final users to run MRI simulations. That is why we developed a specific web portal to propose a user-friendly interface for MRI simulation on the grid.
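One plausible reading of these figures, checked with the quick arithmetic below, is that simulation time grows quadratically with the number of voxels: this reproduces the factor of 16 from 512² to 1024², and puts a 512³ volume at hundreds of years of CPU time, the same order of magnitude as the estimate quoted above. The quadratic scaling law is an assumption made for this check, not a statement from the paper.

```python
# Back-of-the-envelope check, assuming time ~ (number of voxels)^2.
base_voxels, base_hours = 512**2, 12.0

def sim_hours(voxels):
    return base_hours * (voxels / base_voxels) ** 2

print(sim_hours(1024**2) / base_hours)    # 16.0: matches the factor in the text
print(sim_hours(512**3) / (24 * 365))     # hundreds of years of CPU time
```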
Figure 2. Illustration of the susceptibility artifact on an air bubble in water with a static field of 7 T. 256x256 simulated image obtained with SIMRI and a Spin Echo sequence (TE=20 ms, TR=1000 ms, BW=20 kHz).
This paper presents the MRI simulation web portal (Simri@Web) we developed. We present the general architecture of the portal, and we detail the data layer, the process layer and the client user interface.
2. The Simri@Web Web portal
2.1. Functionalities and technical environment
The aim of the web portal is to hide the grid middleware from the final users and to provide new functionalities. We target the following:
- Access to all simulation parameters: at least 10 parameters including the sequence name, the sequence parameters, the image size, the main field value…
- Access to two simulation targets: the EGEE grid and the local cluster of our lab.
- A personal account with authentication and a history of all the simulation jobs with the corresponding simulated images, the terminated jobs' status history and the running jobs' status.
- Delivery of the simulated images by email.
Concerning the technical environment, the jobs must be submitted to EGEE using the LCG2 grid middleware, and to the local cluster using the PBS batch manager. The web portal runs on an Apache v2.0.54 web server with the PHP5 module and the libssh2.so and mysql.so libraries. For the data layer (see section 2.3) we use MySQL v4, and for the process layer (see section 2.4) Java 1.4.2 with the jsch.jar and mysql-connector-java-3.jar class sets. Finally, note that the SIMRI code is compiled with the MPI library.
Figure 3. Illustration of a job's travel within the three-layer architecture. The Simri job server corresponds to the data layer; the web server and the Java thread are the two parts of the process layer; the client represents the presentation layer.
2.2. Architecture overview
The web portal architecture is a client/server architecture divided into three layers of services (Figure 3):
- The presentation layer, which includes the client graphical interface, some local processing to check the user inputs, and some data display.
- The process layer, which is in charge of the application processes, including the simulation job management and dynamic web page generation.
- The data layer, which manages all the data. This layer is called whenever data access is required.
2.3. Data Layer
This layer is a MySQL database server that must guarantee the persistence of all the application data, such as the data relative to the users and those relative to the simulation jobs. This layer is the only communication gateway (Figure 3) between the two parts of the process layer defined below. The corresponding database is defined by five tables (Figure 4):
- The user table contains all the personal user data, like their email, their lab (…) and their access rights (yes/no) to the cluster and the grid. Access is granted or denied by an administrator user.
- The job tables (one for the cluster and one for the grid) contain a job id, the user id, the associated simulated image name, and the start and stop times of the job, which are updated by the process layer.
- The job parameter table contains all the simulation parameters associated with a job.
- The job status table contains, for each job id, all the statuses reached by the job (a schema sketch follows this list).
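The five tables just described can be written out as SQL. The sketch below uses Python's built-in sqlite3 for self-containment (the real portal uses MySQL v4), and the column names are illustrative, derived from the descriptions above rather than from the actual schema.

```python
# Sketch of the five data-layer tables (illustrative column names).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE user        (user_id INTEGER PRIMARY KEY, email TEXT, lab TEXT,
                          cluster_access INTEGER, grid_access INTEGER);
CREATE TABLE grid_job    (job_id INTEGER PRIMARY KEY, user_id INTEGER,
                          image_name TEXT, start_time TEXT, stop_time TEXT);
CREATE TABLE cluster_job (job_id INTEGER PRIMARY KEY, user_id INTEGER,
                          image_name TEXT, start_time TEXT, stop_time TEXT);
CREATE TABLE job_parameter (job_id INTEGER, name TEXT, value TEXT);
CREATE TABLE job_status    (job_id INTEGER, status TEXT, reached_at TEXT);
""")
db.execute("INSERT INTO user VALUES (1, 'user@lab.fr', 'CREATIS', 1, 1)")
```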
Figure 4. The five tables of the data layer.
2.4. Process Layer
The process layer is the core of the web portal. It takes into account all the application logic. In our case, the process layer dynamically generates the web pages, collects all the information, manages the target platform connection and the job submission, and collects the simulated images. We chose to separate this layer into two separate parts: one dedicated to user management and one to job management.
2.4.1. Process Layer: User management
The user management (UM) process layer corresponds to the server side that runs on Apache. It is written in PHP. Each time a user connects to the portal, this layer starts a specific session, checks the user's identity against the data layer, and offers the user a personal space composed of three main pages. The first one concerns new job submission (Figure 6). In this page, the UM layer collects the data from the client side to fill the data layer (Figure 3). The two other pages concern the running jobs (Figure 7) and the ended jobs. These two pages are dynamically generated by the UM layer and filled with the data collected from the data layer.
[Sequence diagram: the JobGridThread connects to the grid UI at in2p3.fr and to the Simri job server, reads the submitted jobs, and for each job (id, status) runs edg-job-status, updating the job status table when the status is new; if the status is "done" it runs edg-job-get-output, copies the image (scp) to the image folder, mails the user, and updates the grid jobs table before closing the connections.]
Figure 5. Illustration of the grid Java thread chronogram. The thread interacts with the data layer to get the new jobs, and it interacts with the grid to submit the jobs, get their status and retrieve the simulated images.
2.4.2. Process Layer: Job Management
The job management (JM) process layer is written in Java. It handles job submission and the retrieval of job statuses and simulated images. The JM layer is composed of two Java threads: one dedicated to the interaction with the local cluster, based on PBS, and one to the EGEE grid, based on LCG2. Figure 5 gives the chronogram of the grid Java thread. Each time it wakes up, it consults the data layer to get the new jobs and submits them. It gets the status of all the submitted jobs and updates the job status in the data layer accordingly (Figure 3). For each terminated job, it retrieves the simulated image and sends the corresponding user a web link to the image by email. This split of the process layer is very efficient. The limits of PHP are avoided, as it is only used for the UM, where it is very efficient. Indeed, all possible timeouts while communicating with the grid are handled by the Java thread and not by the web PHP scripts; consequently, the users are never affected by such problems. The web server is only used for the UM. All the code linked to the simulation process is located in the threads and is easy to maintain.
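One wake-up of this loop, transliterated to Python for illustration (the portal implements it as a Java thread), might look as follows. edg-job-status and edg-job-get-output are the LCG2 commands shown in the chronogram; the output parsing and the database callbacks (fetch_submitted, update_status, mail_link) are hypothetical stand-ins.

```python
# Sketch: one polling pass of the grid job-management loop.
import subprocess

def parse_status(output):
    # naive extraction of the "Current Status:" line (output format assumed)
    for line in output.splitlines():
        if "Current Status" in line:
            return line.split(":")[-1].strip()
    return "Unknown"

def poll_once(fetch_submitted, update_status, mail_link):
    for job_id, last_status in fetch_submitted():       # read the data layer
        out = subprocess.run(["edg-job-status", job_id],
                             capture_output=True, text=True).stdout
        status = parse_status(out)
        if status != last_status:
            update_status(job_id, status)               # job status table
        if status.startswith("Done"):
            subprocess.run(["edg-job-get-output", job_id])
            mail_link(job_id)   # mail the user a web link to the image
```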
Figure 6. Job submission client interface. The user can choose the target platform (local cluster or EGEE grid).
2.5. Presentation layer: Client user interface
With this architecture of three independent layers, the client side is very light. It manages only the application presentation, with a small application logic part to check the user inputs. Consequently, the client side is a simple web browser plus a logic part written in JavaScript. The web browser displays only five different pages: the registration page, the connection page, the job submission page (Figure 6), the running jobs page (Figure 7) and the ended jobs page. These pages are dynamically generated by the UM layer.
Figure 7. Running jobs client interface and a window that gives the status history of a terminated job.
3. Conclusion
The Simri@Web portal has been operational since September 2005. At the moment, it is open only to the six people involved in the SIMRI project. After 300 simulations, we observed a job failure rate on the grid of about 20%. This rate is mainly due to non-homogeneous implementations of MPI on a few computing elements (clusters) of the grid, and probably to some bad scheduling policies at the grid level. These problems have been reported and are under investigation. At the moment, we do not allow the user to choose the number of requested MPI nodes, because this number has a direct impact on the number of available computing elements of the grid. Indeed, within LCG2 an MPI job cannot span several computing elements. So we fix the node number to a modest value (12) that gives us more available computing elements and, hopefully, a quicker simulation result for the user. Such a web interface corresponds perfectly to the type of interface wanted by the end users, who appreciate the masking of the middleware and batch manager, the user account, and the complementary services. Nevertheless, we plan to develop a new web portal architecture that would use web service technologies and the gLite middleware recently chosen for the EGEE grid. We target a versatile and open architecture, to be able to add new simulation targets to the portal more easily, like the CINES machines (www.cines.fr), and to add other MRI simulation codes, like the one linked to the susceptibility effect [8]. Finally, this architecture will integrate a data management service to store the high-value simulated images with their corresponding simulation parameters.
4. Acknowledgement

This work falls within the scientific topics of the PRC-GdR ISIS research group of the French National Centre for Scientific Research (CNRS). It is supported by the European EGEE project and by the French Ministry for Research through the ACI-GRID MEDIGRID project. This work has also been funded by INSA Lyon, a French engineering school.
5. References

[1] H. Benoit-Cattin, G. Collewet, B. Belaroussi, H. Saint-Jalmes, and C. Odet, "The SIMRI project: A versatile and interactive MRI simulator," Journal of Magnetic Resonance, vol. 173, pp. 97-115, 2005.
[2] G. Torheim, P. A. Rinck, R. A. Jones, and J. Kvaerness, "A simulator for teaching MR image contrast behavior," MAGMA, vol. 2, pp. 515-522, 1994.
[3] M. B. E. Olsson, R. Wirestam, and B. R. R. Persson, "A computer-simulation program for MR imaging: Application to RF and static magnetic-field imperfections," Magnetic Resonance in Medicine, vol. 34, pp. 612-617, 1995.
[4] A. R. Brenner, J. Kürsch, and T. G. Noll, "Distributed large-scale simulation of magnetic resonance imaging," Magnetic Resonance Materials in Physics, Biology and Medicine, vol. 5, pp. 129-138, 1997.
[5] R. K. S. Kwan, A. C. Evans, and G. B. Pike, "MRI simulation-based evaluation of image-processing and classification methods," IEEE Trans. Medical Imaging, vol. 18, pp. 1085-1097, 1999.
[6] H. Benoit-Cattin, F. Bellet, J. Montagnat, and C. Odet, "Magnetic Resonance Imaging (MRI) simulation on a grid computing architecture," presented at IEEE CCGRID'03/BIOGRID'03, Tokyo, 2003.
[7] J. Montagnat, F. Bellet, H. Benoit-Cattin, V. Breton, L. Brunie, H. Duque, Y. Legré, I. E. Magnin, L. Maigne, S. Miguet, J. M. Pierson, L. Seitz, and T. Tweed, "Medical images simulation, storage, and processing on the European DataGrid testbed," Journal of Grid Computing, vol. 2, pp. 387-400, 2004.
[8] S. Balac, H. Benoit-Cattin, T. Lamotte, and C. Odet, "Analytic solution to boundary integral computation of susceptibility induced magnetic field inhomogeneities," Mathematical and Computer Modelling, vol. 39, pp. 437-455, 2004.

1 www.cines.fr
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Towards a Virtual Laboratory for fMRI Data Management and Analysis

Silvia D. Olabarriaga a,*, Aart J. Nederveen b, Jeroen G. Snel b and Robert G. Belleman a
a Informatics Institute, University of Amsterdam
b Academic Medical Center of the University of Amsterdam
Abstract. Functional Magnetic Resonance Imaging (fMRI) is a popular tool used in neuroscience research to study brain activation due to motor or cognitive stimulation. In fMRI studies, large amounts of data are acquired, processed, compared, annotated, shared by many users and archived for future reference. As such, fMRI studies have characteristics of applications that can benefit from grid computation approaches, in which users associated with virtual organizations can share high performance and large capacity computational resources. In the Virtual Laboratory for e-Science (VL-e) Project, initial steps have been taken to build a grid-enabled infrastructure to facilitate data management and analysis for fMRI. This article presents our current efforts for the construction of this infrastructure. We start with a brief overview of fMRI, and proceed with an analysis of the existing problems from a data management perspective. A description of the proposed infrastructure is presented, and the current status of the implementation is described with a few preliminary conclusions.

Keywords. medical image analysis, grid computing, functional MRI, virtual organizations, IT for large population studies
1. Introduction

Functional magnetic resonance imaging (fMRI) is a popular tool used in neuroscience research to study brain function. In fMRI studies, large amounts of data are acquired, processed, compared, annotated and stored for future reference. The users of fMRI (in particular psychologists, psychiatrists, radiologists, etc.) typically have a limited technical background in computing and, as such, face several difficulties in organizing their workflow comfortably and efficiently with the individual solutions available for personal computers. Computational resources with higher capacity and performance are needed to properly address the needs of these users. The Virtual Laboratory for e-Science (VL-e) Project1 has taken initial steps to build a grid-enabled infrastructure to facilitate data management and analysis for fMRI studies. This article presents our current efforts for the construction of this infrastructure. We start with a brief overview of fMRI (section 2), and proceed with an analysis of the existing problems from a data perspective (section 3). The proposed infrastructure is described in section 4, followed by a brief discussion and preliminary conclusions in section 5.

* Correspondence to: Silvia D. Olabarriaga, Kruislaan 403, 1098 SJ, Amsterdam. Tel.: +31 20 525 7549; E-mail: silvia@science.uva.nl
1 http://www.vl-e.nl/
2. fMRI at a Glance

fMRI enables the study of brain activation in a non-invasive manner. The basic idea is to scan a subject while he/she is submitted to brain stimulation through a physical or cognitive activity. Depending on the type of sensory stimulus (visual, auditory, motor, etc.) or cognitive task, the neuronal activity increases in different parts of the brain, and a haemodynamic response occurs. In simple terms, the active region receives oxygen-rich blood, and the changes in the oxygenation level can be measured with MRI.

An fMRI scanning session produces a series of 3D datasets (volumetric images) containing measurements along time, some obtained during stimulation and some at “rest”. These images are subsequently analysed to determine the location of activated areas – refer to [1] for details. First, the 3D volumes in the time series are aligned to each other to compensate for artefacts introduced by temporal sampling and motion. Additionally, filters are applied to reduce noise and normalise the measurements. Next, statistical analysis is performed to correlate the measured signal with the stimulation pattern (a standard formulation is sketched at the end of this section). This step generates a statistical map, which is again submitted to statistical analysis to detect activation based on an adaptive threshold. The final result of the analysis is an activation map that can be further analysed to determine the location and size of activation clusters or activated regions. Activation maps are overlaid on an additional high-resolution structural scan for visual inspection of activation with respect to the anatomy. Instead of a structural scan of the same subject, a reference brain can be used, e.g. the Montreal Neurological Institute average brain [2].

fMRI is widely used in neuroscience studies, for example, to characterize brain function in populations. Often the activated regions detected in different scans are compared, which requires additional image registration steps for their alignment to a common coordinate system. A future perspective is that fMRI will also be used in a broad range of situations in diagnosis, prognosis and treatment planning. Examples include aids to detect anomalies, prediction of functional damage due to trauma, or planning of neurosurgery [3]. Finally, the acquired data and metadata (results, annotations) are typically shared in large multi-centre studies, which are becoming increasingly popular for the characterization of large populations in neuroscience or for the evaluation of new healthcare procedures.
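Although this paper does not spell out the statistics, the correlation step is commonly cast as a general linear model (GLM); the following is the standard textbook formulation, not a detail specific to this infrastructure.

    % Standard GLM formulation of the fMRI statistical step (textbook
    % material, not taken from this paper).
    \[
      Y = X\beta + \varepsilon,
      \qquad
      \hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y,
      \qquad
      t = \frac{c^{\top}\hat{\beta}}
               {\sqrt{\hat{\sigma}^{2}\, c^{\top}(X^{\top}X)^{-1}c}}
    \]
    % Y: measured time series of a voxel; X: design matrix encoding the
    % stimulation pattern; c: contrast vector. The t values over all
    % voxels form the statistical map mentioned above.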
3. Data-related Issues in fMRI Studies

From a data perspective, fMRI studies involve data acquisition, storage, analysis and shared access. Note that “data” here refers to a large variety of information
and measurements acquired or generated during an fMRI study, which are heterogeneous by nature. Examples of data are scanned images (functional, structural), information about the applied stimuli, parameters of the acquisition protocol, subject characterization (e.g., age, gender, pathology), results of the analysis (e.g., statistical and activation maps, locations of activated regions) and interpretation (e.g., annotations). These data are generated by different types of physically dispersed equipment and image analysis utilities, requiring a significant amount of time and effort for adequate management. Such effort is likely to increase as the amount of data grows in response to developments in scanning techniques, analysis methods, and collaborative and multi-centre research. Below we discuss a few of the problems encountered.

Data acquisition in this discussion is restricted to the images and associated signals recorded during an fMRI scanning session, involving a collection of equipment in complex experiments2. Experiment design is based on prior knowledge about brain function and imaging protocols, which is typically accumulated and shared informally by researchers and practitioners in the field. One of the difficulties is to have access to resources with structured information about experimental design (e.g., databases of acquisition protocols or stimuli). By keeping documentation of experimental procedures, such resources could facilitate the validation of experiments, the design of new experiments and the standardization of existing ones. Experiment control involves synchronizing image acquisition (by a scanner) with stimulation (by a stimulus computer), as well as recording images and other signals (e.g. electroencephalography, EEG). At the end of an fMRI session, all data (images, stimuli, signals, etc.) are gathered from the acquisition equipment and exported to a remote storage resource. The acquisition equipment is often dispersed, heterogeneous, and located behind a hospital firewall. In a clinical setting, the images are stored in a Picture Archival and Communication System (PACS), while other data are manually transported using some physical medium (DVD, memory stick) or sftp. In a typical research setting, the images too are manually transported to the external storage resource for further analysis. Data gathering could be facilitated by connecting all data acquisition equipment at the scanning site directly and securely to a (remote) storage resource that can store all the collected data.

Data storage for fMRI studies presents three main difficulties. First, large storage capacity is needed, since studies involve many instances (typically above 20) of large datasets (500MB to 1GB per scanning session). Second, the storage system should be flexible enough to accommodate heterogeneous data types such as images, signals, and metadata. Although the adoption of a PACS would be the natural choice in a medical environment, current systems are still limited with respect to data capacity, image format, and storage of non-pixel data. Finally, when patient data are used in these studies, high demands are imposed on data confidentiality. Not only is a secure connection required, but all identity information must also be stripped from the data before they leave the scanning site.

Data analysis in fMRI involves applying complex and computation-intensive image processing methods to large amounts of data. It can take more than one hour to complete the analysis of one fMRI session on the workstations typically available to researchers at their home institutions. Normally the analysis is executed as a post-processing step, and the results become available long after the scanning session. If a problem is detected during the analysis (e.g., motion artefacts), a new scanning session must be scheduled. The usage of high performance computing (HPC) resources could help to reduce latency and enable interactive inspection of data quality while the subject is still at the scanning site. Additional problems are faced when running image analysis at large scale, for example, when the study includes a large number of subjects or when performing parameter optimisation. The complete analysis in these cases can take days on typical workstations. HPC capacity would be beneficial here for achieving higher throughput by parallel execution of independent tasks (e.g., analysis of individual scans). Moreover, the logistics of data and computational resources require much effort to guarantee proper error handling, enough storage space for intermediate results, proper data conversion and transfer, proper parameter settings, etc. The researchers involved in fMRI usually do not have enough technical knowledge to set up an infrastructure that performs reliable, efficient and secure image analysis. These users could benefit from sharing a common IT infrastructure for large-scale fMRI studies, in which the workflow can be automated.

Data access and sharing are challenging issues because fMRI studies are performed by groups of users associated with the same institution or with multiple centres. Moreover, a growing trend in neuroscience is to share the acquired data and generated metadata (results) with other researchers after the study is completed [4]. In this manner data can be reused, experiments can be reproduced or repeated with different settings, and results can be used for meta-analysis. The following issues characterize the demands for shared data access in this context. First, multiple and physically dispersed sites are involved, requiring remote and secure access to (distributed) data. Second, it is necessary to control and monitor access to data, respecting strict data privacy policies that may differ per site or study. Third, large amounts of data are involved, requiring mechanisms for efficient retrieval such as queries based on metadata. Finally, data should be archived for long periods of time, requiring extremely large and permanent storage capacity.

The scenario described above indicates that a proper IT infrastructure is fundamental to accomplish fMRI studies successfully. Table 1 summarizes the different problems faced in data management for fMRI studies and the challenges posed to the construction of an adequate infrastructure.

2 In fact, data and metadata are collected during the whole lifetime of a study.
4. A Virtual Laboratory for fMRI: VL-f

The construction of an IT infrastructure addressing the challenges in Table 1 obviously requires technical knowledge that is beyond the scope of neuroscience, and perhaps also beyond what could be accomplished with traditional computing paradigms. The characteristics of this application indicate, on the other hand, that it could benefit from grid computing approaches [5] for the following reasons.
Table 1. Difficulties for data management in fMRI and associated IT challenges.

  Characteristics                     Challenges
  Acquisition:
    Complex experiments               Share and reuse experiment design
    Multiple equipments               Access dispersed and heterogeneous systems
  Storage:
    Many large data instances         Large storage capacity
    Heterogeneous data types          Flexible storage system
    Patient data                      Data confidentiality
  Analysis:
    Computation-intensive analysis    HPC for throughput
    Interactive response              HPC for real-time computation
    Large scale processing            Logistics of data and resources
  Shared Access:
    Multiple centres                  Remote access to distributed data
    Many users                        Controlled access to (confidential) data
    Large amounts of data             Query based on metadata
  Archival:
    Data archival                     Long term storage/retrieval
First, fMRI studies are data intensive, since large amounts of data are stored, analysed and manipulated. Second, they require high throughput computation on demand for real-time image analysis and for large scale studies. Finally, collaboration and distributed computing are essential, in particular for multi-centre studies, in which data are acquired, analysed and shared from different locations. This application therefore requires coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (VOs), which is the goal of grid technologies [6].

A virtual laboratory for fMRI (VL-f) is under development in the scope of the VL-e project to address some of the challenges listed in Table 1. The goal is to construct a shared computational infrastructure with hardware, software and services to efficiently, reliably and securely perform (large scale) fMRI studies. The following specific goals will be pursued:

1. Facilitate data gathering at the scanning site, by providing homogeneous access to the acquisition equipment.
2. Facilitate data storage and archival, by providing access to large capacity and long-term storage resources.
3. Enable high data analysis throughput, by providing access to HPC resources to perform parallel analysis of mutually independent data.
4. Facilitate the data logistics in (large scale) fMRI studies, by providing tools to automate the workflow (data gathering, removal of subject identity, data format conversion, and image analysis).
5. Provide remote data access via an interactive interface to the storage resource from workstations located anywhere.
6. Enable secure data sharing, by providing mechanisms for controlling access to the data for users and groups.
7. Facilitate data retrieval, by providing an infrastructure for the generation of metadata and query mechanisms based on metadata.
Figure 1. Computational resources of the VL-f.
Below we present a description of the resources (section 4.1), the ideal use scenario pursued by VL-f (section 4.2), the plans for the first pilot implementation (section 4.3) and its current status (section 4.4).

4.1. VL-f Resources

The simplified scheme in Figure 1 presents the hardware and software resources of VL-f, which are distributed among scanning, research and services sites.

Scanning sites are the locations where a scanner and other acquisition devices are installed, normally in the radiology department of a hospital. These pieces of equipment are connected to each other directly, and possibly also to others, via an internal network (intranet) protected by a firewall. They are accessible only from workstations located in their physical vicinity (e.g., the examination room). Some workstations have access to public networks (e.g., the internet) and are called grid access nodes (GANs) in the proposed scheme. In the first phase of VL-f, scanning is performed at the radiology department of the Academic Medical Center (AMC), which includes a Philips 3T Intera MRI scanner, a stimulus computer, and a GAN.

Research sites are locations where the (neuro)scientist interacts with the data from a workstation. In the first phase of VL-f, the research sites are located at several departments of the University of Amsterdam (UvA), the Free University (VU), and private computing facilities (e.g. at home). Workstations based on Windows or Linux platforms will be supported.
Services sites provide compute and storage resources. Although an open grid-oriented architecture will be adopted, in the first phase of VL-f only the grid resources provided by the VL-e Proof-of-Concept Environment (VL-e PoC) are used. These resources are provided by SARA Computing and Networking Services3 and consist of computing elements and on- and near-line storage elements. Access to the hardware facilities and software services will be granted to authorized users associated with one or more VOs. The computing elements are accessed via a user interface machine linked to the European DataGrid (EDG). In a first phase, only the Matrix cluster at SARA will be available for the fMRI VO. This cluster consists of 36 IBM x335 nodes equipped with dual Xeon 3.06GHz processors, 2GB memory, and two local EIDE disks of 120GB. The nodes are connected with 1 Gb/s Ethernet. Other computers and clusters located at other VL-e partners will be added in the future (e.g., the National Institute for Nuclear Physics and High Energy Physics, NIKHEF). Storage resources include on-line storage and tape silos for off-line, permanent and unlimited storage space. Data are transported automatically between the on- and off-line storage systems based on usage patterns. Data integrity and accessibility are provided by automatic and periodic back-ups. The on-line storage resource consists of the Storage Resource Broker (SRB4) system, which provides a seamless interface to store and retrieve data and metadata across a wide area network [7].

4.2. Ideal Scenario

The ideal functional scenario pursued by VL-f is illustrated in Figure 2. When a scanning session is completed, data from the scanner and stimulus computer are gathered into a single workstation at the scanning site. Identity information is removed from the images with an application that additionally provides aids to control pseudonyms and real identities in the context of individual or multiple studies. The user schedules the transfer of identity-free data to the SRB, performing SRB authentication with a Grid Security Infrastructure (GSI5) certification protocol. Metadata encapsulated by the file format (e.g., DICOM) is automatically associated with the data upon upload. The data are transferred to the SRB, and the user is notified via e-mail when and where (uri) the data have been successfully stored.

The user schedules data conversion (images, stimuli data) from a workstation at the research site, indicating that the source and destination files are stored in the SRB. Grid and SRB authentication are used to enable access to the storage and the VL-e PoC computational resources, where the data conversion job is performed. The user is notified via e-mail when the conversion is complete and where the results have been stored.

Several software packages can be chosen for image analysis, for example FSL (fMRIB Software Library [8]) and SPM (Statistical Parametric Mapping [9]). The user configures the image analysis parameters from a workstation at the research site, indicating that the input and output files are stored in the SRB. Authentication (SRB, Grid) takes place, and the analysis job is scheduled to run on the VL-e PoC computing resources. The user is notified when the analysis is completed, with a link to the results in the SRB.

Figure 2. Overview of VL-f functional components.

The LCG-2 grid middleware6 is used for running jobs that perform data conversion and image analysis on the VL-e PoC. Job submission is performed via a web service that encapsulates the functionality of the EDG command-line utilities. Jobs are retried a given number of times, and permanently faulty conditions are notified both to the user (researcher) and to the technical support (image analysis or grid specialist, depending on the problem).

At any time, and from any workstation at the research sites, the user can perform the following operations via interactive and intuitive clients and/or web portals:

• monitor and control the status of scheduled jobs for data transfer, data conversion and image analysis,
• browse data and inspect their content with html, text, image, and other specialized viewers,
• transfer data between the SRB and the local workstation,
• manage files and control data access permissions for individuals and groups, on a permanent or temporary basis,
• edit metadata stored in the SRB metadata catalogue, and
• retrieve data with queries based on metadata.

Finally, tools for workflow automation enable the user to combine and schedule at once tasks such as data pseudonymisation, data transfer to the SRB, data conversion, image analysis and metadata generation.

4.3. First Pilot

A minimum but complete subset of the ideal functionality described in section 4.2 was selected for implementation in the first pilot. Existing software is used as much as possible to enable rapid development. Issues such as optimisation, development of intuitive GUIs, cross-platform functionality and workflow automation were left for a later phase.

Data gathering is performed on a workstation that has access to the file systems of the scanner and the stimulus computer as a remote drive or directory. Pseudonymisation, if necessary, is performed by an application that simply removes identity information from 4-D images stored in the PAR/REC Philips Medical Systems proprietary file format. The data are transferred into the SRB using inQ7, a browser for Windows platforms, which only supports password authentication. inQ is also used for browsing data on the SRB, controlling user access for groups and individuals, transferring data between the SRB and the workstation, general file management, metadata management and queries.

Image analysis is performed with FSL and consists of a sequence of customized steps implemented by command-line utilities (binary code and scripts). These steps have been encapsulated in FSL by a tcl script that takes parameters from a single configuration file. Some parameters are used to control the image analysis, while others indicate the location of input data (complete file paths to images and stimuli data) and output results. The analysis results consist of several files (images, text) that are stored in a single directory with a given name. A summary report in html generated by FSL facilitates browsing the results. For proper execution on the VL-e PoC, the FSL script needs to be wrapped into a higher-level component that handles files stored in the SRB and provides adequate user notification. The input files are automatically downloaded to the local file system of the computing node prior to running the original script, and the results are uploaded when it completes. Error handling, which is limited to displaying messages on stderr in the original script, must be extended to also notify the user and the technical support. The analysis is started manually, using a job submission client running on the workstation at the research site. Lists of jobs (e.g., for several scans) can be submitted at once, in which case the analysis is performed in parallel on the available computing elements.

Data conversion facilities are limited to the formats required by the FSL utilities. For images, conversion is performed from PAR/REC to the NIFTI-1 format8.

3 http://www.sara.nl/
4 http://www.sdsc.edu/srb
5 http://www.globus.org/security/overview.html
6 http://lcg.web.cern.ch/LCG/activities/middleware.html
7 http://www.sdsc.edu/srb/inQ/inQ.html
8 See http://www.fmrib.ox.ac.uk/fsl/fsl/formats.html.
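As an illustration of the wrapper component described in section 4.3, a minimal Java sketch follows. The Sget and Sput calls are the SRB Scommands mentioned in section 4.4, but the script name, paths and mail-based notification are invented for the example; the real pilot combines these steps in a shell script.

    // Hypothetical wrapper: stage input from the SRB, run the customized
    // FSL analysis, upload the results and notify the user. All paths,
    // the script name and the mail command are placeholders.
    import java.io.IOException;

    public class FslJobWrapper {
        public static void main(String[] args) {
            String srbIn = args[0], srbOut = args[1], user = args[2];
            try {
                run("Sget", srbIn, "input");             // download input data from the SRB
                run("./fsl_analysis.tcl", "config.txt"); // customized FSL steps
                run("Sput", "results", srbOut);          // upload the results to the SRB
                run("mail", "-s", "Analysis finished: " + srbOut, user);
            } catch (Exception e) {
                // extend the original stderr-only error handling: notify
                // both the user and the technical support
                try {
                    run("mail", "-s", "Analysis FAILED: " + e.getMessage(), user);
                    run("mail", "-s", "Analysis FAILED: " + e.getMessage(),
                        "support@vl-e.example.org"); // hypothetical support address
                } catch (Exception ignored) {}
            }
        }

        static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0)
                throw new RuntimeException(cmd[0] + " exited with a non-zero status");
        }
    }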
4.4. Current Status

The implementation of this pilot is at an early stage. Images can be transferred to the SRB directly, but the data from the stimulus equipment still need manual intervention. Only MS Windows-based workstations are supported for interactive access to the SRB with inQ. Scommands9 are used to upload and download data from within the scripts that perform data manipulation autonomously. The data conversion and image analysis tasks have been combined into one script that is executed as a single job. Jobs are scheduled and monitored via existing utilities that offer an interactive GUI for EDG job submission. These utilities must be executed on the EDG user interface machine. Finally, no explicit metadata facilities or specialized data viewers are present, except for those already implemented by the SRB and inQ.

9 http://www.sdsc.edu/srb/scommands/index.html

5. Discussion and Conclusions

Other attempts have been reported to provide grid-enabled infrastructures for medical applications. Montagnat et al. present in [10] several medical applications that can benefit from grid technology by using parallel and distributed computing for higher throughput and lower latency. Rogulin et al. describe in [11] the Mammogrid project, which uses grid technology to integrate and provide access to databases of mammograms for computer-aided diagnosis. Barillot et al. describe in [12] the Neurobase project, which uses grid technology for the integration and sharing of heterogeneous sources of information in neuroimaging. Rex et al. present in [13] the LONI Pipeline Processing Environment, a cross-platform, distributed environment for the design, distribution, and execution of image analysis for neuroimaging applications. It has a visual programming interface in which a large repertoire of components can be combined to perform the desired image analysis steps. A dataflow model is adopted to support a parallel processing architecture, which enables simultaneous execution of multiple tasks. These attempts have emphasised either information or computation aspects. Our efforts, however, are focused on the construction of an infrastructure that addresses both aspects transparently, efficiently and robustly.

The infrastructure proposed in section 4 has the potential to alleviate to a large extent the problems presented in section 3, because it provides large and long-term storage capacity, remote and controlled access to distributed and heterogeneous data, facilities for metadata storage and query, access to HPC resources, and workflow automation. This potential remains to be confirmed when the implementation is completed and evaluated from the perspective of the end users. The few experiences with the pilot already indicate that the implementation of VL-f will be a challenging task, and that several issues should be properly addressed before the infrastructure can be considered useful. First, error detection and notification are typically poor in legacy software, usually limited to messages written to files (stdout, stderr) or a return code.
It would be desirable to clearly notify failure and success in a compact manner, to more efficiently guide the user in the inspection of results. The relevance of such a feature will increase proportionally with the scale and automation level of the workflows. Second, more intuitive tools are needed to submit, monitor and control job execution, in particular for large numbers of jobs. We are currently investigating Nimrod-G [14] as an alternative for large-scale job submission. Third, it is important to provide simple means to request grid services (for job submission) from any workstation at the research sites. We are planning to develop platform-independent clients that will communicate with the EDG web service implemented at the VL-e PoC to submit and control/monitor large numbers of jobs. This service is compliant with the Web Services Resource Framework (WSRF) architecture, which is also under consideration for implementing functionality such as data conversion and image analysis. Finally, workflow automation must be improved, for example, by integrating the VL-f functionality into the Distributed Workflow Management System currently in use at the AMC [15].

Note that the proposed infrastructure does not include explicit mechanisms for strict management of data confidentiality, beyond removing identity information and controlling access to the data via SRB authentication. Although the current strategy may be insufficient for handling patient data, it seems adequate for a large number of research studies in which the subjects are volunteers. Constructing an adequate and useful IT infrastructure, even if for a limited scope of fMRI studies, is the goal of our current efforts in VL-f.
Acknowledgements

This work was carried out in the context of the VL-e project. This project is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ).
References

[1] S.M. Smith. Overview of fMRI analysis. The British Journal of Radiology, 77:S167–S175, 2004.
[2] A. C. Evans, D. L. Collins, R. R. Mills, E. D. Brown, R. L. Kelly, and T. M. Peters. 3D statistical neuroanatomical models from 305 MRI volumes. In Proceedings of the Nuclear Science Symposium and Medical Imaging Conference, volume 3, pages 1813–1817. IEEE, 1993.
[3] E.J. Vlieger et al. Functional magnetic resonance imaging for neurosurgical planning in neuro-oncology. European Radiology, 14:1143–1153, 2004.
[4] J. D. Van Horn, S. T. Grafton, D. Rockmore, and M. S. Gazzaniga. Sharing neuroimaging studies of human cognition. Nature Neuroscience, 7(5):473–481, 2004.
[5] I. Foster and C. Kesselman. Computational grids. Communications of the ACM, 35(6):44–52, 1998.
[6] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. International J. Supercomputer Applications, 15(3), 2001.
[7] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource broker. In CASCON'98 Conference, 1998.
[8] S.M. Smith et al. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage, 23:208–219, 2004.
[9] K.J. Friston. Statistical parametric mapping and other analyses of functional imaging data. In Brain Mapping: The Methods, pages 363–385. Academic Press, 1996.
[10] J. Montagnat et al. Medical images simulation, storage, and processing on the European DataGrid testbed. Journal of Grid Computing, 2:387–400, 2004.
[11] D. Rogulin et al. A grid information infrastructure for medical image analysis. In Proceedings of the DiDaMIC Workshop (MICCAI 2004), 2004.
[12] C. Barillot et al. Neurobase: Management of distributed and heterogeneous information sources in neuroimaging. In Proceedings of the DiDaMIC Workshop (MICCAI 2004), 2004.
[13] D.E. Rex, J.Q. Ma, and A.W. Toga. The LONI pipeline processing environment. NeuroImage, 19:1033–1048, 2003.
[14] R. Buyya, D. Abramson, and J. Giddy. Nimrod-G: Resource broker for service-oriented grid computing. IEEE Distributed Systems Online, 2(7), 2001.
[15] J. G. Snel, S. D. Olabarriaga, J. Alkemade, H. G. van Andel, A. J. Nederveen, C. B. Majoie, G. J. den Heeten, M. van Straten, and R. G. Belleman. A distributed workflow management system for automated medical image analysis and logistics. In IEEE-CBMS, special track on Grids for Biomedical Informatics, accepted, 2006.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Service-oriented Architecture for Grid-enabling Medical Applications

Anca BUCUR, René KOOTSTRA, Jasper van LEEUWEN and Henk OBBINK
Philips Research, High Tech Campus 31, 5656 AE, Eindhoven, the Netherlands
{anca.bucur, rene.kootstra, jasper.van.leeuwen, henk.obbink}@philips.com
Abstract. Grid technologies have the potential to enable healthcare organizations to efficiently use powerful tools, applications and resources, many of which were so far inaccessible to them. This paper introduces a service-oriented architecture meant to Grid-enable several classes of computationally intensive medical applications for improved performance and cost-effective access to resources. We apply this architecture to fiber tracking [1,2], a computationally intensive medical application suited for parallelization through decomposition, and carry out experiments with various sets of parameters, in realistic environments and with standard network solutions. Furthermore, we deploy and assess our solution in a hospital environment, at the Amsterdam Medical Center, as part of our cooperation in the Dutch VL-e project. Our results show that parallelization and Grid execution may bring significant performance improvements and that the overhead introduced by making use of remote, distributed resources is relatively small. Keywords. Grid computing, service-oriented architecture, fiber tracking, performance, speedup
Introduction

The Grid provides transparent, ubiquitous, scalable and secure access to large amounts of various resources anywhere and at any time. Grid technology may provide organizations such as medical institutions with powerful tools through which they can gain coordinated access to resources otherwise inaccessible to them. This technology also has the potential to enable new applications that were not possible before, for example applications that require high-performance or high-throughput computational power, or a large number of various resources usually not available at one site. Healthcare organizations may use Grids to share resources and to access remote resources, but also to become service providers.

We propose a service-oriented architecture that enables computationally intensive medical applications to use Grid technologies for improved performance and cost-effective access to computational resources. We first assessed different types of applications that could benefit from Grid technology, and distinguished three classes of such applications for which we developed a generic architecture [3]. Next, we chose several relevant medical applications fitting the identified classes and applied our architecture to "gridify" these applications, i.e. parallelize them and enable them to use grid resources and technology. As a first case study, we apply our GAMA architecture to fiber tracking, a computationally intensive medical application suited for parallelization through
decomposition. We run this application in a Grid environment and carry out performance measurements for various sets of parameters. Our end-to-end solution allows for the remote exploitation of powerful Grid resources while preserving the current way of working in the hospital, i.e. the use of Grid resources is transparent to the users of the application.

In fiber tracking algorithms, one of the computationally intensive elements is the number of starting points used to "search" for fibers that satisfy the multi-ROI (region of interest) search criterion. In this Grid-enabled application we set the algorithm to simply use all voxels of the brain as starting points (full volume fiber tracking). Our results described in this paper show that for this application parallelization and Grid execution bring significant performance improvements. The additional communication overhead introduced by making use of distributed, remote resources is relatively small with standard network solutions, and does not preclude the cost and performance benefits of deploying Grid-based applications.
1. GAMA Overview

In this section we first briefly introduce the application types targeted by our architecture. Next, we describe the general architecture and reflect on its potential benefits and on the reasons behind our choices.

1.1. Target applications

In previous research [3], we selected a number of computationally challenging medical applications. Their clinical use is currently hampered by the unavailability of sufficient computational power. The medical imaging applications that we analysed are all suited for parallelization through decomposition and fit into three distinct classes of decomposition patterns that allow them to exploit parallelism in three ways: computational, domain and functional decomposition. Their algorithms exhibit a significant degree of spatial locality in the way they access memory, as well as temporal locality in the sequence of operations that are performed. These applications manipulate large amounts of data, but the communication entailed by their parallel solution is low enough not to preclude (Grid-based) distributed execution. We assume that once gridified, their bandwidth requirements should allow for the use of standard network solutions available to healthcare organizations. It is expected that the sizes of the data sets will increase over time, requiring increasingly powerful computational resources. Furthermore, with the availability of increasing computational power and wide access to (geographically) distributed resources, it may also be expected that new applications, not possible before, will emerge and that some existing applications will become more relevant.

1.2. Architecture

In this context, our goal is to provide a generic architecture suitable for hosting a wide range of applications. Next to improving the performance of each individual application, the architecture should be scalable relative to the data volume, and should
allow changes in the computational algorithm with a minimum of changes in the algorithmic structure, and preferably no change at all in the architecture itself. By studying the common and differentiating features of the application classes that we selected, we were able to design a generic Grid architecture that can be applied to applications fitting at least one of the above-mentioned decomposition patterns, and that enables us to gridify these applications, i.e. parallelize them and let them make use of resources available on the Grid. The three decomposition patterns we selected can all be modeled within the general framework depicted in Figure 1. The underlying architecture is designed to simultaneously support distinct applications fitting at least one of the decomposition patterns.

This architecture is service-oriented, in the sense that it supports the transition from software licensing to services: the application provider may install a thin client interface in the hospital to access remote (Grid) resources where the actual application runs, while the cost for the end user would be based on usage. As Figure 1 shows, we chose a basic two-tier client-server architecture. The server can simultaneously provide different sets of services for each of the application clients, which normally reside in medical workspots somewhere in the clinical workflow. Our primary intention is to make this framework minimally invasive, in the sense that the use of external remote resources is transparent to the end user. As an initial alternative, we considered that instead of enforcing the Grid-based solution, it could be offered as an option: depending on the computational requirements, the quality of service needed, and the availability of Grid resources, the client application may automatically choose whether or not to use external Grid resources. In the event that insufficient external resources are available, the applications would automatically fall back to locally available ones. In the future, when enough Grid providers are available and can reliably ensure the required quality of service, such that sufficient remote resources are always obtainable when needed, the local version can be safely removed.

Currently, the medical applications that are part of the clinical workflow are hosted on Windows workstations at the client side. At the other end, Grid technology is centered around Globus [4], a software interface that provides the Grid "middleware". Since Globus is Unix-based, in order to enable such applications to use the Grid for their execution while maintaining the standard way of working in hospitals, the compute-intensive part of the application has to be separated from the rest of the application and ported to the Grid environment. To provide an interface from the Windows environment to a Grid infrastructure designed around Globus, we designed a Unix-based "Grid Access Point" (GAP) module that receives the requests from the client side and allocates the needed processors on the Grid, passing on the requests and returning the results to the client. We opted for a lightweight client providing user interfaces and visualization. The GAP uses Globus for submitting the requests to the Grid nodes, thereby exploiting the security and execution facilities it offers. Globus is entirely Unix-based, so it could not be used if we chose to connect the client side directly to the Grid nodes.
Figure 1. GAMA overview: medical workspots hosting application clients (Applications 1-3), connected through client-server links and the GRID Access Point to the GRID Resources.
Often the client side, which may be a regular clinical site, has slow external network connections. Therefore, the overall performance of the application benefits from keeping the client side as little involved as possible in the computational algorithm and from restricting its communication with the rest of the application to job-submission requests only. For high throughput, the GAP should be connected to the Grid infrastructure via a fast network.
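As an illustration of this division of labour, the GAP's forwarding role might look like the sketch below; the request format, port, host name and executable path are all invented, and only the use of a Globus command-line client reflects the text.

    // Hypothetical sketch of the GAP: it accepts a one-line request from
    // the client side and forwards it to a Grid node through Globus.
    // Port, host, request format and paths are placeholders.
    import java.io.*;
    import java.net.*;

    public class GridAccessPoint {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(9000)) { // assumed GAP port
                while (true) {
                    try (Socket client = server.accept()) {
                        BufferedReader in = new BufferedReader(
                                new InputStreamReader(client.getInputStream()));
                        String line = in.readLine();
                        if (line == null) continue;
                        // request: "<number-of-processors> <path-to-dataset>"
                        String[] req = line.trim().split("\\s+");
                        // allocate processors on a Grid node and start the solver
                        Process job = new ProcessBuilder(
                                "globus-job-run", "gridnode.example.org",
                                "-np", req[0], "/opt/ft/fibertrack", req[1])
                                .redirectErrorStream(true).start();
                        // stream the results (tracked fibers) back to the client
                        InputStream out = job.getInputStream();
                        byte[] buf = new byte[8192];
                        for (int n; (n = out.read(buf)) != -1; )
                            client.getOutputStream().write(buf, 0, n);
                    }
                }
            }
        }
    }

The client side thus exchanges only a job request and its results with the GAP, keeping its involvement in the computation, and its bandwidth needs, minimal.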
2. Case Study: Fiber Tracking

In this section we describe the results of applying the GAMA architecture to a compute-intensive medical application suitable for computational decomposition: the fiber tracking application [5, 6].

2.1. Description

White fiber tracking is an indirect medical imaging technique, based on Diffusion Weighted Imaging, that allows for the extraction of the connecting pathways between brain structures. There are various solutions to fiber tracking, but the common feature is that, starting from a number of points, white matter fibers have to be tracked in the entire domain. The more starting points are considered in a certain area, the more fibers crossing that area can be detected. For areas with a high concentration of fibers, too many detected fibers may also lead to a clogged, indistinguishable image. Also, extending the
region in which the starting points are distributed yields a larger number of detected fibers. Considering too few starting points or too small a region may result in low accuracy, i.e. relevant fibers being missed. The best selection method for the relevant fibers is therefore not to consider only a few starting points, but to specify a number of regions of interest (ROI) that the tracked fibers should cross. The time to run such an application depends on the number of starting points, the algorithm, and the size of the data set, and can amount to several hours.

When tracking the fibers crossing one or more ROIs, a common approach is to start tracking from all the voxels in one or more of the ROIs. Another approach is to start tracking from the entire domain, either from every voxel or from a selection of voxels (uniformly distributed or not) spread over the entire domain. The advantage of full volume tracking is that it detects a larger number of fibers. It can also detect crossing, splitting and touching fibers. It is, however, slower, the runtime of the algorithm amounting to several hours (depending on the voxel size and on the number of voxels selected as starting points). This type of algorithm is the ideal case for parallelization: a distributed solution can increase the throughput without decreasing the accuracy of the result. In order to pay off, the parallel solution should be scalable, and the communication overhead among the processors executing the algorithm should be small.

For this application, fibers are tracked in the entire data domain, so directly decomposing the domain among the participating processors would not be viable because of the large need for communication and synchronization among the processors. The starting points, however, can be distributed among the processors without generating additional synchronization or communication needs. Therefore, this problem is well suited for computational decomposition, meaning that each processor taking part in the computation receives the entire data domain, but the computation domain, i.e. the set of starting points, is divided among the processors.

Instead of tracking fibers starting from the entire domain, starting from the voxels in one or more ROIs is faster, but fewer fibers are detected. The algorithm may miss crossing or splitting fibers when the region of interest does not include the areas where fibers split or cross. This case is also not suited for parallelization for small ROIs, because the communication overhead may be larger than the decrease in computation time due to parallelism. For large ROIs a parallel solution may still provide a reasonable speedup. The threshold at which a parallel solution starts being advantageous can only be determined through experiments. When the region of interest is a single voxel, at most one fiber, crossing that voxel, can be detected.

For multiple ROIs, fibers crossing all ROIs have to be detected. There are two options: either all fibers crossing one ROI are detected and it is then checked whether those fibers also cross the other ROIs, or the fibers are tracked for each ROI and the intersection of the sets of fibers for all ROIs is then computed. For a sequential solution the first approach, which is inherently sequential, should provide better throughput.
The second approach may be parallelized by splitting the ROIs among the participating processes, but the intersection of the fiber sets has to be computed sequentially, after all the processors finish their computation, which requires a large communication overhead and strict synchronization. This solution is not scalable. Also in this case, experiments can provide the numbers of starting points and regions of interest for which a parallel solution that splits the ROIs among participating processes improves performance, but parallelization probably does not pay off for a realistic (small) number of regions of interest.
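By contrast, the computational decomposition described above, which splits only the starting points, parallelizes cleanly. A minimal sketch follows; Volume, Seed, Fiber and trackFiber are placeholders for the real data structures and tracking routine, and threads stand in for the distributed processors of the actual solution.

    // Sketch of computational decomposition for full-volume fiber tracking:
    // every worker holds the whole data volume, and only the list of seed
    // voxels is split among them. All types are placeholders.
    import java.util.*;
    import java.util.concurrent.*;

    public class SeedDecomposition {
        static List<Fiber> trackAll(Volume volume, List<Seed> seeds, int workers)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            List<Future<List<Fiber>>> parts = new ArrayList<>();
            int chunk = (seeds.size() + workers - 1) / workers;
            for (int w = 0; w < workers; w++) {
                List<Seed> mySeeds = seeds.subList(
                        Math.min(w * chunk, seeds.size()),
                        Math.min((w + 1) * chunk, seeds.size()));
                // no communication between workers: each one tracks its own
                // seeds over the shared, read-only volume
                parts.add(pool.submit(() -> {
                    List<Fiber> fibers = new ArrayList<>();
                    for (Seed s : mySeeds) fibers.add(trackFiber(volume, s));
                    return fibers;
                }));
            }
            List<Fiber> all = new ArrayList<>();
            for (Future<List<Fiber>> f : parts) all.addAll(f.get());
            pool.shutdown();
            return all;
        }

        // placeholder types and tracking routine
        static class Volume {} static class Seed {} static class Fiber {}
        static Fiber trackFiber(Volume v, Seed s) { return new Fiber(); }
    }

Because the volume is read-only and the seed subsets are disjoint, the workers need no synchronization until the final concatenation of their fiber lists.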
An algorithm with multiple regions of interest can be extended to use starting points from the entire domain, in order to detect crossing, splitting and touching fibers. It can be parallelized efficiently by computational decomposition, splitting the starting points. Similarly to what we explained above, a second level of parallelization that would split the regions of interest among processors does not seem to be advantageous. Separately detecting, for each set of starting points, the set of fibers crossing each of the ROIs and computing an intersection of the fiber sets at the end introduces both communication and computation overhead. Only a large number of ROIs and a large number of fibers crossing all (or most of) the ROIs may compensate for this overhead. We could investigate the existence of such a threshold, but the lack of scalability makes such a solution unsuited for parallelism. The Grid-enabled fiber tracking application described in this paper implements the full volume approach, and uses ROIs to select the relevant fibers.

2.2. Use Scenarios for the Fiber Tracking Application

In this section we present several scenarios describing possible uses of the fiber tracking application in healthcare, addressing the main requirements of those scenarios and the benefits of a distributed solution. Table 1 presents the results of this analysis.

2.2.1. The screening scenario

One of the applications of fiber tracking in healthcare is the processing of large numbers of data sets to obtain information on the geometry of the brain or to detect neurological afflictions. For this type of medical application, high accuracy and high throughput are needed. Full volume fiber tracking has a long execution time in the sequential version. With a parallel solution on the Grid, the execution time can be significantly reduced, despite the extra communication overhead.
2.2.1.2. Screening for detecting neurological disease and modifications of the brain The existing fiber tracking application is not yet suited for a screening procedure, in which a large number of patients would be scanned for detecting neurological problems, or just for a follow up of the changes in the brains of recovering patients (e.g. patients re-acquiring speech abilities lost as a consequence of accidents or brain surgery). These are still research issues, but we may expect that they will become current practice in healthcare. For this medical application to be relevant a large number of patients need to be scanned in a relatively short time, therefore the execution time of the algorithm has to be short to provide a good quality of service for individual patients and a high throughput for the screening process. Again high accuracy and high performance are required, which brings along the need for a parallel solution. 2.2.2. Intra-Operative Fiber Tracking Assisted Neurosurgery and Preoperative Planning In the case of lesions or tumors in the brain, surgical intervention is used as a last resort when other types of treatment have failed. The ability to distinguish between different tissue types during surgery, and more importantly during cutting, is of utmost importance to the surgeon as injury to healthy brain tissue may result in nerve or muscle paralysis and loss of mental functions. In addition to this, connectivity between different brain areas is of increasing interest as it provides additional information to infer the brain functional organization. White matter fiber bundles form the foundation for this connectivity. Various methods are in use to guide the surgeon during surgical intervention. Before the procedure begins, imaging techniques such as MRI and CT are used to determine the location and shape of the malignant tissue and its orientation with respect to healthy tissue. White fiber tracking can be used to image the connectivity between different brain regions. In lesion or tumor extraction, an opening is made in the skull and the malignant tissue is carefully removed without harming healthy tissue. To aid in the extraction, intra-operative ultrasound or X-ray is used to guide surgical instruments while in some cases laser projection on the brain and other guidance systems are used to direct the surgical instruments. Unfortunately, the brain changes shape both when the initial opening is made in the skull as well as during the extraction of tissue. This is especially a problem with "deep" tumors that are not located near the surface of the skull. Intra-operative MRI and Diffusion Tensor Imaging (DTI) techniques such as fiber tracking can aid the surgeon in determining the changed morphology of the brain. In this scenario, surgery takes place in an MRI facilitated operating theater. During surgery, the surgeon can decide to acquire new images of the brain by sliding the patient in-and-out of the MRI scanner. Each time a new MR scan has been acquired, all post-processing of the data needs to be completed within a strict timeframe so that surgery is not halted for too long. To accomplish this, a parallelized version of fiber tracking has to be used. The data set is first transferred to the parallel computation platform, where the fibers are tracked, after which the results are transferred back to the operating theater for presentation. 
This scenario will only be effective if the overhead caused by transferring the original data to the fiber tracking algorithm and the results back to the operating theater is much smaller than the performance gain obtained from parallelization.
2.2.3. The training and education scenario

In this scenario, fiber tracking is used to provide medical students, radiologists and surgeons with information about brain activity, connections among brain regions, and brain modifications resulting from accidents or neurological afflictions. It could also develop into a useful tool for training surgeons, by providing access to a database of interesting cases and follow-ups of past interventions, and even a virtual intervention tool. For this scenario too, high performance and high accuracy of the fiber tracking are desired. However, these requirements are less critical than in the previous scenarios.

Table 1. The importance of the main requirements for the different scenarios
Scenario                            Response time   Throughput   Accuracy   Critical
Screening for scientific research         +             ++          +++        No
Screening for disease detection           ++            +++         +++        No
Preoperative planning                     ++            +++         +++        No
Assisted neurosurgery                     +++           +++         +++        Yes
Training and education                    ++            +           ++         No
2.3. The environment

This research is carried out as part of the Virtual Laboratory for e-Science (VL-e, http://www.vl-e.nl/) project. The VL-e project addresses challenging problems, including the manipulation of large scientific datasets, computationally demanding data analysis, access to remote scientific instruments, collaboration, and data dissemination across multiple organizations. The methods, techniques, and tools developed within the VL-e project are targeted for use in many scientific and industrial applications. The project will develop the infrastructure needed to support these and other related e-Science applications, with the aim of scaling up and validating the developed methodology. The VL-e philosophy is that any Problem Solving Environment (PSE) based on the Grid-enabled Virtual Laboratory will be able to perform complex and multidisciplinary experiments, while taking advantage of distributed resources and generic methodologies and tools. In the VL-e project, several PSEs will be developed in parallel, each applied to a different scientific area. One of these areas is Medical Diagnosis and Imaging, which is the focus of this research. The sequential fiber tracking application was built with the Philips Research Imaging Development Environment (PRIDE), which allows for the creation and execution of prototype tools and other experimental software on a Windows NT-based machine using the Interactive Data Language (IDL, http://www.rsinc.com/idl/). This language has built-in algorithms and routines, a drag-and-drop GUI builder, data visualization capabilities, and cross-platform portability. Our experiments with the distributed fiber tracking were performed on the Distributed ASCI Supercomputer (DAS-2, http://www.cs.vu.nl/das2). DAS-2 is a five-cluster wide-area computer system located at five Dutch universities. The system was designed and deployed by the Advanced School for Computing and Imaging and is used for research
in parallel and distributed computing. It consists of 200 1.0 GHz dual-processor Pentium-III nodes split into five clusters, one with 72 nodes and the other four with 32 nodes each. SURFnet (http://www.surfnet.nl/, 100 Mb/s) interconnects the clusters, while Myrinet (http://www.myri.com/, 2 Gb/s) is used for local communication. The operating system on the DAS-2 is RedHat Linux version 7.3. Programs are started using the SGE batch queuing system, which allocates the requested number of nodes for the duration of a program run. We submitted the jobs to the DAS-2 using the Globus toolkit, which is installed on all DAS-2 clusters. The client workstation is connected to the GAP by a 10 Mb/s network, while the GAP is connected to the DAS-2 system by a 100 Mb/s network. Finally, as part of the cooperation within the VL-e project, we have deployed the Grid-enabled fiber tracking application and run experiments at the Amsterdam Medical Center in Amsterdam, while keeping the Grid Access Point running at Philips Research Laboratories in Eindhoven and allocating Grid resources from the DAS-2 system. These experiments, in a realistic hospital environment and with standard network solutions, have also shown that the fiber tracking application can obtain significant performance benefits from the use of Grid technologies.

2.4. Architecture and design details

The specific architecture resulting from applying the generic GAMA architecture to the fiber tracking application is depicted in Figure 2. The purpose of this experiment is to validate the GAMA architecture and to assess the performance and scalability of the Grid-enabled application. We started from a sequential version implementing full volume fiber tracking, i.e. fibers are tracked from every voxel in the data domain. The Grid-enabled application should gain performance by distributing its computational part, the fiber tracking algorithm, across Grid computational resources.
Figure 2. GAMA applied to fiber tracking
We first developed a distributed solution for the fiber tracking algorithm and assessed its performance in a Grid environment for different numbers of processors.
Our previous experiments [3] had shown that tracking long fibers takes noticeably longer than tracking short fibers or checking areas with no fibers. It is also the case that fibers are grouped in large bundles. Since in our solution the results are only sent back to the GAP when all the tasks computing valid starting points have completed, the longest task determines the execution time of the module. This implies that simply splitting the computational domain into a number of sub-domains equal to the number of processors is not an efficient solution: processors receiving parts of the domain with many long fibers perform a large amount of work, while processors receiving parts of the domain with no fibers spend most of their time waiting.

As an alternative, we have split the domain along one of the axes into slices one voxel wide and implemented a workpool-based solution. One of the nodes executing the algorithm (the distributor) is only responsible for sending the parameters and the data to the other nodes (the workers), distributing the workload, i.e. the slices with starting points, and collecting and sending on the results at the end of the computation (see Figure 3). Each worker node takes one slice at a time from the distributor, tracks the fibers from the starting points in that slice, and computes and stores the valid starting points. This repeats until the worker node receives the termination message from the distributor, indicating that all the slices have been distributed. The worker node can then return the valid starting points identified and finish. The distributor assembles all the partial results received from the worker nodes, sends them to the GAP, and terminates.

Next, we extended the initial fiber tracking application with the distributed fiber tracking module. The original code of the application has been modified to use the new, gridified module instead of the local algorithm. The role of the GAP is to receive the requests and the data from the client side of the application, to start the distributed module on Grid resources, to send the requests and the data to the distributor on the Grid, and to collect the results and send them back to the client. The GAP can serve multiple client applications simultaneously. The communication between the client side of the application and the GAP, and between the GAP and the distributor, is packet based. The total size of the datasets used in our experiments amounts to 22.4 MB; the sizes of the other packets are comparatively very small.
Figure 3. The flow of starting points and results for the distributed fiber tracking algorithm
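The workpool scheme above maps naturally onto a message-passing implementation. The following is a minimal sketch under the assumption of an MPI-style environment via mpi4py; the actual module is not written in Python, and track_fibers_in_slice is a hypothetical stand-in for the tracking kernel.

    # Minimal workpool sketch (an illustration, not the authors' code):
    # rank 0 is the distributor, the other ranks are workers that request
    # one slice of starting points at a time. Requires mpi4py.
    from mpi4py import MPI

    TAG_WORK, TAG_DONE, TAG_RESULT = 1, 2, 3

    def track_fibers_in_slice(dataset, slice_index):
        # Hypothetical stand-in for the tracking kernel: it would return
        # the valid starting points found in the given slice.
        return []

    def run(dataset, num_slices):
        comm = MPI.COMM_WORLD
        if comm.Get_rank() == 0:                  # distributor
            next_slice, results = 0, []
            for w in range(1, comm.Get_size()):   # seed every worker
                if next_slice < num_slices:
                    comm.send(next_slice, dest=w, tag=TAG_WORK)
                    next_slice += 1
                else:
                    comm.send(None, dest=w, tag=TAG_DONE)
            for _ in range(num_slices):           # collect, re-issue work
                status = MPI.Status()
                results.extend(comm.recv(source=MPI.ANY_SOURCE,
                                         tag=TAG_RESULT, status=status))
                worker = status.Get_source()
                if next_slice < num_slices:
                    comm.send(next_slice, dest=worker, tag=TAG_WORK)
                    next_slice += 1
                else:
                    comm.send(None, dest=worker, tag=TAG_DONE)
            return results   # assembled partial results, sent on to the GAP
        else:                                     # worker
            while True:
                status = MPI.Status()
                job = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
                if status.Get_tag() == TAG_DONE:
                    return
                comm.send(track_fibers_in_slice(dataset, job),
                          dest=0, tag=TAG_RESULT)

Launched with one distributor plus n workers (e.g. mpiexec -n 9 python workpool.py), this reproduces the one-slice-at-a-time behaviour that keeps the workers busy regardless of where the long fibers lie.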
3. Results

In this section we assess the performance of our gridified fiber tracking application. In Figure 4, the fiber tracking application is used to detect fibers in the proximity of a brain tumor (the pre-operative planning scenario). Figure 5 depicts a screenshot of the fiber tracking application.
Figure 4. Fiber tracking through a single ROI for a patient with a brain tumor
Figure 5. Screenshot of the Grid-enabled fiber tracking application
We first evaluate the performance of the distributed algorithm implementing the computational part of the fiber tracking application. In this case, only the communication overhead among the worker nodes is considered. The scalability results for the distributed fiber tracking algorithm for up to 128 worker nodes (plus one distributor) are depicted in Figure 6. The speedup of our solution is almost linear, and the communication overhead for transferring data and results between the distributor and the worker nodes is very small. In all these cases we have also studied the load balancing among the worker nodes. Due to the dynamic load balancing we implemented, the durations of the tasks were similar for up to 128 nodes. Therefore, we concluded that the size of the work items (slices with starting points) we chose is small enough for good load balancing in this case. Decreasing the work-item size would yield higher communication overhead between the distributor and the worker nodes, while increasing it may result in some of the workers spending time waiting for the others to finish; both cases have a negative influence on performance. For this problem, the part of the algorithm that tracks a fiber from a starting point is inherently sequential and sets a hard limit on the speedup. With increasing numbers of worker nodes, the work-item size may eventually reduce the speedup, even with dynamic load balancing. However, taking into account the performance of the application on 128 nodes, we conclude that it is not useful to increase the number of nodes performing the computation further, and that the chosen work-item size is well suited to our case study.
Figure 6. The speedup of the distributed fiber tracking application
Table 2 depicts the performance of the end-to-end solution, taking into account, besides the communication overhead of the distributed solution, the transfer overhead for the parameters and results between the client and the GAP, and between the GAP and the distributor node. The results show that, despite the extra overhead introduced, the Grid-based fiber tracking outperforms the local, sequential solution when it uses at least 3 worker nodes.
Table 2. The response time of the distributed algorithm at the client side
Version       Resources   No. of nodes   Response time [s]
Sequential    Local              1              464.50
Distributed   Grid               1             1266.48
Distributed   Grid               2              636.85
Distributed   Grid               4              319.69
Distributed   Grid               8              164.31
Distributed   Grid              16               81.13
Distributed   Grid              32               42.16
Distributed   Grid              64               22.51
Distributed   Grid             128               11.79
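For reference, the speedup and parallel efficiency implied by these response times can be computed directly from the published numbers; the small script below does only that arithmetic, with the sequential-fraction limit discussed above in mind.

    # Speedup and efficiency derived from the Table 2 response times,
    # with the local sequential run (464.50 s) as the baseline.
    sequential = 464.50
    grid = {1: 1266.48, 2: 636.85, 4: 319.69, 8: 164.31,
            16: 81.13, 32: 42.16, 64: 22.51, 128: 11.79}
    for nodes, seconds in grid.items():
        speedup = sequential / seconds
        print(f"{nodes:4d} nodes: speedup {speedup:5.1f}, "
              f"efficiency {speedup / nodes:5.2f}")
    # 2 nodes are still slower than the local run, 4 nodes already beat
    # it (speedup ~1.45), and 128 nodes reach a speedup of about 39.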
Table 3. Comparison of the relevant steps in the sequential and the Grid-based fiber tracking
Application             Comput.   Transfer dataset (22.4 MB)   Visualiz.   Transfer params & results   Communic.
Sequential              464.5s    0                            24.5s       0                           0
Grid-based, 128 nodes   8.8s      14.5s                        24.5s       < 1s                        < 2s
The rendering of the fibers from the valid starting points (visualization) is not part of the computational algorithm, and was kept at the client side. Therefore, the duration of this step is identical for the local and for the Grid-enabled solution. Table 3 compares the duration of the main steps for the sequential fiber tracking and for the distributed fiber tracking on 128 nodes. These estimates show that the overhead introduced by transferring the data to the Grid and the communication overhead of the distributed solution are rather small, even for standard (low bandwidth) network solutions. Visualization, which in the sequential fiber tracking was almost unnoticeable to the end-user in comparison with the computational algorithm, became the longest step in the gridified application.
4. Conclusions and Future Work
In this paper we have described our service-oriented architecture, which aims to enable compute-intensive medical applications to make use of the Grid. We have applied this architecture to a real clinical application, allowing it to access Grid resources for its computational part while preserving its Windows-based interface. In this way, the use of Grid technologies and resources is transparent to the end-user of the application, and the current way of operating at the hospital side is maintained. Our end-to-end solution exhibits significant performance improvements when run on standard compute systems using standard network connections. We have also deployed the Grid-enabled application in a hospital environment, at the Amsterdam Medical Center, with similarly good results.
Although not apparent for the fiber tracking application, we expect that, for applications processing very large data sets, the communication overhead of transferring images to the computing back-end would significantly impact performance and reduce the speedup obtained from parallelization. Therefore, GAMA would benefit from an architecture where the data is stored "closer" (in terms of the time required to communicate) to the computing back-end. Such a situation could occur when hospitals store medical data on remote storage resources, possibly part of a Grid. Having the data close to the computational resources would decrease the communication overhead that now occurs in GAMA. The fiber tracking test case described in the previous sections illustrates an example of computational decomposition, one of the three decomposition paradigms that we have identified. Our current and future work includes applying the GAMA architecture to applications fitting the other two decomposition paradigms, extending the GAP to serve multiple types of applications, and implementing a distributed GAP to deal with the bandwidth bottleneck when a large number of applications run simultaneously.
5. Acknowledgements
Part of this work was carried out in the context of the Virtual Laboratory for e-Science project. This project is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ). Philips Medical Systems provided the data sets used in our experiments and the sequential fiber tracking application that constitutes the basis for our Grid-enabled case study application.
References
[1] S. Mori and P.C. van Zijl. Fiber tracking: principles and strategies - a technical review. NMR Biomed, 15(7-8):468–480, 2002.
[2] D. Xu, S. Mori, M. Solaiyappan, P.C. van Zijl, and C. Davatzikos. A framework for callosal fiber distribution analysis. Neuroimage, 17(3):1131–1143, 2002.
[3] A.I.D. Bucur, R. Kootstra and R. Belleman. A grid architecture for medical applications. HealthGrid, 2005.
[4] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl. J. Supercomputer Applications, 11(2):115–128, 1997.
[5] C. Preibisch, U. Pilatus, R. Bunke, F. Hoogenraad, F. Zanella, H. Lanfermann. Functional MRI using sensitivity-encoded echo planar imaging (SENSE-EPI). Neuroimage, 19(2):412–421, 2003.
[6] F.G.C. Hoogenraad, R.F.J. Holthuizen, R. Brijder. High angular resolution diffusion weighted MRI. International Patent, WO2005076030, 2005.
Challenges and Opportunities of HealthGrids
V. Hernández et al. (Eds.)
IOS Press, 2006
© 2006 The authors. All rights reserved.
Early Diagnosis of Alzheimer's Disease Using a Grid Implementation of Statistical Parametric Mapping Analysis

S. BAGNASCO a,1, F. BELTRAME b, B. CANESI b, I. CASTIGLIONI c, P. CERELLO a, S.C. CHERAN a,d, M.C. GILARDI c, E. LOPEZ TORRES e, E. MOLINARI b, A. SCHENONE b, L. TORTEROLO b

a Istituto Nazionale di Fisica Nucleare, Sezione di Torino, Torino, Italy
b BIOLAB, Dipartimento DIST, Università di Genova, Italy
c IBFM CNR, Università di Milano Bicocca, Istituto H San Raffaele, Milano, Italy
d Dipartimento di Informatica, Università di Torino, Italy
e CEADEN, Habana, Cuba

1 Corresponding author: Stefano Bagnasco, Istituto Nazionale di Fisica Nucleare, Via Pietro Giuria 1, 10125 Torino, Italy. E-mail: bagnasco@to.infn.it
Abstract. A quantitative statistical analysis of perfusional medical images may provide powerful support to the early diagnosis of Alzheimer's Disease (AD). A Statistical Parametric Mapping (SPM) algorithm, based on the comparison of the candidate case with normal cases, has been validated by the neurological research community to quantify hypometabolic patterns in brain PET/SPECT studies. Since suitable "normal patient" PET/SPECT images are rare and usually sparse and scattered across hospitals and research institutions, the Data Grid distributed analysis paradigm ("move code rather than input data") is well suited for implementing the remote statistical analysis use case described in the present paper. Different Grid environments (LCG, AliEn) and their services have been used to implement this use case and to tackle the challenging problems related to SPM-based early AD diagnosis.
Keywords. Alzheimer’s disease, statistical analysis, distributed databases, grid computing
Introduction

Alzheimer's Disease (AD) is the leading cause of dementia, accounting for more than half of all dementias in elderly people. Clinically, AD is characterized by a progressive loss of cognitive abilities, memory loss typically being the earliest sign of the disease. The qualitative analysis of medical images hardly provides useful suggestions for the diagnosis of AD. On the other hand, a statistical comparison of PET/SPECT images from
suspect AD patients with PET/SPECT images from a database of normal cases is a powerful tool for the early diagnosis of AD. With this goal, the use of Statistical Parametric Mapping (SPM) analysis for the quantification of hypometabolic patterns in brain PET/SPECT studies of patients in early stages of AD has been proposed in the literature [1]. In Section 1 the scenario and the software tools of the clinical application are described. Section 2 describes the general features of the Grid architecture. Section 3 concerns middleware issues and gives a detailed description of the two different implementations. In Section 4 some preliminary conclusions are presented.
1. The Statistical Analysis Use Case

The SPM software library was originally developed, and is made freely available, by the Functional Imaging Laboratory (FIL) at the Wellcome Department of Imaging Neuroscience (University College London) for activation studies in functional MRI [2]. Since then, the use of SPM has been extended and, through a specifically defined analysis protocol, SPM routines are presently the standard within the neurological research community for voxel-based analysis of PET/SPECT studies for the early diagnosis of AD. In order to achieve correct results, the SPM software library provides a number of functionalities related to image processing and statistical analysis: normalization, coregistration, smoothing, parameter estimation, and statistical mapping.
Figure 1. Results of an SPM analysis on a PET study of glucose metabolism in a patient with dementia. Hypometabolic pattern in the frontal cortex: design matrix (top right), statistically significant clusters on a glass brain in three orthogonal planes (top left) and on a 3D brain rendering (bottom)
The statistical parametric mapping algorithm (the most important functionality for our goal) performs a statistical analysis in order to compare, on a voxel-by-voxel basis, the perfusion values in the test images against the corresponding values in normal images. A number of parameters, such as the age of the patient and the average cerebral flow, are taken into account. The whole software analysis sequence has been scientifically validated, and even a small alteration would require a new evaluation by the scientific and clinical community. The results of an SPM statistical analysis, shown in Figure 1, include hypometabolic maps and the related views of the brain. As an understanding of the functions used in the image analysis is very important to provide correct parameter values and to interpret the results, only selected users should have access to the SPM analysis, in order to avoid errors in diagnosis. On the other hand, remote access to SPM analysis could provide doctors from peripheral hospitals with an invaluable tool to increase the "comparison database" and therefore improve AD diagnosis. On these bases, as a result of a previous research project [3], remote access to SPM is being made available through the Italian Portal of Neuroinformatics. The portal contains a section entirely dedicated to the statistical analysis of PET/SPECT images, accessible by authorized users only. Doctors or researchers accessing the portal may thus be supported in running analysis tasks on suspect AD patient studies. Directly from the portal, a user can upload the suspect AD image and select the normal cases for the statistical calculation (Figure 2). The SPM application is available to authorized users without downloading any software tool; no particular hardware resource or specific computer knowledge is needed in order to use it.
Figure 2. Functional data flow of access to SPM application through portal
The SPM Graphical User Interface (GUI) manages the decisional and computational data input and the graphical output. In order to make the application available to users on the net, the first step is the replacement of the SPM GUI with a web portal interface. ZOPE, an open source application server for building content management systems, intranets, portals, and custom applications [4], has been used for the construction of the portal, implemented in the Python programming language. The information for the statistical analysis is therefore collected through a configuration file created by the web GUI (Figure 3).
Figure 3: Connection between system and portal.
The main issue with this configuration is the large set of options for the execution of the SPM algorithm. In order to help users conduct a correct statistical analysis, most parameters have been set to default values and only a few are set by the user, as illustrated in the sketch below. A script that drives the input data collection for SPM from normal images has been implemented.
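As an illustration of this defaults-plus-overrides approach, the following is a hypothetical sketch of the parameter collection step; every parameter name, default value and file name is a placeholder, since the real file format is defined by the portal code.

    # Hypothetical sketch: merge the few user-set fields with validated
    # defaults and write the configuration file read by the SPM driver.
    # All parameter names and default values are illustrative only.
    DEFAULTS = {
        "normalization": "standard",
        "smoothing_fwhm_mm": 12,
        "threshold_p": 0.001,
    }
    USER_EDITABLE = {"patient_age", "mean_cerebral_flow", "threshold_p"}

    def write_spm_config(path, user_params):
        params = dict(DEFAULTS)
        for key, value in user_params.items():
            if key not in USER_EDITABLE:
                raise ValueError(f"not settable through the portal: {key}")
            params[key] = value
        with open(path, "w") as cfg:
            for key, value in sorted(params.items()):
                cfg.write(f"{key} = {value}\n")

    write_spm_config("spm_analysis.cfg",
                     {"patient_age": 71, "mean_cerebral_flow": 52.0})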
2. The Grid Implementation

In order to evaluate the potential advantage of porting such an implementation to a Grid environment, it is worth noting that during the statistical parametric mapping a large set of images of normal patients is required for comparison. This is because the accuracy of the hypoperfusion maps is strictly related to the number of normal studies compared with the test image. On the other hand, due to ethical issues and to the high costs of neuroimaging technologies, PET and SPECT studies on normal subjects are very rare. The NEST-DD project, funded by the European Commission, collected a database of about 100 images in order to make available the first large dataset for these studies. Moreover, the images of normal subjects are covered by privacy and security constraints, and for this reason they cannot be freely moved on the net or published by the centre that performed the analysis. As a consequence, only doctors working at very large institutions, locally owning large databases of normal images, can usually carry out SPM-based analyses. Starting from these considerations, the aim of our project has been to enable doctors from small peripheral hospitals to use large sets of normal PET/SPECT images provided by medical research institutes distributed on the net, by remotely extracting the information needed for the statistical analysis from the normal images and collecting it without moving the original image files.
Figure 4. A Grid implementation of the SPM portal services.
Furthermore, the execution time of the analysis must be compatible with an interactive clinical application in a busy medical environment. The time required for the analysis can be reduced, since:
• some aspects of the calculation could be parallelized and distributed on the computational resources associated with the remote databases of normal images;
• the time required for data transfers over the network would be reduced, since the code amounts to just a few KB, compared to images of up to 100 MB (see the sketch below).
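As a rough illustration of the second point, the following back-of-the-envelope arithmetic compares moving the code with moving the images; the bandwidth figure is an assumed placeholder, not a measurement.

    # Back-of-the-envelope comparison: ship the code vs. ship the images.
    # The bandwidth below is an assumed figure, for illustration only.
    bandwidth_mb_s = 1.0      # assumed effective network bandwidth, MB/s
    code_size_mb = 0.005      # "a few KB" of executable code
    image_size_mb = 100.0     # a normal-subject image of up to 100 MB
    print(f"code transfer:  {code_size_mb / bandwidth_mb_s:7.3f} s")
    print(f"image transfer: {image_size_mb / bandwidth_mb_s:7.3f} s")
    # With these assumptions each image costs ~100 s to move, while the
    # code is effectively free: hence "move code rather than input data".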
The use of Grid technologies matches all of the above issues well and allows easy access to distributed data as well as to distributed computational resources. In particular, through data-management Grid services, doctors can access normal PET/SPECT image databases without moving images between hospitals, thus complying with privacy regulations. Through computational Grid services, statistical information can be extracted from the normal images, and the image matrix and other information needed for the statistical analysis can be transferred to the management node without moving the images. Moreover, this process can be executed in parallel on every repository machine to improve computing performance. The basic architecture of the Grid implementation of the SPM portal service is described in Figure 4. The different steps needed to complete the analysis sequence are listed below (a command-level sketch of these steps follows at the end of this section):
1. Acquisition of the test image on the user node
2. Transfer of the test image to the management node
3. Query on the DB catalogue of normal images
4. Transfer of a small software executable for information extraction to the repository nodes
5. Extraction from the normal images of the information needed for the statistical analysis
6. Transfer of the extracted information to the management node
7. SPM statistical analysis on the management node
8. Transfer of the SPM results to the user node
Thus, in terms of Grid elements, repository nodes are Grid sites, comprising at least a Storage Element (SE) and a Computing Element (CE) service, while the management node runs a User Interface (UI) functionality, since it must access remote central services (the data and metadata catalogues and possibly some job submission system). With this configuration, there is no need to install Grid-specific software on user nodes, since all services are accessed through the web portal on the management node. The portal and the remote central services are connected through queries, built with specific ZOPE functionalities, to the data and metadata catalogues.

2.1. Security and user authentication

The user authenticates to the ZOPE portal via simple username/password authentication. For Grid interactions, however, further authentication is needed via the user's X509 certificate and a MyProxy delegated credential mechanism [5]. Briefly, the user registers a renewable proxy in a MyProxy server; the portal then obtains a delegated proxy, being authorized by the user with a specific password. From then on, that proxy is used to authenticate to both systems (AliEn and LCG), and the security infrastructure is the one provided by the underlying middleware.
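A compressed sketch of how the management node might drive steps 2-8 with the stock LCG command-line tools (lcg-utils and the EDG job commands) is given below. Every VO, LFN, JDL and file name is a hypothetical placeholder, and error handling and job-status polling are omitted.

    # Sketch of steps 2-8 driven from the management node via standard
    # LCG command-line tools. VO, SE, LFN and file names are placeholders.
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Step 2: register the suspect-AD test image on a Storage Element.
    run("lcg-cr", "--vo", "example-vo",
        "-l", "lfn:/grid/example-vo/test_patient.img",
        "file:///data/incoming/test_patient.img")

    # Steps 4-6: one extraction job per repository node; the small
    # executable travels to the data, not the data to the executable.
    run("edg-job-submit", "-o", "jobids.txt", "extract_info.jdl")
    # (in practice the job status would be polled with edg-job-status
    # before retrieving the output)
    run("edg-job-get-output", "-i", "jobids.txt", "--dir", "results/")

    # Steps 7-8: the SPM statistical analysis then runs on the management
    # node using the collected information, and the results go back to
    # the user node through the portal (not shown).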
3. Middleware Issues and Implementation Choices

Since the beginning of Grid research and activities, a number of different software suites have appeared, sometimes in the form of low-level toolkits (like Globus, some components of which are becoming de-facto standards), sometimes as individual services and, in some cases, as fully-fledged end-to-end solutions. Given the European context of the project, and the background of many of the authors, two software suites were evaluated: LCG/gLite [6], [7] and AliEn [8], the latter being already used by other Grid applications developed by the MAGIC-5 collaboration. As correctly pointed out in the HealthGrid white paper [9], although the ultimate goal would be the creation of a single EU-wide HealthGrid comprising all eHealth resources, the development path will include a number of application- or community-specific, independent Grids. Currently available middleware still lacks many of the security and privacy enforcing features needed by biomedical (and even more by eHealth) applications. The choice of starting with a single Grid application, thus having to cope with only a relatively small number of sites and users, does reduce the number of security and privacy constraints. Another important constraint is imposed by the need to deploy Grid elements in hospitals, where the environment is often very different from the one usually found in research centres. Very few hospitals, if any, are willing to devote a large amount of resources (in terms of network, computing resources, and manpower) to the installation and maintenance of complex systems. Since this is a research project, the software to install evolves quickly, and the maintenance of such a system may be nontrivial. Thus, ease of deployment and maintenance is one of the most important constraints in the choice of the middleware to be used. In compute- and data-intensive applications on very large infrastructures (like HEP applications on the worldwide LHC Computing Grid), with several Virtual Organizations competing for resources, one of the main issues is the distribution of jobs across sites and the optimal usage of the available resources. In our case, as in many other eHealth Grid applications, the crucial point is reliable and efficient data and metadata management, in this case for identifying suitable normal images across hospitals. As a basic choice, in order to avoid too much duplication of functionality and code, as many functions as possible have been designed to be common to the two implementations; thus, for example, if the job splitting (see below) is done by the portal code and not by the JobSplitter (AliEn) or by writing a DAG (LCG), the piece of code is identical, with a switch governing the type of JDL file to be produced.

3.1. The AliEn-based implementation

The following observations, along with the availability of an existing AliEn infrastructure for GPCALMA, the MAGIC-5 mammographic CAD project, suggested the choice of AliEn as one of the technologies for the prototype service, currently being developed:
• A standard AliEn site can be installed in less than an hour on a single machine by an inexperienced user, thus allowing for very fast deployment of prototype sites.
• AliEn has an integrated data and metadata catalogue that has been tested and used by the ALICE collaboration for a few years, as well as by the MAGIC-5 mammography project GPCALMA [10].
• The AliEn data access implementation relies on the widely used xrootd [11] protocol.
• AliEn provides tight integration with ROOT [12], the data analysis C++ framework adopted by the MAGIC-5 collaboration for software development, and with PROOF [13], the Parallel ROOT Facility.
• AliEn is interfaced with the LCG/gLite infrastructure and middleware, thus allowing, should the need arise, the future integration of our prototype into a larger system based on a different technology.
However, on the assumption that middleware services and their interfaces will evolve, we designed our interface so as to minimize it and keep it as modular as possible. The VO server is hosted at INFN Torino on hardware shared with other MAGIC-5 applications (mammogram and lung CT analysis). It runs the user and configuration databases, along with the central data and metadata catalogue. Storage Elements for the prototype deployment are currently installed at BIOLAB Genova and INFN Torino.

3.1.1. AliEn-based Data Management

AliEn provides an integrated solution that offers data and metadata management services for a Virtual Organisation. Its performance, up to several million entries in the catalogues, is being continuously and thoroughly tested by the ALICE Collaboration [14]. The integrated data and metadata catalogue comprises two layers: a central catalogue and distributed Local File Catalogues. The central file catalogue holds the correspondence between a Logical File Name (LFN), a unique identifier (GUID) and a list of Storage Elements holding a replica of the file. The LFN syntax is a filesystem-like tree structure, which is mirrored in the DB structure with tables representing nodes (logical directories) that can hold pointers to leaves (file entries) or to other nodes; this approach also has the advantage of providing an easily browsable structure. In AliEn, metadata management is built into the central catalogue DB structure, with no need for an extra product. Metadata are implemented by further tables, linked to the relevant logical tree nodes. Privacy of sensitive data can be ensured (as is done, e.g., in the GPCALMA project) by separating the data into two different subtrees, with specific access privileges for different users. In the second layer, correspondences between GUIDs and Physical File Names (PFNs, in the form of Storage URLs) are stored, using a distributed approach in which the DB is kept local to the site hosting the relevant SE. Thus the load on the central services is reduced and the system allows local management of the physical storage; alternatively, the site catalogues can be centralised using additional tables in the central DB, in case the
remote site does not provide a DB service. Remote catalogue implementations can be based on a number of DB backends, including the LCG File Catalogue (LFC). File access can be implemented by simply pre-staging the file from the local SE to the WN (which, in this deployment, is always on the same network); this can be done automatically by the AliEn JobWrapper. Alternatively, and especially if the job is run via PROOF, the xrootd access protocol and its POSIX-like APIs can be used to access files remotely without moving them from the SE. For our application, a dedicated service (with a MySQL backend for the central services) was deployed on the MAGIC-5 server, which can be accessed through regular AliEn clients (e.g. the AliEn shell extensions) or via application-specific GUIs which can be integrated in access portals. Two tasks are to be performed on the Data Catalogue from the web portal running on the management node:
• query the metadata catalogue to find images relevant to the current statistical analysis;
• select which images to use, find their Physical File Names for access, and possibly download the images if the remote centre allows it (while this is not part of the described use case, it is useful to have the functionality available, e.g. for debugging).
Both functionalities have been implemented as perl scripts, which can be used either as independent command-line tools or integrated with the web portal, thus hiding the atomic native catalogue functions (even though the internal language of the portal engine is Python, exploiting the native AliEn perl APIs justified the small extra effort for integration). Integration with the portal allows seamless GUI interaction with the available catalogue functionalities, the analysis software and the portal services.

3.1.2. PROOF-based Analysis

Once the data management service has provided the required information about the input images, the SPM analysis algorithm can be started. It is possible to opt for a batch analysis, by sending a set of jobs, one per remote image, as described in the LCG-based implementation (see the next section). The configuration of the individual jobs can be done either by having the system generate a set of JDL files or by exploiting the AliEn "job splitting" feature. Alternatively, one can make use of PROOF [13] for a distributed interactive analysis. A very small PROOF cluster, with 3 nodes on 2 different domains, was configured in order to implement and test this functionality. The access takes place on the master node and goes through the ROOT shell. The output of a query to the data management services (a list, each entry being the site and the physical file name of a selected image) is used to dynamically generate the analysis script from a template that implements the algorithm comparing the images (pictured in the sketch below). The script is executed in parallel on the different input files stored on the three sites, and the results are sent back to the master node.
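The dynamic generation of the per-image analysis tasks can be pictured as follows; this is an illustrative sketch, not the project's code, and the template body and every site and file name in it are assumptions.

    # Illustrative sketch: turn the data-management query output, a list
    # of (site, physical file name) pairs, into per-image analysis tasks.
    from string import Template

    TEMPLATE = Template(
        "// analysis task generated for site $site\n"
        "compare_with_normal(\"$pfn\", \"$test_image\");\n")

    def generate_tasks(query_output, test_image):
        return [TEMPLATE.substitute(site=site, pfn=pfn,
                                    test_image=test_image)
                for site, pfn in query_output]

    for task in generate_tasks(
            [("Torino", "root://se.to.example.org//pet/normal042.img"),
             ("Genova", "root://se.ge.example.org//pet/normal113.img")],
            "suspect_ad.img"):
        print(task)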
Presently, a full integration with the web portal is not yet available. When testing is completed, the node that hosts the web portal will also become the PROOF master node, and the interactive analysis will be triggered by a user request on a web form.

3.2. The LCG-based implementation

The LCG project provides a series of sites and services spread all over the world, on which the LCG and gLite middleware are installed and several Virtual Organisations are enabled. The LCG middleware makes it possible to couple a wide variety of machines effectively, including supercomputers, storage systems and data sources, to create a uniform interface for connecting heterogeneous data resources over a network, and to access data and metadata. As a subset of this community, a test-bed entirely dedicated to disseminating the potential of Grid computing, named GILDA, has been deployed by the EGEE project [15]. An LCG node has been installed at the BioLab laboratory, University of Genoa, and is now an official site of the GILDA test-bed for biomedical applications. The objectives of the LCG implementation of the above-described SPM application are:
• to distribute PET/SPECT images over the different storage resources available on the Grid and register them in a catalogue;
• to associate metadata to the images, in order to search for and select the images for comparison using their own attributes;
• to access images from the User Interface using Logical File Names (LFNs), without copying them to the Worker Nodes.

3.2.1. LCG-based Data Management

The LFC (LCG File Catalog) was selected: it allows users and applications to locate files (or replicas) on the LCG Grid, maintaining mappings between logical and physical file names. As a next step we integrated AMGA (ARDA Metadata Grid Application) [16], a component that also fulfils the second requirement. LCG does not currently provide a satisfactory metadata management system, and AMGA fills this gap. The collected metadata are associated with files stored on the LCG Grid through a reference in the LFC catalogue system, and are used to select images directly through the portal. An important feature provided by AMGA is the ability to grant only certain people access to specified attributes. This is very important because all medical data should be considered sensitive, in order to preserve patient privacy. Furthermore, on a Grid, the distributed nature of the data makes security problems more acute. In this context the federation and proxying functionalities provided by AMGA are very important, because they allow highly confidential data to remain in its original place (the hospitals) and avoid copies of the data in other database backends.
To meet the third requirement, the LCG Data Management and File Access tools have been used. In order to understand the architecture of the LCG implementation, the different APIs available for Data Management operations in LCG-2 are shown in Figure 5.
Figure 5. Available LCG APIs for Data Management.
lcg_util is a C Application Program Interface (API) that provides the same functionality as the lcg-* commands (lcg-utils). This layer should cover the most basic needs of user applications. It transparently interacts with the LFC catalogue and makes use of the correct protocol for file transfer. The Grid File Access Library (GFAL) [17] provides calls for catalogue interaction, storage management and file access, and can be very handy when an application requires access to some part of a big Grid file but does not want to copy the whole file locally. The library hides the interactions with the LCG-2 catalogues, the SEs and the SRMs, and presents a POSIX-like interface for the I/O operations on the files. GFAL accepts GUIDs, LFNs, SURLs and TURLs as file names and, in the first two cases, it tries to find the closest replica of the file. Depending on the type of storage on which the file's replica resides, GFAL will use one protocol or another to access it. GFAL can deal with GSIFTP, secure and insecure RFIO, or gsidcap in a way that is transparent to the user. In order to make use of the above-mentioned LCG tools, the application code has been modified and structured in the following way:
1. Registration and storage of data files (PET/SPECT images) on Storage Elements, using lcg_utils.
2. Metadata handling: AMGA has been used to insert metadata and to interact with the portal, so that the user can select the normal images for the statistical analysis.
3. Development of a C program which makes use of the GFAL API in order to access the distributed images through their Logical File Names and to extract the information needed by the SPM analysis without copying the images locally.
4. Job Submission: creation of a JDL file to submit the executable (and not the images) to the Grid. Due to the sequential nature of the process, the job can be split into a number of smaller sub-jobs in order to execute them in parallel, directly on the CEs closest to the SEs where the images reside (a minimal JDL sketch for one such sub-job is given below). We are also evaluating the possibility of adopting a DAG-based solution for job submission and synchronization, but this still needs assessment.
5. Statistical Analysis: running of the final SPM analysis steps on the results obtained from the remote jobs. The statistical analysis is performed locally, outside the Grid environment.
Figure 6 represents the Grid infrastructure of the LCG implementation.
Figure 6. Structure of the grid implementation.
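For illustration, a minimal JDL file for one such per-image sub-job might look as follows; all file names and the CE identifier are hypothetical placeholders, and the Requirements expression is one possible way of pinning the job to a CE close to the SE holding the image.

    Executable    = "extract_info.sh";
    Arguments     = "lfn:/grid/example-vo/normals/subject042.img";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"extract_info.sh", "extract_info"};
    OutputSandbox = {"std.out", "std.err", "subject042.stats"};
    Requirements  = other.GlueCEUniqueID ==
                    "ce.example.org:2119/jobmanager-lcgpbs-short";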
4. Conclusions and outlook

A new approach to the implementation of SPM-based early Alzheimer's disease diagnosis has been described. It leverages the functionalities provided by Grid computing and data services to gain access to a distributed database of normal images. Two implementations, based on the AliEn and LCG middleware respectively, were developed, deployed and tested. Both provide the required functionalities; a detailed description of the features of the two approaches is given in the previous sections. Both implementations provide methods and tools for accessing remote distributed data while satisfying security requirements, for extracting information from those data without moving files
on the net, for managing the related metadata, for building and maintaining catalogues, and for submitting jobs to the Grid together with all the needed parameters. One of the clearest advantages of AliEn in a medical environment is its ease of installation and maintenance. On the other hand, its wider base of users and applications (as well as being the outcome of a larger project) makes LCG a reliable middleware providing a large set of resources in a production-grade environment, with complete documentation and effective technical support. As a next step, access to the application through a Grid-based portal will be provided. A more detailed comparison of computational performance is also planned.
Acknowledgments

The authors gratefully acknowledge the support provided by the MAGIC-5 collaboration, funded by INFN, and by the Grid.it project, funded by MIUR. Thanks to the GILDA Team at INFN Catania for their invaluable support.
References
[1] K. Herholz et al., "Discrimination between Alzheimer Dementia and Controls by Automated Analysis of Multicenter FDG PET," NeuroImage 17 (2002) 302–316.
[2] K.J. Friston, "Statistical Parametric Mapping and Other Analyses of Functional Imaging Data," in Brain Mapping: The Methods, pages 363–385. Academic Press, 1996.
[3] S. Scaglione et al., "Neuroinformatics portal as knowledge repository and e-service for neuroapplication and data mining," proceedings of Medicon 2004, Napoli, August 2004.
[4] Zope Corporation, Inc. http://www.zope.org/
[5] J. Basney, M. Humphrey, and V. Welch, "The MyProxy Online Credential Repository," Software: Practice and Experience, 35 (2005) 801–816.
[6] A. Delgado Peris et al., "LCG-2 User Guide," EGEE EC Project, see http://egee.itep.ru/User_Guide.html
[7] http://public.eu-egee.org/; http://glite.web.cern.ch/glite/
[8] P. Saiz et al., "AliEn - ALICE Environment on the Grid," Nucl. Instrum. Meth. A502 (2003) 437–440.
[9] The HealthGrid Association, HealthGrid White Paper, see http://whitepaper.healthgrid.org (2004).
[10] P. Cerello et al., "GPCALMA: a Grid-based Tool for Mammographic Screening," Methods Inf. Med. 44 (2005) 244–248.
[11] A. Hanushevsky, "The Next Generation Root File Server," proceedings of CHEP04, Interlaken, September 2004.
[12] R. Brun and F. Rademakers, "ROOT - An Object Oriented Data Analysis Framework," proceedings of the AIHENP'96 Workshop, Lausanne, September 1996; Nucl. Inst. Meth. A389 (1997) 81–86. See also: http://root.cern.ch
[13] M. Ballintijn et al., "The PROOF Distributed Parallel Analysis Framework based on ROOT," proceedings of CHEP03, La Jolla, March 2003.
[14] http://aliceinfo.cern.ch/
[15] https://gilda.ct.infn.it
[16] N. Santos and B. Koblitz, "Metadata services on the grid," proceedings of ACAT'05, Zeuthen, Berlin, May 2005.
[17] See http://grid-deployment.web.cern.ch/grid-deployment/gis/GFAL/gfal.3.html
Challenges and Opportunities of HealthGrids
V. Hernández et al. (Eds.)
IOS Press, 2006
© 2006 The authors. All rights reserved.
Using the Grid to Analyze the Pharmacokinetic Modelling after Contrast Administration in Dynamic MRI

Ignacio Blanquer a, Vicente Hernández a, Daniel Monleón b, José Carbonell a, David Moratal a, Bernardo Celda b, Montse Robles a, Luis Martí-Bonmatí c

a Universidad Politécnica de Valencia, Valencia, Spain
b Departamento de Química Física, Universitat de València, Valencia, Spain
c Servicio de Radiología, Hospital Universitario Dr. Peset, Valencia, Spain
Abstract. The analysis of angiogenesis in hepatic lesions is an important marker of tumour aggressiveness and response to therapy. However, its quantitative analysis requires a deep knowledge of hepatic perfusion. The development of pharmacokinetic models constitutes a very valuable tool, but it is computationally intensive. Moreover, abdominal image processing increases the computational requirements, since the movement of the patient makes the images in a time series incomparable, requiring prior pre-processing. This work presents a Grid environment developed to deal with the computational demand of pharmacokinetic modelling. The article proposes and implements a four-level software architecture that provides a simple interface to the user and deals transparently with the complexity of the Grid environment. The four layers implemented are: the Grid Layer (the closest to the Grid infrastructure), the Gate-to-Grid Layer (which transforms the user requests into Grid operations), the Web Services Layer (which provides a simple, standard and ubiquitous interface to the user) and the Application Layer. An application has been developed on top of this architecture to manage the execution of multi-parametric groups of co-registration actions on a large set of medical images. The execution has been performed on the EGEE Grid infrastructure. The application is platform-independent and can be used from any computer without special requirements.
1. INTRODUCTION AND MOTIVATION

The liver is the largest organ of the abdomen and there are a large number of lesions affecting it. Both benign and malignant tumours arise within it, and the liver is also the target organ for the metastases of most solid tumours. Angiogenesis is quite an important marker of tumour aggressiveness and response to therapy. Moreover, chronic inflammatory change affects a large proportion of the population. The blood supply to the liver is derived jointly from the hepatic arteries and the portal venous system. Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) is extensively used for the detection of primary and metastatic hepatic tumours. However, the assessment of early stages of malignancy and of other diseases like cirrhosis requires the quantitative evaluation of the hepatic arterial supply. To achieve this goal, it is important to develop precise pharmacokinetic approaches to the analysis of hepatic perfusion. The influence of breathing, the large number of
pharmacokinetic parameters and the fast variations in contrast concentration in the first moments after contrast injection reduce the efficiency of traditional approaches. On the other hand, the traditional radiological analysis requires the acquisition of images covering the whole liver, which greatly reduces the time resolution of the pharmacokinetic curves. The combination of all these adverse factors makes the analytical study of liver DCE-MRI data very challenging.
2. STATE OF THE ART

The current use of the Internet as the main infrastructure for the integration of information through web-based protocols has opened the door to new possibilities. Web Services (WS) are one of the most consolidated technologies in web environments. They are based on the Web Services Description Language (WSDL), which defines the interface and constitutes a key part of Universal Description, Discovery and Integration (UDDI) [7]. WSs communicate through the Simple Object Access Protocol (SOAP) [8], a simple and decentralized mechanism for the exchange of typed information structured in XML (Extended Mark-up Language). As defined in [9], a Grid provides an abstraction of resources for sharing and collaborating across different administrative domains. These resources can be hardware, data, software and frameworks. The key concept of the Grid is the Virtual Organization (VO) [10], defined as a temporary or permanent set of entities or groups that provide or use resources. The usage of Grid computing is currently expanding. In this process of development, many basic middlewares have arisen, such as the different versions of the Globus Toolkit [10] (GT2, GT3, GT4), Unicore [15] or InnerGrid [16]. At present, Grid technologies are converging towards Web Services technologies. The Open Grid Services Architecture (OGSA) [11] represents an evolution in this direction. OGSA seems to be an adequate environment for obtaining efficient and interoperable Grid solutions, although some issues (such as security) still need to be improved. Globus GT3 implemented OGSI (Open Grid Service Infrastructure), the first implementation of OGSA. OGSI was deprecated and replaced in GT4 by the implementation of OGSA through the Web Services Resource Framework (WSRF) [18], which is totally based on WSs. Although there are newer versions, Globus GT2 is a well-established batch-oriented basic Grid platform, which has been extended in several projects in a different direction from that in which GT3 and GT4 have evolved. The DATAGRID project [12] developed the EDG (European Data Grid) middleware, based on GT2, which improved the support for distributed storage, VO management, job planning and job submission. The EDG middleware has been improved and extended in the LCG (Large Hadron Collider Computing Grid) [2] and AliEn projects to fulfil the requirements of the High Energy Physics community. Another evolution of the EDG is gLite [14], a Grid middleware based on WS and developed in the frame of the Enabling Grids for E-sciencE (EGEE) project [14]. gLite has extended the functionality and improved the performance of critical resources, such as security, integrating the Virtual Organisation Membership System (VOMS) [13] for the management of VOs. VOMS provides information on the user's relationship with the Virtual Organization, defining groups, roles and capabilities. These middlewares have been used to deploy Grid infrastructures comprising thousands of resources, increasing the complexity of using the Grid. However,
the maturity of these infrastructures in terms of user-friendliness is not yet sufficient. The configuration and maintenance of services and resources, and fault tolerance, are hard even for experienced users. Programming Grid applications usually involves a non-trivial degree of knowledge of the intrinsic structure of the Grid. This article presents a software architecture that abstracts users from the management of Grid environments by providing a set of simple services. Although the proposed architecture is open to different problems, this article shows its use in the implementation of an application for the co-registration of medical images. This application is oriented both to medical end-users and to researchers performing a large number of executions with different values of the parameters that control the process. Researchers can tune the algorithms by executing larger sets of runs, whereas medical users can obtain the results without requiring powerful computers. The application does not require a deep knowledge of Grid environments. It offers a high-level, user-friendly interface to upload data, submit jobs and download results, without requiring knowledge of the command syntax, the Job Description Language (JDL), data and resource administration, or security issues.

2.1. Pharmacokinetic Modelling

The pharmacokinetic modelling of the images obtained after the quick administration of a bolus of extra-cellular gadolinium chelate contrast can have a deep impact on the diagnosis and evaluation of different pathological entities. Pharmacokinetic models are designed to forecast the evolution of an endogenous or exogenous component in the tissues. To follow up the evolution of the contrast agent, a sequence of volumetric MRI images is obtained at different times following the injection of contrast. Each of these images comprises a series of image slices that cover the body part explored.
Figure 1: The MR pulse sequence includes 24 slices covering the whole liver (a). In (b), the first 9 images of the dynamic acquisition are depicted.
The pharmacokinetic model considers that the liver receives contrast through the hepatic artery and the portal vein. Each input flow is determined by a parameter: kai for the arterial flow and kpi for the portal vein. The total amount of contrast delivered to the liver at time t depends on the concentration of contrast in the hepatic artery (Ca(t)) and in the portal vein (Cp(t)), and on the values of those constants. The result is the concentration in the liver (Cl(t)). The liver also outputs contrast, at a rate defined by klo. The next figure shows a schema of this process and the equations that drive it.
Figure 2: Pharmacokinetic model
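Since the equations themselves appear only in Figure 2 (not reproduced here), the standard dual-input, single-compartment form consistent with the parameters named above is given for reference; it is an assumption that the exact form used in the paper matches it:

    \frac{dC_l(t)}{dt} = k_{ai}\, C_a(t) + k_{pi}\, C_p(t) - k_{lo}\, C_l(t)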
The known values in the model are the concentrations of contrast (obtained from the images), and the values to be obtained are the flow rates (kai, kpi, klo). The study of pharmacokinetic models for the analysis of hepatic tumours is an outstanding example of the above. However, since the whole acquisition process takes a few minutes, the images are obtained in different breath-hold periods. This movement of the patient produces artefacts that make the images directly incomparable. This effect is even more important in the area of the abdomen, which is strongly affected by breathing and by the motility of the organs. A prerequisite for the computation of the parameters that govern the model is therefore the reduction of the deformation of the organs in the obtained images. This can be performed by co-registering all the volumetric images with respect to the first one.

2.2. Co-registration

The co-registration of images consists of aligning the voxels of two or more images in the same geometrical space by applying the transformations necessary to make the floating images as similar as possible to the reference image. In general terms, the registration process can be rigid or deformable. Rigid registration only applies affine transformations (displacements, rotations, scaling) to the floating images. Deformable registration enables the use of elastic deformations on the floating images. Rigid registration introduces fewer artefacts, but it can only be used when dealing with body parts in which the level of internal deformation is low (e.g. the head). Deformable registration can introduce unrealistic artefacts, but it is the only approach that can compensate for the deformation of elastic organs (e.g. in the abdomen). Image registration can be applied in 2D (individually to each slice) or in 3D. Registration in 3D is necessary when the deformation happens along all three axes. The co-registration process implemented in this case is based on the ITK software library [1]. It comprises a first stage of rigid 3D registration of the Gaussian-filtered volume images (Mutual Information metric and Gradient Descent optimizer), followed by a 3D deformable registration (Mutual Information metric, Gradient Descent optimizer and B-Spline transform).
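The same two-stage scheme can be sketched with SimpleITK, the Python wrapping of ITK; the authors' pipeline is C++ ITK, and every parameter value below is illustrative rather than taken from the paper.

    # Two-stage registration sketch in SimpleITK (illustrative parameters).
    import SimpleITK as sitk

    fixed = sitk.ReadImage("volume_t0.mha", sitk.sitkFloat32)   # reference
    moving = sitk.ReadImage("volume_t1.mha", sitk.sitkFloat32)  # floating

    # Stage 1: rigid 3D registration of Gaussian-filtered volumes,
    # mutual information metric + gradient descent optimizer.
    fixed_s = sitk.SmoothingRecursiveGaussian(fixed, 2.0)
    moving_s = sitk.SmoothingRecursiveGaussian(moving, 2.0)
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetOptimizerAsGradientDescent(learningRate=1.0,
                                      numberOfIterations=200)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetInitialTransform(
        sitk.CenteredTransformInitializer(fixed_s, moving_s,
                                          sitk.Euler3DTransform()))
    rigid = reg.Execute(fixed_s, moving_s)

    # Stage 2: 3D deformable registration with a B-spline transform,
    # starting from the rigidly aligned volume.
    moving_rigid = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0)
    bspline = sitk.BSplineTransformInitializer(
        fixed, transformDomainMeshSize=[8, 8, 8])
    reg2 = sitk.ImageRegistrationMethod()
    reg2.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg2.SetOptimizerAsGradientDescent(learningRate=0.5,
                                       numberOfIterations=100)
    reg2.SetInterpolator(sitk.sitkLinear)
    reg2.SetInitialTransform(bspline, inPlace=True)
    deformable = reg2.Execute(fixed, moving_rigid)

    sitk.WriteImage(sitk.Resample(moving_rigid, fixed, deformable),
                    "registered_t1.mha")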
2.3. Post-processing

Although the co-registration of images is a computationally complex process that must be performed before the analysis of the images, it is not the only task that needs high-performance computing. Extracting the parameters that define the model and computing the transfer rates for each voxel in space will also require large computing resources. The implemented platform has been designed to handle this subsequent post-processing in the same way.
3. ARCHITECTURE

The basic Grid middleware used in this architecture is LCG, developed in the LHC Computing Grid project, which has good support for high-throughput executions. A four-layered architecture has been developed to abstract the operation of this middleware. The registration application has been implemented on top of this architecture. Medical data are prone to abuse and need careful treatment in terms of security and privacy. This is even more important when the data have to flow between different sites. It is crucial both to preserve the privacy of the patients and to ensure that people accessing the information are authorised to do so. The way in which this environment guarantees security is described in other publications [2].

3.1. Layers

As mentioned in previous sections, the development of this framework has been structured into four layers, thus providing a higher level of independence and abstraction from the specificities of the Grid and the resources. The following sections describe the technology and the implementation of each layer. Figure 3 shows the layers of the proposed architecture.
Figure 3: The proposed architecture (layers, from top to bottom: Application — EGEE Registration Launcher; Web-Services Middleware; Gate-to-Grid — WS container, FTP server and User Interface; EGEE Grid Infrastructure).
3.1.1. Grid Layer

The system developed in this work makes use of the computational and storage resources deployed in the EGEE infrastructure across a large number of computing centres distributed over different countries. EGEE currently uses LCG, although there are plans to migrate to gLite. This layer offers the "single computer" vision of the Grid through the storage catalogues and workload management services that tackle the problem of selecting the most suitable resource.
The Job Description Language (JDL) is the way in which jobs are described in LCG. A JDL file is a text file specifying the executable, the program parameters, the files involved in the processing and other additional requirements. A description of the four layers of the architecture is provided in the following subsections.

3.1.2. Gate-to-Grid Layer

The Gate-to-Grid Layer constitutes the meeting point between the Grid and the Web environment. In this layer there are WSs providing interaction with the Grid much as if the user were directly logged in on the UI (User Interface). The WSs are deployed in a Web container on the UI, which provides this mediation. The use of the Grid is performed through the UI by a set of scripts and programs developed to ease the task of launching executions and managing groups of jobs. The steps required to execute a new set of jobs on the Grid are the following:
1. A unique directory is created for each parametric execution. This directory has separate folders to store the received images to be co-registered, the JDL files generated and the output files retrieved from the jobs. It also includes several files with information about the jobs, such as job identifiers and parameters of the registration process.
2. The files to be registered are copied to a specific location in this directory.
3. For each combination of parameters and pair of volumes to be registered, a JDL file filled in with the appropriate values is generated (illustrated in the sketch after this list).
4. The files needed by the registration process are copied to the SE (Storage Element) and registered in the RC (Replica Catalogue) and the RLS (Replica Location Service).
5. The jobs are submitted to the Grid through an RB (Resource Broker) that selects the best available CE (Computing Element) according to a predefined criterion.
6. Finally, when a job is done and retrieved, folders and temporary files are removed from the UI. The files registered in the SE that are no longer needed are also deleted.
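As a concrete illustration of step 3 and of the JDL format described above, the following sketch generates one JDL file per parameter combination. The attribute names follow common LCG JDL conventions; the executable name, parameter names and their ranges are hypothetical.

```python
import itertools
from pathlib import Path

# Hypothetical registration parameters, each given as (initial, increment, final),
# matching the range convention used by the parameter-profile manager.
PARAM_RANGES = {"grid_spacing": (8.0, 4.0, 16.0), "learning_rate": (0.1, 0.1, 0.3)}

def expand(start, step, stop):
    values, v = [], start
    while v <= stop + 1e-9:
        values.append(round(v, 6))
        v += step
    return values

def write_jdls(exec_dir: Path, reference: str, floating: str) -> list[Path]:
    """Write one JDL file per combination of parameter values."""
    names = sorted(PARAM_RANGES)
    paths = []
    for i, combo in enumerate(
            itertools.product(*(expand(*PARAM_RANGES[n]) for n in names))):
        args = " ".join(f"--{n}={v}" for n, v in zip(names, combo))
        jdl = (f'Executable    = "register3d";\n'
               f'Arguments     = "{reference} {floating} {args}";\n'
               f'StdOutput     = "std.out";\n'
               f'StdError      = "std.err";\n'
               f'InputSandbox  = {{"register3d"}};\n'
               f'OutputSandbox = {{"std.out", "std.err", "registered.hdr"}};\n')
        path = exec_dir / f"job_{i:04d}.jdl"
        path.write_text(jdl)
        paths.append(path)
    return paths

jdls = write_jdls(Path("."), "reference.hdr", "floating.hdr")
print(f"{len(jdls)} JDL files generated")  # 3 x 3 = 9 combinations here
```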
The different Grid services are offered through the aforementioned scripts and programs. These programs work with the UI instructions in order to ease job and data management tasks. Access to these programs and scripts is available remotely through the WSs deployed on the UI. The copying of the input files from the user's computer to the UI is performed through FTP (File Transfer Protocol). The most important WSs offered on the UI are:

InitSession. This service is in charge of creating the proxy from the user's Grid certificates. The proxy is then used in the Grid environment as a credential of that user, providing single sign-on for access to all the resources.

GetNewPathExecution. As described before, jobs launched in a parametric execution (and not yet cleared) have their own group folder on the UI. This folder has to be unique for each group of jobs. This service obtains a unique name for each job group and creates the directory tree to manage that job execution. This directory stores the images, logs, JDLs and other information files.

Submit. The submit call starts an action that carries out the registration of the files from the UI to the SE, creates the JDLs according to the given registration parameters and the files stored in the specified directory of the UI, and finally submits the jobs to the Grid using the generated JDL files.

GetInformationJobs. This service gets information about the jobs belonging to the same execution group. The information retrieved by this call is an XML document with the job identifiers and the associated parameters.

CancelJob. This call cancels a single job (part of an execution group). The cancellation of a job implies the removal of the registered files on the Grid.

CancelGroup. This service cancels all the jobs launched to the Grid from a group of parametric executions. As in the case of CancelJob, the cancellation of the jobs implies the removal of their associated files from the SEs. Moreover, in this case the temporary directory created on the UI is also removed once all the jobs are cancelled.

GetJobStatus. This service reports the status of a job on the Grid, given the job identifier. The normal sequence of states of a job is: submitted, waiting, ready, scheduled, running, done and cleared. Other possible states are aborted and cancelled.

PrepareResults. This service is used to prepare the results of an execution before downloading them. When a job finishes, the resulting image, the standard output and the standard error files can be downloaded. For this purpose the PrepareResults service retrieves the results from the Grid and stores them on the UI.

The executable must exist on the UI system and it has to be statically compiled so that it can be executed without library dependency problems on any machine of the Grid. The registration implemented in this project is based on the Insight Segmentation and Registration Toolkit (ITK) [4], an Open Source software library for image registration and segmentation.

3.1.3. Middleware Web Services Layer

The Middleware Web Services Layer provides an abstraction of the use of the WSs. This abstraction has two purposes: on the one hand, to create a unique interface independent from the application layer and, on the other, to provide methods and simple data structures that ease the development of final applications. The development of a separate software library for access to the WSs will ease future extensions for other applications that share similar requirements. Moreover, it will enable the introduction of optimizations in this layer without necessarily affecting the applications developed on top of it. This layer also offers a set of calls based on the Globus FTP APIs to perform data transfers with the Gate-to-Grid Layer. More precisely, the abstraction of the WSs consists, first, in hiding the creation and management of the stubs necessary for communication with the published WSs. Second, this layer manages the data obtained from the WSs by means of simple structures closer to the application. For each of the WSs available in the Gate-to-Grid layer there is a method in this layer that takes the XML information given by the WS and returns it as basic types or structured objects that can be managed directly by the application layer.
3.1.4. Application Layer

This layer offers the graphical user interface used for user interaction. It makes use of the functions, objects and components offered by the middleware WS layer to perform any operation on the Grid. The developed tool offers the following features:
• Parameter profile management. The management of the parameters allows creating, modifying and removing configurations of parameters for the launching of multi-parametric registrations. These registration parameters are defined as a range of values, expressed by three values: initial value, increment and final value. The profiles can be loaded from a set of templates or directly filled in before submitting the co-registrations according to these parameters.
• Transfer of the volumetric images to be registered. The first step is to upload the images to be registered. The application provides the option to upload the reference image and the other images that will be registered. These files are automatically transferred to the UI to be managed by the Gate-to-Grid layer.
• Submission of the parametric jobs to the Grid. For each combination of the input parameters, the application submits a job belonging to the same job group. The user can assign a name to the job group to ease the identification of jobs in the monitoring window.
• Job monitoring. The application offers the option to keep track of the submitted jobs of each group.
• Obtaining the results. When a Grid execution has reached the state done, the user can retrieve the results generated by the job to the local machine. The results include the registered image and the standard output and standard error generated by the program launched to the Grid. The user can also download the results of a group of jobs automatically.
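Putting the layers together, a client-side session roughly follows the sequence below. This is a minimal sketch using a generic Python SOAP client in place of the paper's Java middleware library with generated stubs; the endpoint URL, call signatures and return shapes are assumptions — only the service names come from the text.

```python
import time
import zeep  # generic SOAP client standing in for the generated WS stubs

UI_WSDL = "https://ui.example.org:8443/gate2grid?wsdl"  # hypothetical endpoint
svc = zeep.Client(UI_WSDL).service

svc.InitSession()                    # create the Grid proxy (single sign-on)
path = svc.GetNewPathExecution()     # unique folder for this job group on the UI
# ... upload reference and floating volumes into `path` over FTP ...
svc.Submit(path)                     # register files in the SE, build JDLs, submit

# Poll each job until it leaves the active states
# (submitted, waiting, ready, scheduled, running).
for job_id in svc.GetInformationJobs(path):   # assumed to yield job identifiers
    while svc.GetJobStatus(job_id) not in ("done", "aborted", "cancelled"):
        time.sleep(60)

svc.PrepareResults(path)             # stage outputs on the UI for FTP download
```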
4. RESULTS

The first result of this work is the LCG Registration Launcher tool, developed on top of the architecture described in this article. Figure 4 shows two screenshots of the application: one showing the panel for uploading reference and floating volumes, and the other the panel for monitoring the launched jobs.
Figure 4: Screenshots of the LCG registration launcher application.
The results obtained can be considered in terms of performance and of scientific results. The results presented in this section relate to the images from a clinical trial with 20 patients obtained at the Hospital Dr. Peset for this work. Regarding performance, the time required to perform the registration of one volumetric image on a PIII at 866 MHz with 512 MB of RAM is approximately 1 hour and 27 minutes. Considering that the complete study involved 20 patients, the total cost would be 2331h 22m. Using a 20-processor computing farm, the complete process took 132h 50m. The computational cost using the Grid was 17h 35m, using the resources of the EGEE Grid — a speed-up of roughly 133 over a single PC and 7.6 over the farm. More than 200 computers were available but, since the system is shared with other users, several cases had to wait in local queues. Moreover, several jobs failed and needed to be rescheduled. If the same resources were used in a batch-processing approach, running the jobs manually on the computing farm, the computing time would be about 8% shorter. The overhead of Grids is due to the use of secure protocols, remote and distributed storage resources, and the scheduling overhead, which is of the order of minutes due to the monitoring policies, which are implemented in a polling fashion. Regarding the co-registration results obtained, Figure 5 shows a tiled composition of two images before the co-registration (a) and after the process (b). The figure clearly shows the improvement in the alignment of the voxels of the image. Clear differences are observed at the top of the abdomen and in the area of the ribs.
Figure 5: Images before co-registration (left) and after co-registration (right).
Finally, considering the final output of the whole perfusion-analysis process, Figure 6 shows a parametric image obtained as a function of the parameters that drive the model (voxel concentration versus arterial concentration). This image has been obtained by solving the overdetermined system of equations of the pharmacokinetic model described in Figure 2, using as input the concentrations of contrast in the different slices of the co-registered images. The pixel intensity values are given by the value of klo in each voxel.
Figure 6: Final result: parametric image.
5. CONCLUSIONS AND FURTHER WORK

The final application developed in this work offers an easy-to-use, high-level interface that allows the use of the LCG2-based EGEE Grid infrastructure for image co-registration by Grid-unaware users. With the tool described in this work, the user obtains large computational performance for the co-registration of radiological volumes and the evaluation of the parameters involved. The Grid is an enabling technology that brings to clinical practice processes that, owing to their computational requirements, were not feasible with a conventional approach. It also offers a high-throughput platform for medical research. The proposed architecture is adaptable to different platforms and enables the execution of different applications by changing the user interface. This work is a starting point for the realization of a middleware focused on the abstraction of the Grid, to ease the development of interfaces for the submission of complex jobs. The developed application is part of a larger project. As introduced in section 1, the co-registration application is the first step of the pharmacokinetic model identification. The next step will be the extraction of the pharmacokinetic model from a set of acquisitions; for this task, the co-registration tool developed in this work was a prerequisite. The middleware WS layer will then be enlarged to support new functionalities related to the extraction of the pharmacokinetic model. Finally, the application currently supports the Analyze format [6], although extension to other formats such as DICOM [5] is among the priorities being considered.
6. REFERENCES
[1] L. Ibañez, W. Schroeder, L. Ng, J. Cates, "The ITK Software Guide", second edition, 2005, http://www.itk.org, Jan 2006.
[2] I. Blanquer, V. Hernández, D. Segrelles, "A Framework Based on Web Services and Grid Technologies for Medical Image Registration", Lecture Notes in Computer Science, Biological and Medical Data Analysis: 6th International Symposium (ISBMDA), ISSN 0302-9743, vol. 3745, pp. 22–33, 2005.
[3] "LHC Computing Grid", http://lcg.web.cern.ch/LCG, Jan 2006.
[4] L. Ibañez, W. Schroeder, L. Ng, J. Cates, "The ITK Software Guide", Kitware Inc., ISBN 1-930934-10-6.
[5] National Electrical Manufacturers Association, "Digital Imaging and Communications in Medicine (DICOM)", 1300 N. 17th Street, Rosslyn, Virginia 22209, USA.
[6] Mayo Clinic, "Analyze 7.5 File Format".
[7] Scott Short, "Creación de Servicios Web XML para la Plataforma .NET", McGraw-Hill, ISBN 8448137027, 2002.
[8] "Universal Description, Discovery and Integration (UDDI)", http://www.uddi.org, Jan 2006.
[9] "Simple Object Access Protocol (SOAP)", http://www.w3c.org, Jan 2006.
[10] Expert Group Report, "Next Generation Grids", edited by the European Commission, 2004, http://www.cordis.lu/ist/grids/index.htm, Jan 2006.
[11] I. Foster and C. Kesselman, "The GRID: Blueprint for a New Computing Infrastructure", Morgan Kaufmann Publishers, Inc., 1998.
[12] I. Foster, C. Kesselman, J. Nick and S. Tuecke, "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration", The Globus Project, 2002, http://www.globus.org/research/papers/ogsa.pdf, Jan 2006.
[13] "The DATAGRID Project", http://www.eu-datagrid.org, Jan 2005.
[14] "Virtual Organization Membership Service (VOMS)", http://hep-project-grid-scg.web.cern.ch/hep-project-grid-scg/voms.html, Jan 2006.
[15] "gLite. Lightweight Middleware for Grid Computing", http://glite.web.cern.ch/glite, Jan 2006.
[16] Dietmar Erwin, "UNICORE Plus Final Report", 2003.
[17] "InnerGrid Users' Manual", edited by GridSystems, 2003.
[18] "Enabling Grids for E-sciencE", http://www.eu-egee.org, Jan 2006.
[19] "The Globus Alliance", http://www.globus.org, Jan 2006.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Medical image registration algorithms assessment: Bronze Standard application enactment on grids using the MOTEUR workflow engine
Tristan Glatard a, Johan Montagnat b, and Xavier Pennec c
a CNRS, I3S laboratory  b CNRS, I3S laboratory  c INRIA Sophia Antipolis
Abstract. Medical image registration is a pre-processing step needed for many medical image analysis procedures. A very large number of registration algorithms are available today, but their performance is often not known and very difficult to assess due to the lack of a gold standard. The Bronze Standard algorithm is a very data- and compute-intensive statistical approach for quantifying the accuracy of registration algorithms. In this paper, we describe the Bronze Standard application and we discuss the need for grids to tackle such computations on medical image databases. We demonstrate MOTEUR, a service-based workflow engine optimized for dealing with data-intensive applications. MOTEUR eases the enactment of the Bronze Standard and similar applications on the EGEE production grid infrastructure. It is a generic workflow engine, based on current standards and freely available, that can be used to instrument legacy application code at low cost.
1. The Bronze Standard application

Computerized medical image analysis is now a well-established area that provides assistance for diagnosis, modeling and pathology follow-up. With the growing inspection capabilities of imagers and the growth of medical data production, the need for large amounts of data storage and computing power increases. Grids have been identified as a suitable tool for dealing with medical data. Successful examples of grid application deployment for image database analysis, optimization of medical image algorithms, simulation, etc., have already been reported [7].

1.1. Medical image registration

Medical image registration algorithms play a key role in a very large number of medical image analysis procedures. Together with image segmentation algorithms, they are fundamental processing steps often needed prior to any subsequent
analysis. Image registration consists of searching for a 3D transformation between two images, so that the first one can be superimposed on the second in a common 3D frame. The transformation may be rigid (the composition of a translation and a rotation), expressing a 3D change of frame, or non-rigid, expressing local deformations of space. A rigid registration is useful for aligning similar data (such as images of the same patient acquired at different times) into a single frame. A non-rigid registration is useful for computing the deformation map between different data (such as data acquired from two different patients). In addition, the registration is said to be mono-modal when both images have been acquired using the same imaging modality (thus sharing some common signal characteristics), or multi-modal when the modalities differ (signal differences then have to be compensated for). The computational load of these algorithms varies greatly depending on the type of registration computed, the size of the images to process, and the algorithms themselves. In general, non-rigid, multi-modal algorithms are more costly than rigid, mono-modal algorithms. On typical 3D images and using up-to-date PCs, the computation time varies from a few minutes in the simplest cases to tens of hours for the most compute-intensive registrations.

1.2. Registration algorithms assessment

Given the very common use of registration algorithms and the different contexts of their application, a large number of new algorithms is developed by the research community: approximately a hundred new research papers are published on the subject each year. A difficult problem, as for many other medical image analysis procedures, is the assessment of these algorithms' robustness, accuracy and precision [4]. Indeed, there is no well-established gold standard against which to compare the algorithms' results. Different approaches have been proposed to solve this issue. It is possible to synthesize images by simulating the acquisition physics and to test the algorithm on the synthetic images produced [1]. However, realistic images are difficult to produce and hardly perfect enough for fine assessment of the algorithms. Phantoms (manufactured objects with properties close to human tissues for the imaging modality studied) can also be used to acquire test images; however, it is also very difficult to manufacture sufficiently realistic phantoms.

1.3. The Bronze Standard method

An alternative for assessing registration algorithms is a statistical approach called the Bronze Standard [9]. The goal is basically to compute the registration of a maximum of image pairs with a maximum number of registration algorithms, so as to obtain a largely over-determined system relating the geometry of all the images. This makes the application very compute- and data-intensive. Suppose that we have n images of the same organ of one patient and m registration algorithms. We have in fact only n − 1 free transformations to estimate that relate all these images, say $\bar{T}_{i,i+1}$. The transformation between images i and j is obtained using compositions such as $\bar{T}_{i,j} = \bar{T}_{i,i+1} \circ \bar{T}_{i+1,i+2} \circ \cdots \circ \bar{T}_{j-1,j}$ if $i < j$ (or the inverse of both terms if $i > j$). The free transformation parameters are computed by minimizing the prediction error on the observed registrations:
$$\min_{\bar{T}_{1,2},\,\bar{T}_{2,3},\,\ldots,\,\bar{T}_{n-1,n}} \; \sum_{i,j\in[1,n],\;k\in[1,m]} d\!\left(T^{k}_{i,j},\,\bar{T}_{i,j}\right)^{2} \qquad (1)$$
where $T^{k}_{i,j}$ is the transformation computed between images $i$ and $j$ by the $k$-th registration algorithm, and $d$ is a distance function between transformations, chosen as a robust variant of the left-invariant distance on rigid transformations [11]. The estimation $\bar{T}_{i,i+1}$ of the perfect registration $T_{i,i+1}$ is called the bronze standard because the result converges toward $T_{i,i+1}$ as the number of methods m and the number of images n become larger. Indeed, considering a given registration method, the variability due to the noise in the data decreases as the number of images n increases, and the computed registration converges toward the perfect registration up to the intrinsic bias (if any) introduced by the method. Now, using different registration procedures based on different methods, the intrinsic bias of each method also becomes a random variable, which is hopefully centered around zero and averaged out during the minimization procedure. The different biases of the methods are then integrated into the transformation variability. To fully reach this goal, it is important to use as many independent registration methods as possible. In this process, we do not only estimate the optimal transformations, but also the rotational and translational variance of the "transformation measurements", which are propagated through the criterion to give an estimate of the variance of the optimal transformations. These variances should be considered as a fixed effect (i.e. these parameters are common to all patients for a given image registration problem, unlike the transformations), so that they can be computed more faithfully by multiplying the number of patients. An important variant of the Bronze Standard is to relax the assumption of identical variances for all algorithms, and to unbias their estimation. This can be realized by using only m − 1 out of the m methods to determine the bronze standard registration, and using the obtained reference to determine the accuracy of the remaining method. In this paper, we consider m = 4 different registration algorithms in our implementation of the bronze standard method: (1) Baladin and (2) Yasmina are intensity-based; the former uses a block-matching strategy while the latter optimizes a similarity measure on the complete images using the Powell algorithm. (3) CrestMatch is a prediction-verification method and (4) PFRegister is based on the ICP (Iterative Closest Point) algorithm. Both CrestMatch and PFRegister register features (crest lines) extracted from the input images. These algorithms are further described in [9]. Figure 1 illustrates the application workflow. Each box in figure 1 represents an algorithm and arrows show computation dependencies.
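To make the leave-one-method-out variant concrete, here is a deliberately simplified Python sketch. It replaces rigid 3D transformations with 1-D translations — so composition is addition, the distance d is an absolute difference, and the least-squares solution of Eq. (1) for consecutive pairs reduces to a mean — and the data values are invented. The real implementation works on rigid transformations with a robust left-invariant distance.

```python
from statistics import mean

def bronze_standard(methods, reg, n):
    """Toy solution of Eq. (1) for 1-D translations and consecutive pairs:
    the least-squares estimate is the per-pair mean over methods."""
    return [mean(reg[k][(i, i + 1)] for k in methods) for i in range(n - 1)]

def assess(methods, reg, n):
    """Leave one method out, build the reference from the others,
    then measure how far the held-out method falls from that reference."""
    out = {}
    for held_out in methods:
        ref = bronze_standard([m for m in methods if m != held_out], reg, n)
        out[held_out] = mean(abs(reg[held_out][(i, i + 1)] - ref[i])
                             for i in range(n - 1))
    return out

# Invented registrations of 4 images (3 consecutive pairs) by 3 of the methods.
reg = {
    "Baladin":    {(0, 1): 1.02, (1, 2): 0.98, (2, 3): 1.01},
    "Yasmina":    {(0, 1): 0.99, (1, 2): 1.03, (2, 3): 0.97},
    "CrestMatch": {(0, 1): 1.10, (1, 2): 0.90, (2, 3): 1.05},
}
print(assess(list(reg), reg, 4))
```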
2. Enacting the application workflow on the EGEE production grid

Even though registration computations are usually tractable on simple PCs, the large number of input data and registration algorithms needed to compute the bronze standard makes this method very compute-intensive. A grid infrastructure can handle the load of the computations involved and help in managing the medical image database to be processed.
2.1. EGEE infrastructure

In order to evaluate the relevance of our prototype and to compare real executions to theoretically expected results, we ran experiments on the EGEE production grid infrastructure (Enabling Grids for E-sciencE, http://www.eu-egee.org). This platform is a pool of thousands of computers (standard PCs) and storage resources accessible through the LCG2 middleware (http://lcg.web.cern.ch/LCG/activities/middleware.html). The resources are assembled in computing centres, each of them running its internal batch scheduler. Jobs are submitted from a user interface to a central Resource Broker, which distributes them to the available resources. On such a grid infrastructure, the application parallelism can be exploited to optimize the execution time: several instances of each service are concurrently submitted to the grid and executed on different processors.

2.2. Application workflow

The Bronze Standard application is composed as a workflow of algorithms represented in figure 1. The two input image sources at the top correspond to the image sets on which the evaluation is to be performed. The upper box corresponds to an initialization needed by the registration algorithms. Then come the registration algorithms themselves and format conversion and result collection services. Finally, the bottom (gray) service is responsible for the evaluation of the accuracy of the registration algorithms, leading to the output values of the workflow. It computes means from the results of all the registration services considered but one, and evaluates the accuracy of the specified registration method. This service has to be synchronized: it must be enacted only once all data have been processed in the workflow. The six services with a triple contour are compute-intensive initialization and registration algorithms, while the other boxes represent more lightweight computation steps such as data format transformations.

2.3. Medical workflows

Similarly to the Bronze Standard application presented above, medical image analysis procedures are often not based on a single image processing algorithm but rather assembled from a set of basic tools dedicated to processing the data, modeling it, extracting quantitative information, and analyzing results. Provided that interoperable algorithms are packed in software components with a standardized interface enabling data exchanges, it is possible to build complex workflows representing such procedures for data analysis. High-level tools for expressing and handling the computation flow are therefore expected to ease the development of computerized medical experiments. When dealing with medical experiments, the user often needs to process datasets made of, e.g., hundreds of individual images. The workflow management is therefore data-driven, and the scheduler responsible for sharing the load of computations should take into account the input data sets as well as the workflow graph topology.
Figure 1. MOTEUR interface representation
3. MOTEUR workflow engine

We implemented an hoMe-made OpTimisEd scUfl enactoR (MOTEUR) prototype to manage application workflows. MOTEUR is written in Java in order to be platform independent. It is available under the CeCILL Public License (a GPL-compatible open source license) at http://www.i3s.unice.fr/~glatard. The workflow description language adopted is the Simple Concept Unified Flow Language (Scufl) used by the Taverna workbench [10]. Figure 1 shows the MOTEUR web interface representing a workflow being executed. Each service is represented by a colored box and data links are represented by curves. The services are color-coded according to their current status: gray services have never been executed; green services are running; blue services have finished the execution of all available input data; and yellow services are not currently running but waiting for input data to become available. MOTEUR is interfaced to the job submission interfaces of both the EGEE infrastructure and the Grid5000 experimental grid (http://www.grid5000.org). In addition, lightweight job executions can be orchestrated on local resources. MOTEUR is able to submit different computing tasks to different infrastructures during a single workflow execution.

3.1. Service-based approach

To handle user processing requests, two main strategies have been proposed and implemented in grid middlewares:
1. In the task-based strategy, also referred to as global computing, users define computing tasks to be executed. Any executable code may be requested by specifying the executable code file, input data files, and command line parameters to invoke the execution. The task-based strategy, implemented for instance in the GLOBUS [3], LCG2 or gLite (http://www.glite.org) middlewares, has been used for decades in batch computing. It makes the use of non grid-specific code very simple, provided that the user knows the exact syntax to invoke each computing task.
2. The service-based strategy, also referred to as meta computing, consists of wrapping application codes into standard interfaces. Such services are seen as black boxes by the middleware, for which only the invocation interface is known. The services paradigm has been widely adopted by middleware developers for the high level of flexibility it offers. However, this approach is less common for application code, as it requires all codes to be instrumented with the common service interface.
The service-based approach is naturally very well suited for chaining the execution of different algorithms assembled to build an application. Indeed, the interface to each application component is clearly defined and the middleware can invoke each of them through a single protocol. In addition, the service-based approach offers large flexibility for managing applications requiring the processing of complete image databases, such as the Bronze Standard described above. The input data are treated as input parameters, and the service appears to the end user as a black box hiding the code invocation. When a service deals with two or more input data sets, the semantics of the service with regard to data composition needs to be specified. MOTEUR implements two data composition patterns (illustrated in the sketch after this list):
• The one-to-one composition: each input of the first data set $\{A_i\}_{i\in[1,m]}$ is processed with the corresponding input of the second data set $\{B_j\}_{j\in[1,n]}$, thus producing min(m, n) output data.
• The all-to-all composition: all inputs of $\{A_i\}_{i\in[1,m]}$ are processed with all inputs of $\{B_j\}_{j\in[1,n]}$, thus producing m × n output data.
The use of these two composition strategies, embedded in the Scufl language, significantly enlarges the expressiveness of the workflow language. It is a powerful tool for expressing complex data-intensive processing applications in a very compact format. MOTEUR implements an interface to both Web Services [13] and GridRPC [8] application services. We developed an XML-based language to describe input data sets. This language aims at providing a file format to save and store the input data set, in order to be able to re-execute workflows on the same data.
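The two composition patterns can be made concrete with a few lines of Python; the `register` stand-in and the file names are invented, and the real engine invokes remote services rather than local functions.

```python
from itertools import product

def one_to_one(service, a, b):
    """Pairwise composition: min(len(a), len(b)) invocations."""
    return [service(x, y) for x, y in zip(a, b)]

def all_to_all(service, a, b):
    """Cross-product composition: len(a) * len(b) invocations."""
    return [service(x, y) for x, y in product(a, b)]

# Invented stand-in for a registration service call.
register = lambda ref, flo: f"T({ref},{flo})"
refs = ["ref1.hdr", "ref2.hdr"]
flos = ["flo1.hdr", "flo2.hdr", "flo3.hdr"]
print(one_to_one(register, refs, flos))  # 2 outputs
print(all_to_all(register, refs, flos))  # 6 outputs
```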
3.2. Enabling legacy codes

In the service-based approach, all application codes need to be wrapped into a standard service envelope. This increases the code complexity on the application developer side and prevents the use of legacy code, which cannot necessarily be modified and recompiled for various reasons. To address this limitation, we have developed a legacy code application wrapping service similar to GEMLCA [5]. The idea is to propose a standard web service capable of submitting any legacy executable to the target grid infrastructure. This generic application service dynamically composes the executable invocation command line before submission. For this purpose, it needs a description of the executable's command line parameters. We have defined a simple XML-based parameter description format. For each legacy code to gridify, the user only needs to produce the corresponding XML document. The generic service takes as input both the executable and the description document. The generic application service is installed on the grid user interface and does not require any deployment on the grid computing resources. It submits jobs to the grid through the standard workload management system.

3.3. Optimizing the execution of data-intensive applications

Some workflow managers, such as the CONDOR DAGMan (http://www.cs.wisc.edu/condor/dagman), have adopted the task-based approach, coupling processings and the data to be processed. This static and complete description of the graph of tasks to be executed eases the optimization of the workflow execution, as it provides all the information necessary for mapping the workflow and data to available resources (see for instance the Pegasus system [2]). However, it deals poorly with large data sets, since a new task needs to be explicitly written for each input data item to be processed. In service-based workflow managers such as MOTEUR, Kepler [6], Taverna [10] or Triana [12], each processor invokes external services to which data is dynamically transmitted as parameters. However, service invocation adds an extra layer between the workflow manager and the execution grid infrastructure: the workflow manager has no direct access to the grid resources and therefore cannot directly optimize the scheduling of job submissions. Performance is critical in the case of data-intensive applications, and MOTEUR implements several optimization strategies to ensure optimal workflow execution by exploiting the massively parallel resources available on the grid infrastructure (a small sketch of the last one follows this list).

Workflow parallelism. The workflow encompasses an inherent degree of parallelism, as several independent services may be invoked in parallel, asynchronously, by the workflow engine.

Data parallelism. The computations described in the workflow can be performed independently for each input data segment. When dealing with large input data sets, this is a considerable potential optimization that consists of processing all these data in parallel on different grid resources. Although quite obvious, data parallelism is not straightforward to implement. Indeed, parallel execution over different data leads to losing the computation sequence (a data item can overtake another in the workflow) and to potential causality problems if the ordering is not re-established. MOTEUR's strategy to avoid this problem is to associate with each processed data segment a complete history tree of the former processings that unambiguously describes the data provenance. To deal with the all-to-all composition strategy, MOTEUR also keeps in memory all data segments sent to the input of each service. Thus, when a delayed data item arrives, it can be composed with all formerly identified input data by repeated invocations of the service.

Service parallelism. The computations of different services over different input data sets can overlap in time. Parallel computation of such tasks enables a pipelining optimization similar to the one exploited inside CPUs. Theoretically, service parallelism should not bring an extra level of parallelism when data parallelism is exploited: if all data could be processed in parallel in constant time, there would be no overlap of successive services. In practice though, execution times on a loaded production infrastructure are highly variable and unpredictable. The desynchronization of the computations creates the need for the service parallelism optimization.

Jobs grouping. Finally, sequential jobs may be grouped and executed together to lower the number of service invocations and to minimize the grid overhead resulting from job submission, scheduling and data transfers. Jobs grouping is not feasible in general on a service-based infrastructure, as services are completely independent and can only be invoked separately by the workflow engine. However, the internal logic of all services implemented through the generic wrapping service is known. The workflow engine is thus capable of translating the calls to two consecutive generic services into a call to a single service submitting a compound job with two consecutive executable command line invocations.
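A minimal sketch of the jobs-grouping idea: since the generic wrapping service exposes its command lines, two consecutive invocations can be chained into one compound grid job. Executable names and arguments are invented.

```python
def command_line(executable: str, args: list[str]) -> str:
    return " ".join([executable, *args])

def group(jobs: list[tuple[str, list[str]]]) -> str:
    """Chain sequential generic-service invocations into a single shell
    command, so the pipeline costs one grid submission instead of several."""
    return " && ".join(command_line(exe, args) for exe, args in jobs)

# Two consecutive steps of a (hypothetical) registration pipeline.
pipeline = [("crestlines", ["ref.hdr", "flo.hdr"]),
            ("crestmatch", ["ref.crest", "flo.crest"])]
print(group(pipeline))
# -> crestlines ref.hdr flo.hdr && crestmatch ref.crest flo.crest
```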
To our knowledge, MOTEUR is the first service-based workflow manager implementing all these levels of parallelism.

4. Results and conclusions

MOTEUR was evaluated on the Bronze Standard application in a realistic experimental setting. We executed our workflow on different input data sets of various sizes. Input image pairs are taken from a database of injected T1 brain MRIs from the cancer treatment centre "Centre Antoine Lacassagne" in Nice, France, courtesy of Dr Pierre-Yves Bondiau. All images are 256×256×60 and coded on 16 bits, thus leading to a size of 7.8 MB per image. Each input image pair was registered with the 4 algorithms, leading to 6 grid job submissions (the triple-contour services in figure 1). The 4 rigid registration algorithms reached a sub-voxel accuracy of 0.15 degree in rotation and 0.4 mm in translation for the registration of these images.

4.1. MOTEUR performances

The first experiment, reported in figure 2, is a comparison of MOTEUR's performance against the Taverna workflow manager [10]. Taverna is a service-based workflow manager targeting bioinformatics applications, developed in the UK eScience MyGrid project; it has become a reference workflow manager in the eScience community.
Figure 2. Execution times of MOTEUR vs Taverna on the EGEE production infrastructure [plot: execution time (s) against the number of input image pairs, for Taverna-EGEE and MOTEUR-EGEE].
The figure displays the execution times obtained with Taverna and MOTEUR with respect to the number of input data sets. It shows that MOTEUR introduces an average speed-up of 2.03. Even more interestingly, this speed-up grows with the number of input data sets to process. The performance gain is due to the full exploitation of data and service parallelism: Taverna does not provide service parallelism, and its data parallelism is limited to a fixed number of parallel invocations. The second experiment, reported in figure 3, quantifies the performance gain introduced by the different levels of optimization implemented in MOTEUR. We executed the Bronze Standard workflow on 3 different input data sets composed of 12, 66 and 126 image pairs, corresponding to images from 1, 7 and 25 patients respectively. In total, the workflow executions resulted in 6 times more job submissions (72, 396 and 756 jobs respectively). We computed the Bronze Standard with different optimization configurations in order to identify the specific gain provided by each optimization. The reference curve (plain curve, labeled NOP) corresponds to a naive execution where only workflow parallelism is activated. The Job Grouping optimization (JG curve) reduces the job submission overhead as expected; the time gain is almost constant, independent of the number of input data. Unsurprisingly, the most drastic optimization for this data-intensive application is Data Parallelism (DP curve). The speed-up grows with the number of images to be processed (the DP curve slope is lower than the reference curve slope). Theoretically, the DP curve should be horizontal (no overhead introduced by the increasing number of data), given that the number of grid processing units exceeds the number of jobs submitted. However, the EGEE grid is exploited in production mode
Figure 3. Comparison of the execution times obtained for different optimization configurations [plot: execution time (hours) against the number of input image pairs, for the NOP, JG, DP, SP+DP and SP+DP+JG configurations].
(24/7 workload) by a large multi-user community. Therefore, the Service Parallelism optimization (DP+SP curve) further improves performance. Finally, combining all these optimizations (SP+DP+JG curve) provides the best result. The final speed-up is higher than 9.1 for the largest-scale experiment.

4.2. Conclusions

Data-intensive applications are common in the medical image analysis community, and there is an increasing need for computing infrastructures capable of efficiently processing large image databases. The Bronze Standard application is a concrete example of registration algorithm assessment with an important impact on medical image analysis procedures. The application is assembled from a set of legacy code components, wrapped into a generic web service and enacted on the EGEE grid through the MOTEUR workflow enactor. We demonstrated MOTEUR's capabilities and performance. This workflow engine conforms to the Scufl workflow description language. It implements interfaces to Web and GridRPC services. MOTEUR has been interfaced to the EGEE production grid infrastructure and the Grid5000 experimental infrastructure. The workflow execution is optimized using different parallelization strategies enabling the exploitation of the grid's parallel resources. MOTEUR is freely available for download under a GPL-like license.

Acknowledgments

This work is partly funded by the French research program "ACI-Masse de données" (http://acimd.labri.fr/), AGIR project (http://www.aci-agir.org/). We
are grateful to the EGEE European IST project (http://www.eu-egee.org) for providing the infrastructure used in the experiments presented.
References
[1] H. Benoit-Cattin, F. Bellet, J. Montagnat, and C. Odet. Magnetic Resonance Imaging (MRI) Simulation on a Grid Computing Architecture. In Biogrid'03, proceedings of the IEEE CCGrid03, Tokyo, Japan, May 2003.
[2] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, and G. Mehta et al. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1(1):9–23, 2003.
[3] Ian Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. In International Conference on Network and Parallel Computing (IFIP), volume 3779, pages 2–13. Springer-Verlag LNCS, 2005.
[4] P. Jannin, J.M. Fitzpatrick, D.J. Hawkes, X. Pennec, R. Shahidi, and M.W. Vannier. Validation of medical image processing in image-guided therapy. IEEE Trans. on Medical Imaging, 21(12):1445–1449, December 2002.
[5] Péter Kacsuk, Ariel Goyeneche, Thierry Delaitre, Tamás Kiss, Zoltán Farkas, and Tamás Boczko. High-Level Grid Application Environment to Use Legacy Codes as OGSA Grid Services. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID '04), pages 428–435, Washington, DC, USA, 2004. IEEE Computer Society.
[6] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, 2005.
[7] J. Montagnat, F. Bellet, H. Benoit-Cattin, V. Breton, L. Brunie, H. Duque, Y. Legré, I.E. Magnin, L. Maigne, S. Miguet, J.-M. Pierson, L. Seitz, and T. Tweed. Medical images simulation, storage, and processing on the European DataGrid testbed. Journal of Grid Computing, 2(4):387–400, December 2004.
[8] Hidemoto Nakada, Satoshi Matsuoka, K. Seymour, J. Dongarra, C. Lee, and Henri Casanova. A GridRPC Model and API for End-User Applications. Technical report, Global Grid Forum (GGF), July 2005.
[9] Stéphane Nicolau, Xavier Pennec, Luc Soler, and Nicholas Ayache. Evaluation of a New 3D/2D Registration Criterion for Liver Radio-Frequencies Guided by Augmented Reality. In International Symposium on Surgery Simulation and Soft Tissue Modeling (IS4TM'03), volume 2673 of LNCS, pages 270–283, Juan-les-Pins, 2003. INRIA Sophia Antipolis, Springer-Verlag.
[10] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat, and Peter Li. Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
[11] X. Pennec, R.G. Guttman, and J.-P. Thirion. Feature-Based Registration of Medical Images: Estimation and Validation of the Pose Accuracy. In Medical Image Computing and Computer-Assisted Intervention (MICCAI'98), volume 1496 of LNCS, pages 1107–1114, Cambridge, USA, October 1998. Springer.
[12] Ian Taylor, Ian Wang, Matthew Shields, and Shalil Majithia. Distributed computing with Triana on the Grid. Concurrency and Computation: Practice & Experience, 17(1–18), 2005.
[13] (W3C) World Wide Web Consortium. Web Services Description Language (WSDL) 1.1, March 2001.
Part II Ethical, Legal and Privacy Issues on HealthGrids
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
The Ban on Processing Medical Data in European Law: Consent and Alternative Solutions to Legitimate Processing of Medical Data in HealthGrid
Jean Herveg
Lecturer (maître de conférences) at FUNDP – Faculty of Law – D.E.S. D.G.T.I.C., Centre de Recherches Informatique et Droit; Attorney at the Brussels Bar
Abstract. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data bans the processing of medical data owing to their highly sensitive nature. Fortunately, the Directive provides that this ban does not apply in seven cases. The paper aims first to explain the reasons for this ban. Then it describes the conditions under which medical data may be processed under European law. The paper notably investigates the strengths and weaknesses of the data subject's consent as a basis of legitimacy for the processing of medical data. It also considers the six other alternatives that legitimate the processing of medical data.
Keywords: Processing of Medical Data – Legitimacy – European Law – HealthGrid
INTRODUCTION

1. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data [1] bans the processing of personal data concerning health (medical data) [2]. Naturally, this prohibition applies equally to the processing of medical data in HealthGrid. This petitio principii could have led to serious problems, notably for HealthGrid, had the Directive not provided that this ban does not apply in several cases [3]. Before considering these exceptions, it seems relevant to recall the reason for this ban, particularly since the latter apparently opposes the free movement of personal data [4].
1. THE BAN ON PROCESSING MEDICAL DATA

2. The regulation of the processing of personal data is based upon two main ideas. The first idea is that economic, social, cultural and individual activities, with no public or private distinction, require to various extents the processing of information
relative to natural persons. The second idea, intimately bound to the first, is that natural persons must be protected against any infringement of their fundamental rights and freedoms that might arise from the processing of information relating to them. In other words, the processing of personal data is frequently needed for multiple good reasons; but, at the same time, it carries the danger of exposing natural persons to grave risks of discrimination or infringement of their fundamental rights and freedoms. With this in view, the processing of personal data must comply with several rules expressing the balance between all the interests at stake. In this context, Directive 95/46/EC aims to ensure the protection of the fundamental rights and freedoms of natural persons (data subjects), and in particular their right to privacy with respect to the processing of personal data [5]. This protection requires regulating the processing of personal data in order to prevent any infringement of the fundamental rights and freedoms of the data subject. To be effective and coherent, this regulation has to be built on an analysis of the risks capable of affecting the fundamental rights and freedoms of the data subject. It is only possible to determine the conditions under which personal data can be processed in full respect of the fundamental rights and freedoms of data subjects once these risks have been identified. This risk assessment is particularly important since recent evolutions in Information and Communication Technologies have multiplied the possibilities for processing personal data and therefore increased the risks of infringement of the fundamental rights and freedoms of the data subject. The use of a new technology such as HealthGrid should naturally prompt the assessment of the new risks attached to its implementation, especially in healthcare, regarding the protection of medical data.

3. The general principle is that the risk of infringement of the rights and freedoms of the data subject does not depend on the information content, but on the purpose of the processing of personal data. In other words, the potential or actual danger for the fundamental rights and freedoms of the data subject has to be assessed with regard to the purpose of the processing. But the principle is slightly – though not entirely – different for sensitive data [6]. It is commonly admitted that the mere content of these data already exposes the data subject to the risk of infringement of his or her fundamental rights and freedoms, whatever the purpose of the data processing may be. Put differently, any use of sensitive data is liable to create grave risks of discrimination for the data subject. Therefore, sensitive data require a special protection taking into account both their content and the purpose of their processing. To this end, the Directive has decided that "data which are capable by their nature of infringing fundamental freedoms or privacy should not be processed (…)" [7]. The ban on processing medical data is the special protection provided by the Directive to ensure the respect of the fundamental rights and freedoms of the data subject regarding the processing of his or her medical data. Hence the ban on processing medical data should not be seen as opposed to the free movement of personal data; it is more a limit than an exception to it. In fact, the free movement of personal
data can only be conceived in full respect of the fundamental rights and freedoms of the data subject, and this respect includes the ban on processing medical data.
2. EXCEPTIONS TO THE BAN ON PROCESSING MEDICAL DATA

4. Nevertheless, the Directive grants permission to process medical data in seven hypotheses. In these cases, the legitimacy of the processing of medical data (the balance between the interests at stake [8]) is formally presumed (cf. infra on the necessity to actually assess its legitimacy). This stems from the fact that, in principle, the situations described in these hypotheses should justify the processing of medical data, without prejudice to the other conditions ensuring the lawfulness of the data processing. These exceptions to the ban on processing medical data must be interpreted restrictively: the processing of medical data is strictly forbidden beyond these exceptions. The first hypothesis granting permission to process medical data is the consent of the data subject, which is frequently presented as the natural basis for the legitimacy of the processing of medical data.

2.1. The consent of the data subject

5. According to the Directive, the ban on processing medical data does not apply where the data subject has given his or her explicit consent to the processing of his or her medical data [9]. In this case, the Directive entrusts the data subject with the power to authorize the processing of his or her medical data [7]. This empowerment of the data subject represents without any doubt a very strong expression of his or her informational self-determination – the power of the data subject over his or her personal data [10]. But this empowerment may also be surprising. Is the data subject always capable of deciding in a reasonable way about the processing of his or her medical data? Is it not too dangerous to give the data subject such power when, most of the time, he or she is the "weakest" party, or at least the "demanding" person, in the processing of his or her medical data? For example, how could a patient oppose the processing of his or her medical data for scientific purposes (e.g. for a clinical trial) before surgery or any other investigation? How can the validity of the data subject's consent be ensured and a complete masquerade avoided? This empowerment of the data subject should not be seen as unlimited or beyond control. In fact, when given this power, the data subject has to evaluate the interest(s) that could justify the processing of his or her medical data. With this end in view, the data subject has to weigh the interests at stake correctly and act accordingly. Otherwise the consent will not be able to legitimate the processing of his or her medical data (see infra about the actual control of the legitimacy of the processing of medical data and the determination of the interests at stake). The Directive confirms this analysis.
6. Under the Directive, the data subject's consent means "any freely given specific and informed indication of his wishes by which the data subject signifies his agreement to personal data relating to him being processed" [11]. First, the consent has to be unambiguous – indubitable, indisputable, beyond any doubt. Then, the consent of the data subject must have been freely given: it has to be free of any vice, constraint or pressure. In this regard, any direct benefit (such as the benefit for his or her health) or indirect benefit (such as participation in the progress of medical science) for the patient should not automatically affect the validity of the data subject's consent. Would the financial retribution of the data subject (beyond covering his or her possible expenses) invalidate his or her consent? Again, the answer to this question should not be absolute: it should depend upon the circumstances of each case and on how the applicable law deals with the protection of the data subject. Moreover, the consent of the data subject has to be specific and informed. Specificity requires that the data subject know exactly what he or she consents to, which necessarily implies the prior and adequate information of the data subject concerning the processing of his or her medical data. Without this prior and adequate information, the consent of the data subject will not be specific, and therefore could in no case ground the processing of his or her medical data. In this view, the next question is logically the determination of the level of detail of the information provided to the data subject. Articles 10 and 11 of the Directive determine the minimum content of this information. The latter must permit the complete enforcement of all aspects of the data processing – such as the data quality, the data subject's rights, the security and confidentiality measures, the notification to the supervisory authority, etc. However, there is no doubt that the information has to be more accurate and complete particularly when very sensitive data such as medical data are processed. In any case, the data subject may not give an unspecified or uninformed consent to the processing of his or her medical data. Further processing of medical data is prohibited when incompatible with the initial purpose for which the data have been collected. The consent does not necessarily have to be given at the time of the data collection; it only has to be obtained prior to the processing.

7. The consent of the data subject must be explicit to allow the processing of his or her medical data [12]. A contrario, the requirement of an explicit consent should exclude any implicit consent – whatever that last notion might mean. With respect to this, beyond the indisputable character of the data subject's consent, its explicit character presumes that it has been expressed. Several Member States have decided to transpose this requirement by asking for a written consent from the data subject. However, explicit consent could also be deduced from some other behaviour of the data subject, especially regarding the circumstances of the case. Indeed, some positive actions could express the explicit consent of the data subject to the processing of his or her
medical data, such as participation in a foundation fighting the disease affecting the data subject, or the request to be treated in a special medical unit known to be a research unit.

8. In all these circumstances the consent of the data subject creates a presumption of legitimacy of the processing of his or her medical data: it is assumed that the data subject has correctly assessed the interests in presence and acted accordingly. If the data subject has not correctly assessed the interests in presence and the interests in presence are not respected, his or her consent will not legitimate the processing of his or her medical data. The latter will not be legitimate on this ground. In other words, the consent of the data subject does not exonerate the data controller from pursuing a legitimate purpose (requiring the balance between the interests in presence), and the consent of the data subject may not cover an illegitimate interest or the lack of interest in the data processing.

9. The Directive provides that Member States may rule out the possibility for the sole consent of the data subject to lift the prohibition on processing medical data [13].

10. In any case the data subject may always revoke his or her consent to the processing of his or her medical data. What are the consequences of this revocation? Does it mean that, in the future, new operations upon the data subject's medical data will no longer be possible (without any effect on the existing data processing), or must we consider that the operations carried out upon the medical data on the ground of the initial consent of the data subject may not be pursued? Since the data subject has revoked his or her initial consent, there is no longer a legitimate basis for the processing of the medical data. The operations may not be pursued. That does not mean that the past operations carried out upon the medical data of the data subject are now unlawful. It simply means that they cannot be continued except on the ground of another basis of legitimacy.

11. Finally, the Directive gives no formal indication on the nature of the consent given by the data subject or on the possible contractual relationship between the data controller and the data subject. In our view, the solution to these questions depends on how the applicable law deals with the relationship between the data controller and the data subject, and with the relationship between the data subject and his or her personal data. In any case the possible contract should obey the special rules imposed through the transposition of the Directive in the applicable law, such as the characteristics of the data subject's consent, the data quality, the data subject's rights, the security and confidentiality measures, the notification to the supervisory authority, etc. The applicable law also determines the capacity of minors or disabled persons to consent.

In view of the above, it is not certain that the consent of the data subject represents the best solution to ground the legitimacy of the processing of medical data in HealthGrids. Fortunately, the Directive provides alternative solutions to legitimate the processing of medical data.
2.2. Carrying out obligations and specific rights of the data controller in the field of employment law

12. The ban on processing medical data does not apply where the "processing is necessary for the purposes of carrying out the obligations and specific rights of the controller in the field of employment law in so far as it is authorized by national law providing for adequate safeguards" [14]. With respect to this, the purpose of the data processing is only to allow the data controller to fulfil his obligations and rights in the matter of Employment Law, the latter having to be specific. This hypothesis seems to cover Medical Inspection. Further, the processing of medical data has to be necessary, and not merely useful, to this purpose. The data controller therefore has to prove the necessity of processing medical data to carry out his obligations and specific rights in the field of Employment Law. Finally, this kind of processing has to be authorized by the applicable law providing for adequate safeguards, the latter not being further determined.

2.3. Vital interests

13. The third hypothesis allowing for the processing of medical data is where "processing is necessary to protect the vital interests of the data subject or of another person where the data subject is physically or legally incapable of giving his consent" [15]. The notion of "vital interest" refers expressly and exclusively to a situation of imminent danger to the life of a natural person. This covers the protection of the vital interests of the data subject, but also of any other natural person. However, in this last situation the Directive adds that the data subject must be physically or legally incapable of consenting to the processing of his or her medical data. It cannot be deduced from this provision that the data subject, physically or legally capable of consenting, could, without any consequence, refuse to authorize the processing of his or her medical data when the vital interests of another person are at stake. This behaviour should be qualified under the applicable law.

2.4. Non-profit organisations

14. The processing of medical data could be legitimate where the "processing is carried out in the course of its legitimate activities with appropriate guarantees by a foundation, association or any other non-profit-seeking body with a political, philosophical, religious or trade-union aim and on condition that the processing relates solely to the members of the body or to persons who have regular contact with it in connection with its purposes and that the data are not disclosed to a third party without the consent of the data subjects" [16]. With respect to this, the organization must have a non-profit purpose, and the latter has to relate to the exercise of fundamental rights and freedoms [7].
2.5. Data manifestly made public and establishment, exercise or defence of legal claims

15. The ban on processing medical data does not apply where "the processing relates to data which are manifestly made public by the data subject or is necessary for the establishment, exercise or defence of legal claims" [17]. It should be recalled that, even if the data have been manifestly made public by the data subject, the processing of his or her sensitive personal data nevertheless falls under the scope of the Directive. Hence the data controller must comply with all the other conditions ensuring the lawfulness of the data processing.

2.6. Healthcare purpose

16. The ban on processing medical data does not apply "where processing of the data is required for the purposes of preventive medicine, medical diagnosis, the provision of care or treatment or the management of health-care services, and where those data are processed by a health professional subject under national law or rules established by national competent bodies to the obligation of professional secrecy or by another person also subject to an equivalent obligation of secrecy" [18]. The healthcare purpose should be interpreted broadly [19], including the management of healthcare services. The latter should include secondary purposes necessary to provide healthcare, such as medical secretaries, computer departments, etc. By contrast, this hypothesis does not include Social Security purposes or Public Health purposes (cf. infra 2.7). Medical data must be processed by a health professional, but this last notion has not been further defined. The health professional has to be subject, under national law or rules established by national competent bodies, to professional secrecy. When not processed by a health professional, the processing may be carried out by another person if he or she is subject to an equivalent obligation of secrecy, notably due to his or her status or by way of a contractual stipulation or term. It is quite remarkable that the patient's consent is not required to legitimate the processing of medical data. Might there be confusion with the consent to the provision of healthcare?

2.7. Reasons of substantial public interest

17. The Directive grants Member States permission to lay down additional exemptions for reasons of substantial public interest [20]. Hence the Member State has to prove in each case the real existence of the substantial public interest(s) considered. The Directive essentially had in mind substantial public interests relative to Public Health and Social Security, "especially in order to ensure the quality and cost-effectiveness of the procedures used for settling claims for benefits and services in the health insurance system (…)" [21]. It also had in mind scientific research and public statistics [21].
The cases where medical data may be processed must be laid down by national law or by decision of the supervisory authority. But Member States may only allow for the processing of medical data if these exceptions are subject to the provision of suitable safeguards to protect the fundamental rights and freedoms of the data subjects, and especially their right to respect for private life [21]. The Directive does not determine these safeguards. Member States must notify the European Commission of the exemptions to the ban on processing medical data adopted on this basis [22]. Member States must also determine the conditions under which a national identification number or any other identifier of general application may be processed [23].
3. REAL ASSESSMENT OF THE LEGITIMACY OF THE PROCESSING OF MEDICAL DATA

18. The legitimacy of the processing of medical data is not complete when the processing merely fits formally into one of these exceptions to the ban on processing medical data, even with the consent of the data subject. Indeed, these exceptions are only hypotheses in which the legitimacy of the data processing is formally assumed. The legitimacy of the processing of medical data – the balance of the interests in presence – still has to be really assessed. First, the interests in presence have to be identified. Are they only the interests of the data controller and of the data subject, or should we also consider the interests of concerned third parties and of society as a whole? In our view these two last categories of interests should be taken into account when evaluating the legitimacy of the processing of medical data. Then the explicit and valid consent of the data subject presumes, until proof to the contrary, the existence of an acceptable balance between the interests in presence in the processing of his or her medical data. However, in this case, it is quite difficult to assume that the data subject has adequately taken into account interests other than his or her own. In any case the processing of medical data will not be legitimate if the balance between the interests in presence is not respected, even with the valid consent of the data subject.

19. But the legitimacy of the processing of medical data is definitely and very usefully strengthened by the additional consent of the data subject. That is the reason why we must firmly approve and recommend the ethical practice of obtaining the consent of the data subject when processing medical data. This practice is frequent in the conduct of clinical trials and in telematic networks in healthcare.

20. Finally, it has to be stressed that the data controller may not legitimate the processing of medical data on other bases. That necessarily excludes the use of the hypotheses of formal legitimacy enumerated in article 7 of the Directive for non-sensitive personal data. For example, the data controller may not legitimate the
processing of medical data by the balance of the interests in presence without respecting the hypotheses enumerated in article 8.
CONCLUSIONS

21. The protection of medical data implies fixing the rules applicable to the processing of medical data, and hence determining its conditions. Given their highly sensitive nature, medical data require special protection taking into account their content and the purpose of their processing. Therefore Directive 95/46/EC prohibits the processing of medical data. However, the Directive provides that this ban does not apply in several cases. In these cases the legitimacy of the processing of medical data is formally assumed, without prejudice to the other conditions ensuring the lawfulness of the data processing. These exceptions to the ban on processing medical data have to be interpreted restrictively.

The explicit and valid consent of the data subject constitutes the very first source of legitimacy for the processing of his or her medical data, even if, at the same time, it is the weakest basis on which to legitimate the processing of medical data, due to the strict conditions for its validity and to the possibility for the data subject to revoke his or her consent at any time and without justification (but with reasonable notice in some cases?). Nevertheless, even where the data controller may legitimate the processing of medical data, and even with the consent of the data subject, the legitimacy of the data processing must be really assessed in each case by the balance of the interests in presence. These include the interests of the data subject, of the data controller, of concerned third parties and of society. In any case the consent of the data subject does not cover the lack of legitimacy or the illegitimacy of the processing of his or her medical data; it only creates a presumption of legitimacy of the processing of medical data until proof of the contrary. Finally, we must very strongly approve and recommend the ethical practice of requiring the consent of the data subject when processing medical data, even where the latter might rely on another basis of legitimacy.
Endnotes

[1] On the Directive: Y. Poullet, M.-H. Boulanger, C. de Terwangne, Th. Leonard, S. Louveaux et D. Moreau, La protection des données à caractère personnel en droit communautaire, Journal des Tribunaux de droit européen, Bruxelles, Ed. Larcier, 1997, p. 121 (in three parts).
[2] Directive 95/46/EC, art. 8.1. The notion of medical data includes all information relative to any aspect, physical or psychological, of the present, past or future health condition, good or bad, of a living or dead natural person. On the definition of medical data: Explanatory report of Convention n° 108, recital 45; Rec. (97) 5 of the Council of Europe relative to the protection of medical data, art. I of the annex; C.J.C.E., 6 Nov. 2003, Bodil Lindqvist, case C-101/01, obs. C. de Terwangne, « Affaire Lindqvist ou quand la Cour de justice des Communautés européennes prend position en matière de protection des données personnelles », R.D.T.I., 2004, pp. 67-99; Groupe européen d'éthique des sciences et des nouvelles technologies, avis n° 13 du 30 juillet 1999 sur les aspects éthiques de l'utilisation des données personnelles de santé dans la société de l'information.
[3] Directive 95/46/EC, art. 8.2.
[4] On the free movement of personal data: Directive 95/46/EC, art. 1.2, and recitals 3, 4, 5, 6, 7, 8 and 9.
[5] Directive 95/46/EC, art. 1.1.
[6] Usually, sensitive data are personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership and personal data concerning health or sex life.
[7] Directive 95/46/EC, recital 33.
[8] Cf. infra for the identification of these interests.
[9] Directive 95/46/EC, art. 8.2.a. The national law may provide that the data subject's consent may not lift the prohibition.
[10] On the notion of informational self-determination: Fr. Rigaux, La protection de la vie privée et des autres biens de la personnalité, Bruxelles, Paris, Bruylant, L.G.D.J., 1990, p. 588-589, n° 532: « (…) La juridiction constitutionnelle a déduit du droit de la personnalité l'un de ses attributs, à savoir : « le pouvoir reconnu à l'individu et résultant de la notion d'auto-détermination, de décider en premier lieu lui-même quand et dans quelle mesure des faits relatifs à sa propre existence sont divulgués (…) Cet attribut du droit de la personnalité est appelé « droit à la maîtrise des données personnelles » (…) Il n'est toutefois pas sans limite. (…) »; Council of Europe, Resolution 1165 (1998), 26 June 1998, Droit au respect de la vie privée (24th Session), point 5.
[11] Directive 95/46/EC, art. 2, h.
[12] Directive 95/46/EC, art. 8.2, a) and recital 33.
[13] Directive 95/46/EC, art. 8.2, a).
[14] Directive 95/46/EC, art. 8.2, b).
[15] Directive 95/46/EC, art. 8.2, c).
[16] Directive 95/46/EC, art. 8.2, d).
[17] Directive 95/46/EC, art. 8.2, e).
[18] Directive 95/46/EC, art. 8.3.
[19] However the Directive seems to include only certain purposes relative to healthcare (cf. recital 33).
[20] Directive 95/46/EC, art. 8.4.
[21] Directive 95/46/EC, recital 34.
[22] Directive 95/46/EC, art. 8.6.
[23] Directive 95/46/EC, art. 8.7.
Development of Grid Frameworks for Clinical Trials and Epidemiological Studies

Richard SINNOTT, Anthony STELL, Oluwafemi AJAYI
National e-Science Centre, University of Glasgow, United Kingdom

Abstract. E-Health initiatives such as electronic clinical trials and epidemiological studies require access to and usage of a range of both clinical and other data sets. Such data sets are typically only available over many heterogeneous domains where a plethora of often legacy-based or in-house/bespoke IT solutions exist. Considerable efforts and investments are being made across the UK to upgrade the IT infrastructures across the National Health Service (NHS), such as the National Program for IT in the NHS (NPFIT) [1]. However, currently independent and largely non-interoperable IT solutions exist across hospitals, trusts, disease registries and GP practices – this includes security as well as more general compute and data infrastructures. Grid technology allows issues of distribution and heterogeneity to be overcome; however, the clinical trials domain places special demands on security and data which hitherto the Grid community has not satisfactorily addressed. These challenges are often common across many studies and trials, hence the development of a re-usable framework for the creation and subsequent management of such infrastructures is highly desirable. In this paper we present the challenges in developing such a framework and outline initial scenarios and prototypes developed within the MRC funded Virtual Organisations for Trials and Epidemiological Studies (VOTES) project [2].
1. Introduction

Clinical trials allow for the large-scale assessment of the moderate effects of treatment on various diseases and conditions. Typically the various stages of a trial involve identifying willing participants, evaluating their eligibility for the study, obtaining their consent, beginning the course of treatment and undertaking follow-up study both during and potentially long after the treatment has completed. Statistical analysis of the impact of the trials, e.g. on the efficacy of the drugs being tested, can then be undertaken. The large-scale processes involved can be broadly broken down into three areas: patient recruitment; data management; and study administration and co-ordination.

Until recently it was the case that clinical trials and epidemiological studies would be human-intensive and paper-based. Examples include the West of Scotland Coronary Prevention Scheme (WOSCOPS) study [3] conducted at the University of Glasgow, where over 20,000 letters were sent out to eventually recruit 6,595 middle-aged men (age 45-64) with a mean cholesterol of 7.0 ± 0.6 mmol/l. On a much larger scale, the UK BioBank effort [4] will be sending many millions of letters to potential trial participants in the hope of recruiting 500,000 members of the population between 40-69 years of age. Not only are these expensive solutions, they are also highly inefficient and human-intensive, often with members of the population being contacted who do not meet the appropriate constraints for the given trial, e.g. their cholesterol is too high or too low,
or they are on other drug treatments, etc. E-health initiatives are now moving towards electronic-based clinical trials which in principle offer solutions to improve how trials are set up and subsequently managed. However, establishing an electronic trial is not without its own challenges. Each individual trial will face the same kinds of challenges for recruitment, data management and study co-ordination, hence a framework supporting a multitude of trials would be extremely beneficial and is something currently being explored within the MRC funded VOTES project [2].

Establishing an e-Infrastructure for clinical trials requires addressing the heterogeneity and distribution of systems and data sets, and differences in general practices, e.g. how data is backed up (or not) at given sites. One of the key challenges from an IT perspective is security. The "weakest link" adage applies to security, and a single site that does not take appropriate security considerations – both in terms of the technologies it has used, how it is using them and its general practices – can in principle jeopardise the security of all collaborating sites [5]. The risk of data disclosure is an ever-present security risk that cannot be ignored. Ensuring that Caldicott guardians and other independent senior health professionals, with strategic roles for the management of the data protection or confidentiality associated with patient data sets, are involved in the decisions that influence the development of such infrastructures is crucial to their success: their development, their acceptance, and perhaps more importantly their ethical usage.

It could be argued that the immediate hurdle in establishing an electronic clinical trial is how to recruit people. Key sources of data in Scotland include national census data sets such as those of the General Register Office for Scotland [6], which include information such as the registration of births, marriages and deaths, as well as being the main source of family history records. Such information, whilst useful, does not include direct health-related information, which will likely impact upon the suitability of patients for a trial. Primary care and secondary health care data sets are other immediate choices; however, access to and usage of these data sets will likely require ethical approval. Patients should have the opportunity to consent to their data being accessed and used. However, in running a clinical trial it is often the case that statistical information is enough. Thus rather than disclosing information on specific patients, statistical information is sufficient. Even here, however, questions of ethics are raised. At the very least, doctors and their patients need to be included in any data access decisions.

Yet the establishment and running of electronic clinical trials remains a compelling prospect, with data often being stored in some form of digital format, albeit across a multitude of databases behind firewalls. One of the key challenges is to allow secure access to these data sets by the right people for the right purpose. High levels of security should not come at the cost of usability. A good example of this is the remote-control car key: a technologically far more complex security solution, but easier to access and use. Similarly, end users of e-Infrastructures should be largely unaware of the fine-grained security solutions that are restricting and controlling their access to and usage of the facilities.
Usability of the infrastructures is of utmost importance to their success and take-up [7].

In this paper we describe our attempts to establish and support a Grid framework at the National e-Science Centre (NeSC) in Glasgow as part of the initial phase of the VOTES project. As this work is in its early stages, the solution presented is necessarily grounded in this specific use-case, but is conducted with a view to scaling up and generalising as the project proceeds. Through this framework we expect to support the
efficient establishment and subsequent conduct of clinical trials and studies. In the rest of this paper we present the technical and non-technical challenges facing the design and development of this framework, along with an outline of the early proof-of-concept prototypes currently supported. We also outline the future work of the project and the challenges still to be addressed to realise the vision of an e-Infrastructure for a range of clinical trials and studies.
2. Existing Infrastructures and Data Sets across Scotland

The VOTES project [2] is a collaborative effort between e-Science, clinical and ethical research centres across the UK, including the universities of Oxford, Glasgow, Imperial, Nottingham and Leicester. The primary focus of VOTES is to build an infrastructure to support a multitude of clinical virtual organisations. Virtual organisations (VOs) are a common concept in the Grid community and provide a conceptual framework through which the rules associated with the participants, their roles and the resources to be shared can be agreed and subsequently enforced across the Grid. VOs in the clinical trials domain are characterised by a much greater emphasis on security, data access and data ownership. We term these Clinical Virtual Organisations (CVOs), since they place requirements not typical of the High Performance Computing-oriented VOs common to the wider Grid community. Rather than developing bespoke CVOs for each individual clinical trial, it is our intention to develop a framework supporting a multitude of CVOs. Each of these CVOs will be derived from the framework and adapted depending on the needs of the trial or study being conducted.

Common phases of many clinical trials and epidemiological studies, and the primary focus for core components that will exist in the VOTES Grid framework, are:
• Patient recruitment: enabling semi-automated large-scale recruitment methods for investigators conducting large-scale clinical studies in a variety of settings;
• Data collection: incorporating data entry, including intermittent connectivity to other resources such as trial-specific databases, code lists for adverse events and non-study drugs, randomization programs, and support for the internationalisation of case report forms;
• Study administration: supporting the administration of the study, including logging details of essential documents, enabling rapid dissemination of study documentation, and co-ordinating the transport of study treatment and the collection of study samples.

The first step in developing a Grid framework for clinical trials is to identify the potential sources of data and the services that allow access to such data. Close liaison with data providers, data owners and existing services is essential. Within the Scottish element of VOTES we are working closely with the NHS in Scotland, who have identified the following data sets and software which provide initial coverage of the sets of data needed for clinical trials and epidemiological studies¹:
• The General Practice Administration System for Scotland (GPASS) [8] is the core IT application used by over 85% of clinicians and general practitioners involved in primary care across Scotland;
• Scottish Morbidity Records (SMR) [9] include records relating to all patients discharged from non-psychiatric and non-obstetric wards in Scottish hospitals (including data sets on death, cancer, hospital admissions, etc.);
• The Scottish Care Information Store (SCI Store) [10] is a batch storage system which allows hospitals to add a variety of information to be shared across the community; pathology, radiology and biochemistry lab results are just some of the data supported by SCI Store. Regular updates to SCI Store are provided by the commercial supplier using a web services interface. Currently there are 15 different SCI Stores across Scotland (with 3 across the Strathclyde region alone). Each of these SCI Store versions has its own data models (and schemas) based upon the regional hospital systems it supports. The schemas and the software itself are still undergoing development;
• The NHS data dictionary [11] is a one-stop shop for health and social care data definitions and standards. It contains a summary of concepts for the SMR data sets, including online manuals for the data sets, and information on the clinical data sets in use in healthcare and the social care data sets, along with the data standards upon which they are based.

¹ This does not imply that this data is readily available directly, but that these are the sources of data and software with which we should eventually be interfacing.

The Scottish component of the Grid framework under development within VOTES is being targeted at these resources. Components which allow secure and ethical access to GPASS, for example, will provide a highly generic, reusable solution applicable to over 85% of all practices across Scotland. In parallel, solutions accessing NHS resources are also being developed by the other partners.

A summary of the challenges involved includes, broadly: the need for a common definition of clinical standards; the need to maintain security whilst still taking advantage of the flexibility of Grid solutions; and the needs for scalability, authorization and anonymisation. The following sections address these challenges in more detail.
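To give a flavour of the first of these challenges before turning to the details, the sketch below shows one minimal way in which site-local clinical codings might be reconciled when results are combined across sites. It is purely illustrative: the class and the specific code mappings are our own examples and not part of the VOTES software.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: maps (coding system, local code) pairs onto a
// common concept so that results from heterogeneous sites can be joined.
public class CodeMapper {
    private final Map<String, String> mappings = new HashMap<>();

    public void addMapping(String system, String localCode, String concept) {
        mappings.put(system + ":" + localCode, concept);
    }

    // Returns the common concept, or null if the local code is unmapped.
    public String toCommon(String system, String localCode) {
        return mappings.get(system + ":" + localCode);
    }

    public static void main(String[] args) {
        CodeMapper mapper = new CodeMapper();
        // Hypothetical mappings: three local codings of the same diagnosis
        mapper.addMapping("ICD-10", "I21", "acute-myocardial-infarction");
        mapper.addMapping("ICD-9", "410", "acute-myocardial-infarction");
        mapper.addMapping("READ", "G30", "acute-myocardial-infarction");

        // A federated query can now compare records from different sites
        System.out.println(mapper.toCommon("ICD-9", "410"));
        System.out.println(mapper.toCommon("READ", "G30"));
    }
}

In practice such mappings would be driven by the standards initiatives discussed in the next section rather than by hand-coded tables.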
3. Data Federation and Distributed Security Challenges

As CVOs necessarily span heterogeneous domains, a pre-requisite to the construction of distributed queries, and to the aggregation or joining of the data returned, is the development and use of a standard method of classification or, more generally, a common vocabulary. This includes the naming of the data sets themselves, the people involved and their roles (privileges) in the access to and usage of these data sets, amongst other things. Ideally these data and roles should be standardised so that comparisons can be drawn and queries joined together, for example across a range of clinical data sets.

There are numerous developments in standards for the description of data sets used in the clinical trials domain. However, this can be an involved process, depending on standards groups developing and acting on strategies put together through major initiatives such as Health-Level 7 (HL7) [12], SNOMED-CT [13] and OpenEHR (Open Electronic Health Records) [14]. There is often a wide range of legacy data sets and naming conventions which impact upon standardisation processes and their acceptance. The International Statistical Classification of Diseases and Related Health Problems version 10 (ICD-10) [15] is used for the recording of diseases and health-related problems and is supported by the World Health Organisation. In Scotland, ICD-10 is used within the NHS along with ICD version 9 and Read codes in the SMR data
sets, for example. ICD-10 was introduced in 1993, but the ICD classifications themselves have evolved since the 17th century [16]. An explicit example of the problems facing large-scale (international) clinical trials is the term "neoplasia", which means "new growth", covering benign and malignant tumours, in Northern Europe, but "cancer" in Southern Europe. Hence, the type of treatment provided depends heavily on the location of the patient. Global Grid frameworks that incorporate appropriate meta-data identifying the different local data classifications can provide capabilities to address such discrepancies.

The standardisation process itself may influence how readily any given standard is adopted. For example, standards developed to specific deadlines during the standard-making process, and standards bodies producing regular updates with solutions readily available for implementation, are more likely to gain acceptance. This is also the case within the Grid community. Linking standardised data descriptions between domains, so that entities and relationships within one organisational hierarchy can be mapped or understood within the context of another domain, is fundamental to the development of the Grid applications proposed in VOTES. Once it has been established how meaningful comparisons can be made between the schemata of differing domains, this knowledge can be applied to a generic clinical trial that could run queries across heterogeneous domains, bringing back generic results richer in scope and information than if single local sites had been independently queried.

Information stored in clinical trials is, by its nature, highly sensitive – the drug treatments, conditions and diseases that patients have must be kept in the strictest confidence, and the exact details should only be known by a few privileged roles in the trial. This is one of the most fundamental challenges in this work: to realise the opportunities and benefits that Grid technology can bring to this field whilst also maintaining the high security standards that must be strictly adhered to. Within the Grid community, VO security issues are generally grouped into the following categories:
• Authentication – the discovery of a user's identity. This is achieved in most Grid applications by the use of the well-established Public Key Infrastructure (PKI) technology [17].
• Authorization – the discovery of that user's privileges based on their identity. This is less well-established in the Grid community. Various software solutions are available for the establishment of user privilege assertions – PERMIS [18] (which implements the Global Grid Forum Authz API [19]), the Community Authorization Service (CAS) [20], the Virtual Organisations Management Service (VOMS) [21], and Akenti [22] – with no single model having been adopted over the others.
• Accounting – logging the activity of users so that they can be held accountable for their actions within a system. This is also less well-established, with many implementations coming from "home-grown" solutions within different projects. Though important in an overall security strategy, this area is usually addressed once the solid platform of authentication and authorization has been established.

Authentication in the Grid is achieved using PKI technology. This involves using a combination of public certificates and public and private keys to verify that a user is who they say they are.
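As a concrete illustration of this mechanism, the following minimal sketch checks that a user certificate is valid and was signed by a trusted CA. The file names are placeholders, and real Grid middleware (e.g. the Globus security infrastructure) layers proxy certificates, revocation checking and policy on top of such a check.

import java.io.FileInputStream;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;

// Minimal sketch of the PKI check underlying Grid authentication.
public class CertCheck {
    public static void main(String[] args) throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");

        // Hypothetical paths to the user and CA certificates
        X509Certificate user = (X509Certificate)
                cf.generateCertificate(new FileInputStream("usercert.pem"));
        X509Certificate ca = (X509Certificate)
                cf.generateCertificate(new FileInputStream("ca.pem"));

        user.checkValidity();            // rejects expired or not-yet-valid certificates
        user.verify(ca.getPublicKey());  // throws if not signed by this CA

        System.out.println("Authenticated: " + user.getSubjectX500Principal());
    }
}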
This is a well-established way of establishing user identity; however, it has limitations as a standalone security solution in terms of general usability, security granularity and overall scalability [23,24]. A more scalable, user-oriented solution which is being explored within the VOTES project is the Internet2 Shibboleth technology [25]. Shibboleth allows the delegation of
authentication to the local sites involved. Through agreed federations, where security attributes for fine-grained authorisation are pre-agreed, users are able to access and use remote Grid resources through local (home) authentication [26,27]. Typically they will log in with their own usernames/passwords at their home institution, and the security attributes (which might include their roles in particular clinical trials, for example) are then released and used by the target site to determine whether access to the resources being requested should be granted. As well as supporting seamless single sign-on to Grid infrastructures, this model moves the whole process of identity establishment and authentication to the home site. It also minimises the potential dangers of users writing down their PKI passwords, and transparently restricts what they are able to do on the remote Grid resources. In the clinical trials domain, it is paramount that site autonomy is supported. If the home site at which a user authenticates does not release all the necessary attributes as agreed within the federation, then the user will not be allowed access to and usage of the remote resource. We note that the Shibboleth model is inherently more static than the true dynamic vision of the Grid, where data and resources are found and used "on-the-fly". This static-oriented model is consistent with the clinical domain, however, where it is highly unlikely that new people, new data sets or new services are continually, dynamically added to or removed from the clinical environment.

The issue in Grid security that is much less well-established than authentication is that of privilege management – what a user can actually do once their identity has been verified. The main issue is the heterogeneous nature of the domains across which the data is being federated. Security policies will naturally differ between local sites, which leads to several challenges when defining and implementing policies that take account of both local and remote security concerns. These include:
• Applying a generic policy that takes account of each local policy, or linking local policies together using a standard interface.
• Dynamically enforcing these policies so that, for example, restrictions applied by a site not providing pertinent information for a particular query will not impact on the sites that are involved.
• Building a trust chain that allows local sites to authenticate to the VO and therefore, by proxy, be authenticated to limited resources at other sites without compromising protected resources at those other sites.
• Prevention of inference (statistical disclosure) that arises when data is aggregated from numerous sources.
• Maintaining data ownership and enforcing ownership policies regardless of where the data might be moved to, stored or used.

In addition to authentication and authorization, another artefact of security that is essential in this domain is "anonymisation". This process involves allowing less-privileged users to gather statistical data for the purposes of studies or trials, but without revealing the associated identifying data – the latter only being available to users with greater privileges. The NHS in Scotland currently achieves this by encrypting a unique number associated with all patients across Scotland: the Community Health Index (CHI) number.
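A minimal sketch of this style of pseudonymisation is given below: the identifying CHI number is encrypted before a data set is released, so that only the holder of the key can recover the patient identity. The key handling, CHI value and class are illustrative only; they do not reflect how any NHS system actually manages its keys.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.Base64;

// Sketch of CHI pseudonymisation: encrypt the identifier on release,
// decrypt it only where the key is held (e.g. by Practitioner Services).
public class ChiPseudonymiser {
    public static void main(String[] args) throws Exception {
        // In practice the key would be managed within the NHS, not generated ad hoc
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        String chi = "0101016789";  // illustrative 10-digit CHI number

        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        String pseudonym = Base64.getEncoder()
                .encodeToString(cipher.doFinal(chi.getBytes("UTF-8")));
        System.out.println("Released with data set: " + pseudonym);

        // Only the key holder can map the pseudonym back to the patient
        cipher.init(Cipher.DECRYPT_MODE, key);
        String recovered = new String(
                cipher.doFinal(Base64.getDecoder().decode(pseudonym)), "UTF-8");
        System.out.println("Recovered by key holder: " + recovered);
    }
}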
Once an anonymised patient has been matched for a clinical trial, this encrypted value can in principle be sent to the Practitioner Services group (http://www.psd.scot.nhs.uk/) of the NHS who, as one of the many services that they provide, will decrypt it and contact the patients directly (assuming ethical permission
has been granted for so doing) to ask if they wish to join the clinical trial. Several challenges must be overcome to support this, including ensuring that only privileged users are able to access and use data sets including this encrypted CHI number. A further challenge is that there are currently many independent solutions across the NHS for how they manage their infrastructures. Thus, for example, there is no standardised way in which encryption is undertaken. Hence it is often difficult or impossible to ask the Practitioner Services Division (PSD) to de-anonymise an encrypted CHI number if it was generated by an arbitrary NHS trust. Pragmatic solutions overcoming the nuances of NHS systems are thus necessary.

Throughout the VOTES project, continuous ethical and legal overview of the solutions being put forward and of the data sets being accessed is being undertaken. This includes consideration of the perceived benefits of the research for the public, and is undertaken by independent ethical oversight committees. To support this, superior security roles for oversight committee members, which allow access to all data sets and reports for given clinical trials, will be made available.

4. Initial VOTES Scenarios, Architecture and Implementation

In designing a reusable Grid framework for clinical trials, immediate restrictions are imposed on the possible architectural solutions. Thus it is unlikely that direct access to and usage of "live" NHS data sets and resources will be achieved, where direct here implies that the Grid infrastructure can issue queries to a remote NHS-controlled resource containing un-anonymised patient information, i.e. to a resource behind the NHS firewall. Nevertheless, it is possible to design solutions capturing sufficient information needed for a clinical trial without over-riding existing security solutions or assuming ethical permissions where none have been granted. Possible solutions being explored here include a push model, where anonymised NHS data sets are exported to the academic Grid community (or to an NHS server in a demilitarised zone of the NHS). Another model is to allow the GPs and clinicians to drive the recruitment process, provided they consider that this is in the best interests of the patients. The exploration of these solutions may provide a basis for follow-up projects in this field.

The following scenario presents a representative sequence of interactions demonstrating how primary care identification and recruitment of patients can be ethically achieved with patient and doctor consent. The scenario in Figure 1 is based on discussions with Scottish clinicians, NHS IT personnel and GPASS developers, and is currently being prototyped in VOTES.
[Figure 1 shows a trials coordinator and a GP (with browser) interacting, through numbered steps 0-9, with a trials portal offering personalised services and per-trial Grid services (Trial #1, #2, #3), backed by a Transfer Grid node providing OGSA-DAI access to a secure data repository, with the GP's private data sets remaining local.]
Figure 1: Example use of patient recruitment Grid application
0. A trials coordinator logs into a portal hosting various CVOs associated with a variety of clinical trials². At this point, a personalised environment is established based upon their specific role in the CVO (in this case, that of the trials coordinator) and the location from which they are accessing the portal. Thus they should only see the Grid services pertinent to the trials applicable to them, and hence the data sets associated with those services.
1. The trials coordinator wishes to recruit patients for a particular trial. These patient details are only available in GPs' local (and secure) databases – extensions to this scenario dealing with access to and usage of hospital databases are also possible. Emails are sent to the GPs/hospitals with information describing the particular trial to be conducted, the general criteria applicable to matching patients, and other information, e.g. financial information about partaking in the trial. The email contains a link to a Grid service (trial #1). The GPs themselves are described in policies associated with the tentative set-up of a CVO for patient identification and recruitment.
2. We assume that the GP is interested in entering into the trial, i.e. they know that they have matching patients, and they follow the attached link. Depending upon whether a PKI has been rolled out to this GP and a suitable certificate (e.g. using the X.509 standard) is already in the browser, or a username and password combination is used instead, the GP securely accesses the Grid service. In this scenario we assume trusted certificates are being used.
3. After extracting more information about the trial from the portal, the GP decides to download a signed XML pro-forma pre-designed for this specific trial. This is a mostly complete document describing the main information relevant to this trial as documented in the trial protocol, where the empty fields need to be filled through a query to the GP's database.
4. The signature of the signed pro-forma document is checked to ensure its authenticity and that it has not been corrupted. If both hold, the document is used as the basis for an XML query against the GP's database (GPASS supports such an interface). This query might in turn result in further information being extracted from other resources.
5. At this point, letters describing the trial to matching patients can be automatically produced. These are used to obtain patient consent before continuing further with the trial.
6. The matching patients may then consent to entering into the trial. Note that these letters of consent may be sent directly to the trials coordinator instead of the GP as depicted here.
7. The forms are automatically completed based on the results of the queries to the GP database, digitally signed and returned to the Grid service for that particular trial (trial #1).
8. The returned signed XML document is authenticated, and checks are made that the sender (the GP) is authorised to upload this document, e.g. by checking that they were one of the GPs contacted initially. The document is validated to ensure its correctness, e.g. by ensuring that it satisfies the associated schema and that the relevant data fields are meaningfully completed (and match the desired constraints associated with participation in the trial). At this point, the responding GP is formally added to the CVO. Further follow-up information may subsequently be sought, e.g. monitoring information related to the matching patients.
9. The completed XML document and the associated meta-data describing the history of how this information was established – by whom, when, for which trial, etc. – are uploaded and securely added to the CVO repository for this particular trial.

² Of course there are scenarios which predate this one, e.g. how the CVO is established in the first instance and the policies by which the VO will be organised, managed and enforced.

It is important to note in this scenario that patient consent is given (step 6) before patient data is returned to the clinical trials team. Another important aspect is that the GP can decide whether this might be in the patients' interest. The patient may ultimately say no, and hence is always involved in the process. We note also that software solutions already exist for several parts of this scenario, e.g. the automatic production of letters inviting patients to join the trial. Similar scenarios covering user-resource interactions are being developed and implemented within VOTES, supporting secondary care patient recruitment as well as general data collection and study management.

In this scenario we include a secure repository accessible via the Open Grid Services Architecture Data Access and Integration (OGSA-DAI) middleware [28]. This repository forms part of what we term the "Transfer Grid", as indicated in Figure 2. The Transfer Grid infrastructure provides the core of the Grid infrastructure that will underpin future CVOs, i.e. it is the platform upon which the Grid solutions developed for security, data access and management, and data movement between repositories hosted at the partner and collaborating institutions can be supported. Since the Transfer Grid exists in the academic domain and not behind the NHS firewall, a variety of solutions for accessing and using the clinical trial data sets can be explored. The Grid applications pertinent to the clinical trials domain are constructed over this layer, providing the deliverable trial services. This infrastructure will be expanded to include external peer sites of two classes:
• Routine repositories, such as those held by general practices, hospitals, disease-specific registries, device registries or the Office for National Statistics (ONS).
• Study repositories, such as research systems developed for a particular trial or observational study.
These external peers will supply their own security policies, and may be intermittently connected to the Transfer Grid. As such, interfacing with routine repositories will be a highly involved and politically sensitive process. This motivates the need for the initial solution to be scalable.
[Figure 2 depicts the CVO framework being used to realise CVO-1 (e.g. for data collection) and CVO-2 (e.g. for recruitment) over a Transfer Grid linking the partner sites (Glasgow, Oxford, Imperial, Nottingham and Leicester) with key sources of clinical trial data: disease registries, hospital databases and GPs.]
Figure 2: CVO Framework, Transfer Grid and Key Sources of Data
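Returning to the recruitment scenario, steps 3, 4, 7 and 8 rely on signing and verifying the XML pro-forma. For brevity, the sketch below uses a detached RSA signature over the raw document bytes rather than full XML-DSig; the document content and key pair are illustrative, not VOTES code.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Sketch of the pro-forma signing (step 3) and verification (steps 4 and 8).
public class ProFormaSignature {
    public static void main(String[] args) throws Exception {
        KeyPair keys = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        byte[] proForma =
                "<proforma trial=\"trial1\"><criteria>...</criteria></proforma>"
                        .getBytes("UTF-8");

        // Trial service signs the pro-forma before publishing it
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(proForma);
        byte[] signature = signer.sign();

        // GP side verifies authenticity and integrity before use; the same
        // check is applied to the completed form when it is returned
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(proForma);
        System.out.println("Signature valid: " + verifier.verify(signature));
    }
}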
4.1. Current Software Architecture

The basic architecture of this Grid framework, which supports federated queries in a user-oriented but secure manner, is depicted in Figure 3. This infrastructure corresponds to one node of the Transfer Grid outlined above and is hosted on a trial test bed at the National e-Science Centre (NeSC) at the University of Glasgow.
[Figure 3 shows a portal front-end, a Grid server running a Globus container, and a data server hosting an OGSA-DAI service over a driving database and subsidiary test databases: two SCI Store instances (SQL Server), a consent database (Oracle 10g) and an RCB test trials database (SQL Server), with an "Oxford" box alongside the Glasgow databases.]
Figure 3: Software architecture schematic. The “Oxford” box indicates how other institutions will be added to the current design – the current implementation only incorporates the test databases running in Glasgow.
A GridSphere [29] portal front-end communicates with a Globus Toolkit [30] (v4.0) Grid service, which in turn provides access to an OGSA-DAI [28] data service. This runs queries against the "driving database" using standard Simple Object Access Protocol (SOAP) message-passing, but also in turn runs queries against the subsidiary databases available from the pool for which it is responsible, using direct Java Database Connectivity (JDBC) connections.

The technology used in this implementation places a strong emphasis on the use of Grid services – essentially web services with the additional notion of permanent state. Within the Grid community this paradigm has largely been seen as the most effective solution for implementing transient and dynamic virtual organisations. An example of this is the Web Services Resource Framework (WS-RF) [31] as implemented in version 4.0 of the Globus Toolkit. Issues of access control are integrated within this framework by means of the Security Assertion Markup Language (SAML), which allows a standard exchange of security assertions and attributes. A popular implementation of this standard has been the OpenSAML project [32], which is now following the latest release of SAML, v1.1, and is currently developing an implementation of v2.0 [33].

The user accesses this infrastructure through a GridSphere portal at [2]. With the appropriate privileges, users can currently bring back data from the database back-ends implemented in multiple test repositories of SCI Store and GPASS. Unprivileged users can retrieve limited data sets, with the identifying patient data anonymised and other restrictions applied. Through the use of this application, the end user is able to seamlessly access a set of resources pertinent to clinical trials in a dynamic, secure and pervasive fashion. Depending on the user's privileges, the results returned have varying degrees of verbosity, thereby allowing limited statistical analysis without compromising the privacy restrictions necessarily applied to such sensitive data.
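The federation pattern just described can be sketched as follows: a driving node issues the same sub-query to each subsidiary database over JDBC and gathers the results for joining. The connection strings, credentials, table and column names are all hypothetical, and the appropriate JDBC drivers are assumed to be on the classpath; in VOTES this role is played by the OGSA-DAI data service rather than hand-written code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch of a driving node gathering results from subsidiary databases.
public class FederatedQuery {
    // Each entry stands for one test repository (e.g. a SCI Store instance)
    private static final String[] SITES = {
        "jdbc:sqlserver://scistore1;databaseName=trials",
        "jdbc:sqlserver://scistore2;databaseName=trials",
        "jdbc:oracle:thin:@testtrials:1521:trials"
    };

    public static void main(String[] args) throws Exception {
        String subQuery = "SELECT patient_id, diagnosis FROM trial_candidates";
        List<String> gathered = new ArrayList<>();

        for (String url : SITES) {
            // Credentials would come from the security infrastructure
            try (Connection c = DriverManager.getConnection(url, "user", "pw");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(subQuery)) {
                while (rs.next()) {
                    gathered.add(rs.getString("patient_id") + " : "
                            + rs.getString("diagnosis"));
                }
            }
        }
        // The driving node would now join, filter and anonymise the results
        gathered.forEach(System.out::println);
    }
}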
In the current version of the system, to explore the problem space and gain familiarity with the clinical data sets used across Scotland, several "canned queries" representing valid clinical trial queries can be run, which seamlessly access and use the distributed back-end test databases as depicted in Figure 4.
Figure 4: Screen-shot of VOTES portal welcome screen (left) showing several “canned queries” with the type of result returned based on whether the user is privileged or not (right).
Users with insufficient privileges may still be able to run queries, but may not be able to see all of the associated identifying data sets (see Figure 5). It is important to note that all of this is completely transparent to the end users of the system.
Figure 5: Results from an unprivileged user running a canned query. Identifying data is blanked out whilst statistically relevant data is available. Also the number of databases across which the query has been run is reduced because of lack of privileges.
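The redaction behaviour shown in Figure 5 can be sketched in a few lines: identifying fields are blanked for unprivileged users while statistically useful fields pass through. The field names and the single privilege flag are illustrative simplifications of the portal's actual role handling.

import java.util.Arrays;
import java.util.List;

// Sketch of privilege-based blanking of identifying fields.
public class ResultRedactor {
    private static final List<String> IDENTIFYING =
            Arrays.asList("name", "address", "chi_number");

    // Returns the value the user is allowed to see for a given field
    static String view(String field, String value, boolean privileged) {
        if (!privileged && IDENTIFYING.contains(field)) {
            return "*****";  // blanked, as in the portal screen-shots
        }
        return value;
    }

    public static void main(String[] args) {
        String[][] record = {
            {"name", "J. Smith"}, {"chi_number", "0101016789"},
            {"diagnosis", "I21"}, {"age_band", "45-64"}
        };
        for (String[] field : record) {
            System.out.println(field[0] + " = "
                    + view(field[0], field[1], false));
        }
    }
}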
Another key aspect of this infrastructure is how patient consent is handled. Currently the system supports a variety of models which allow exploration of the potential solution space for patient consent across Scotland. For example, solutions have been prototyped which allow patients to consent to their data being used for a specific clinical trial, for a particular disease area, or to their data being used generally. In addition, the system also allows patients to opt out, i.e. their data sets may not be used for any purposes. Numerous variations on this are also being explored,
e.g. the patients' data may only be used provided they are contacted in advance. To support this, a consent database has been established and is used, when the joining of the federated queries is undertaken, to decide whether the data should be displayed, displayed but anonymised, or not displayed at all.

The NeSC at Glasgow has extensive experience with a range of fine-grained authorisation infrastructures across a range of application domains [34-36]. Whilst we expect to move the existing prototype to a more robust authorisation solution, for rapid prototyping purposes – to explore the problem space and get user feedback as early as possible – we have developed an authorization infrastructure based on an access matrix, as shown in Figure 6.
U1(R1 Δ h3) = 1
U2(R1 Δ h2) = 0
U3(R3 Δ h1) = 1
U4(R2 Δ R3 Δ h4) = 0

where Δ is a combination function, 0 and 1 are bit-wise privileges, Rx and hx are resources, and Ux is a subject.

Figure 6: Access Matrix Model
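The semantics of Figure 6 might be realised as in the sketch below, where we assume – the paper does not fix this – that the combination function Δ simply ANDs the bit-wise privileges of the resources involved; the subjects and resources are illustrative.

import java.util.HashMap;
import java.util.Map;

// Sketch of a bit-wise access matrix in the style of Figure 6.
public class AccessMatrix {
    private final Map<String, Integer> matrix = new HashMap<>();

    void grant(String subject, String resource, int privilege) {
        matrix.put(subject + "|" + resource, privilege);
    }

    // Combination function Δ, assumed here to AND the privilege bits
    int check(String subject, String... resources) {
        int result = 1;
        for (String r : resources) {
            result &= matrix.getOrDefault(subject + "|" + r, 0);
        }
        return result;
    }

    public static void main(String[] args) {
        AccessMatrix m = new AccessMatrix();
        m.grant("U1", "R1", 1);
        m.grant("U1", "h3", 1);
        m.grant("U2", "R1", 1);  // no privilege granted on h2

        System.out.println(m.check("U1", "R1", "h3"));  // 1, as in Figure 6
        System.out.println(m.check("U2", "R1", "h2"));  // 0, as in Figure 6
    }
}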
The authorisation mechanism implements an access matrix model [37] that specifies the bit-wise privileges of users and their associations to data objects in the CVO. The access matrix is designed to enforce discretionary and role-based access control policies, and has been constructed to be scalable so that it can grow in parallel with the growth of the infrastructure as a whole. A comparison of this approach with Role Based Access Control solutions such as PERMIS will be undertaken, where user views of data sets will be mapped to CVO roles.

The federated data system [38] is currently composed of four autonomous test sites, each providing a clinical data source using either SQL Server [39] or Oracle [40]. The data sources exposed by these sites are configured as data resources on an OGSA-DAI data service. The OGSA-DAI data service implements a head node model to drive the data federation. The head node is selected based on rules or request requirements, and is responsible for decomposing queries, distributing sub-queries, and gathering and joining query results.

In the current implementation, data federation security is achieved at both the local and the remote level. The local level security, managed by each test site, filters and validates requests based on local policies at Database Management System (DBMS) level. The remote level security is achieved by the exchange of access tokens between the designated Source of Authority (SOA) of each site. These access tokens are used to establish remote database connections between the sites in the federation. In principle, local sites authorise their users based on delegated remote policies. This is along the lines of the CAS model [20].

5. Conclusions and Future Work

The VOTES prototype software is very much a work in progress. Yet the experiences in developing this prototype are helping to gain a better understanding of the clinical domain problem space and are shaping the planned Grid framework. The vision
of a Grid framework eventually supporting a myriad of clinical trials and epidemiological studies is a compelling one, but can only be achieved once experience has been gained in accessing and using a wide variety of clinical data sets. In achieving this, it is immediately apparent that there are a number of political and ethical issues that must be addressed when dealing with data-sharing between domains, and these are inherently more difficult to deal with than the technological challenges.

Whilst the NHS in Scotland, and the UK more widely, are taking steps to standardise their data sets, these are still far from being fully implemented (and accepted) by clinical practitioners. For instance, the unique index reference number, the Community Health Index (CHI), has only been implemented across some regions of Scotland and therefore leaves certain areas with incomplete references. Those records that do not have the CHI number are referenced using a different Patient Identification (PID) number that is idiosyncratic to the region in question. There is also a need to build up a trust relationship with the end-user institutions that we are working with to provide this clinical infrastructure. This necessarily takes time, and will be furthered by engaging in an exchange programme where employees from NeSC work with and understand the processes in the NHS IT departments, and vice versa.

The current Grid infrastructure described here has allowed the investigation of automatically implementing combinations of patient consent policies. Ideally such a consent register would be maintained nationally; this does not yet exist, but is planned with the electronic patient record under discussion across the NHS in Scotland. Demonstrations of working solutions showing the trade-offs in consent or assent, with opt-in versus opt-out possibilities, allow policy makers to see at first hand what the impact of their ultimate decisions might be. We believe that it is easier to convince policy makers when they see actual working solutions rather than theoretical discussions of what might be achieved once the infrastructures are in place.

The applications in this project are being developed with a view to being rolled out to NHS Scotland in the first instance, moving from test data to "live" data, with fully audited and standards-compliant security, upon establishment of reliability and production value. The eventual vision is that this infrastructure will one day be available on a global scale, allowing health information to be exchanged across heterogeneous domains in a seamless, robust and secure manner. In this regard, we are currently exploring international collaborative possibilities with the caBIG project in the US [41] and, closer to home, with genetics and healthcare projects across Scotland [42].
6. References [1] National Program for IT in the NHS (NPFIT) - http://www.connectingforhealth.nhs.uk [2] Virtual Organisations for Trials and Epidemiological Studies (VOTES) http://www.nesc.ac.uk/hub/projects/votes/ [3] West Of Scotland Coronary Prevention Scheme (WOSCOPS) http://www.gla.ac.uk/departments/pathologicalbiochemistry/lipids/woscops.html [4] UK BioBank project - http://www.ukbiobank.ac.uk [5] R. O. Sinnott, Grid Security: Middleware, Practices and Outlook, prepared for the Joint Information Services Council (JISC), www.nesc.ac.uk/hub/projects/GridSecurityReport [6] General Register Office for Scotland, http://www.gro-scotland.gov.uk/ [7] R.O. Sinnott, Development of Usable Grid Services for the Biomedical Community, Workshop on Designing for Usability in e-Science, Edinburgh, January 2006, http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=613.
[8] General Practitioners Administration System for Scotland (GPASS), http://www.show.scot.nhs.uk/gpass/
[9] Scottish Morbidity Records (SMR), http://www.show.scot.nhs.uk/indicators/SMR/Main.htm
[10] Scottish Care Information (SCI) Store, http://www.show.scot.nhs.uk/sci/products/store/SCIStore_Product_Description.htm
[11] NHS Data Dictionary – www.isdscotland.org
[12] Health-Level 7 (HL7) - http://www.hl7.org/
[13] SNOMED-CT - http://www.snomed.org/snomedct/
[14] OpenEHR - http://www.openehr.org/
[15] International Statistical Classification of Disease and Related Health Problems (ICD-10), http://www.connectingforhealth.nhs.uk/clinicalcoding/classifications/icd_10
[16] ICD background, http://www.connectingforhealth.nhs.uk/clinicalcoding/faqs/
[17] R. Housley, T. Polk, Planning for PKI: Best Practices Guide for Deploying Public Key Infrastructures, Wiley Computer Publishing, 2001.
[18] PERMIS - http://sec.isi.salford.ac.uk/permis/
[19] R.O. Sinnott, D.W. Chadwick, Experiences of Using the GGF SAML AuthZ Interface, Proceedings of UK e-Science All Hands Meeting, September 2004, Nottingham, England.
[20] CAS - http://www.globus.org/toolkit/docs/4.0/security/cas/
[21] VOMS - http://hep-project-grid-scg.web.cern.ch/hep-project-grid-scg/voms.html
[22] Akenti - http://dsd.lbl.gov/Akenti/
[23] R.O. Sinnott, A.J. Stell, D.W. Chadwick, O. Otenko, Experiences of Applying Advanced Grid Authorisation Infrastructures, Proceedings of European Grid Conference (EGC), LNCS 3470, pages 265-275, Volume editors: P.M.A. Sloot, A.G. Hoekstra, T. Priol, A. Reinefeld, M. Bubak, June 2005, Amsterdam, Holland.
[24] A.J. Stell, R.O. Sinnott, J. Watt, Comparison of Advanced Authorisation Infrastructures for Grid Computing, Proceedings of International Conference on High Performance Computing Systems and Applications, May 2005, Guelph, Canada.
[25] Shibboleth Project - http://shibboleth.internet2.edu/
[26] R.O. Sinnott, J. Watt, O. Ajayi, J. Jiang, J. Koetsier, A Shibboleth-Protected Privilege Management Infrastructure for e-Science Education, submitted to CLAG+Grid Edu Conference, May 2006, Singapore.
[27] R.O. Sinnott, J. Watt, O. Ajayi, J. Jiang, Shibboleth-based Access to and Usage of Grid Resources, submitted to International Conference on Emerging Trends in Information and Communication Security, Freiburg, Germany, June 2006.
[28] OGSA-DAI – http://www.ogsadai.org.uk
[29] GridSphere – http://www.gridsphere.org
[30] Globus Toolkit – http://www.globus.org/toolkit
[31] Web Services Resource Framework (WS-RF) – http://www.globus.org/wsrf
[32] OpenSAML Project – http://www.opensaml.org
[33] OpenSAML Development Wiki - https://authdev.it.ohio-state.edu/twiki/bin/view/Shibboleth/OpenSAML
[34] R.O. Sinnott, M.M. Bayer, J. Koetsier, A.J. Stell, Grid Infrastructures for Secure Access to and Use of Bioinformatics Data: Experiences from the BRIDGES Project, submitted to 1st International Workshop on Bioinformatics and Security (BIOS'06), Vienna, Austria, April 2006.
[35] R.O. Sinnott, M. Bayer, D. Berry, M. Atkinson, M. Ferrier, D. Gilbert, E. Hunt, N. Hanlon, Grid Services Supporting the Usage of Secure Federated, Distributed Biomedical Data, Proceedings of UK e-Science All Hands Meeting, September 2004, Nottingham, England.
[36] R.O. Sinnott, A.J. Stell, J. Watt, Experiences in Teaching Grid Computing to Advanced Level Students, Proceedings of CLAG+Grid Edu Conference, May 2005, Cardiff, Wales.
[37] R.S. Sandhu and P. Samarati, "Access control: Principles and practice," IEEE Communications Magazine, vol. 32, no. 9, pp. 40-48, 1994.
[38] A.P. Sheth and J.A. Larson, "Federated database systems for managing distributed, heterogeneous, and autonomous databases," ACM Comput. Surv., vol. 22, no. 3, pp. 183-236, 1990.
[39] SQL Server – http://www.microsoft.com/sql
[40] Oracle – http://www.oracle.com
[41] National Cancer Institute, cancer Biomedical Informatics Grid, https://cabig.nci.nih.gov/
[42] Generation Scotland Scottish Family Health Study, http://www.innogen.ac.uk/Research/The-Scottish-Family-Health-Study
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Privacy Protection in HealthGrid: Distributing Encryption Management Over the VO

Erik TORRES a,b,1, Carlos DE ALFONSO b, Ignacio BLANQUER b, Vicente HERNÁNDEZ b
a Centro Nacional de Bioinformática, Cuba
b Universidad Politécnica de Valencia – ITACA, Spain
Abstract. Grid technologies have proven to be very successful in tackling challenging problems in which data access and processing is a bottleneck. Notwithstanding the benefits that Grid technologies could have in health applications, the privacy leakages of current DataGrid technologies, due to the sharing of data in VOs and the use of remote resources, compromise their widespread adoption. Privacy control for Grid technology has become a key requirement for the adoption of Grids in the healthcare sector. Encrypted storage of confidential data effectively reduces the risk of disclosure. A self-enforcing scheme for encrypted data storage can be achieved by combining Grid security systems with distributed key management and classical cryptography techniques. Virtual Organizations, as the main unit of user management in Grids, can provide a way to organize key sharing, access control lists and secure encryption management. This paper provides programming models and discusses the value, costs and behavior of such a system implemented on top of one of the latest Grid middlewares.2
Keywords. Privacy of medical data, encrypted storage, medical grids, HealthGrid
1 Corresponding Author: ertorser@doctor.upv.es
2 This work is partially funded by the Spanish Ministry of Science and Technology in the frame of the project Investigación y Desarrollo de Servicios GRID: Aplicación a Modelos Cliente-Servidor, Colaborativos y de Alta Productividad, with reference TIC2003-01318.

1. INTRODUCTION

Grid technologies have proven to be very successful in tackling challenging problems in which data access and processing is a bottleneck. The benefits of Grid-based applications in health are clearly identified [1, 2], since medical applications usually deal with large distributed data sets which must be considered at a global level (e.g. in epidemiology studies). HealthGrids, as Grids for healthcare, bring new tools, procedures and resources for patient-customized therapy and epidemiological studies, improving clinical decisions and diagnoses for better patient care. However, the development of HealthGrids, regardless of the success
of prototypes and trials (DataGrid, EGEE, GEMSS), is slow, mostly due to the legal constraints on medical data. Security in public networks, and in Grids in particular, carries several risks, since users in a VO normally share data access rights. Moreover, medical users must trust the protection at the remote site, where users who hold administrator privileges can directly access the data. The ability to implement adequate confidentiality and privacy control in a HealthGrid is both an ethical issue, affecting patient care, and a matter directly affecting the outcome of medical and clinical research, as discussed in [3, 4]. Securing the privacy of confidential data stored on a Grid element remains an unsolved problem. Storing medical data in encrypted form considerably reduces the risk of disclosure. A scheme for storing and accessing encrypted data on Grid storage, without compromising data sharing, has been proposed recently [5]. That work encourages the use of a Shamir secret sharing scheme for dividing a single key between several key servers. A Shamir secret sharing scheme is a means for N parties to carry shares or parts of a message, called the secret, such that any subset of k of the shares determines the secret. This scheme is said to be perfect because no proper subset of fewer shares leaks any information regarding the secret [7]. The present work describes an implementation of such an architecture on gLite, one of the latest middlewares for Grid computing. A model for the distribution of key shares and a model for revoking permissions are proposed as part of the implementation. Both models are completely consistent with existing Grid technologies and Grid security policies. Access control to both encrypted objects and decryption keys is managed in a Virtual Organization Membership Service (VOMS) environment [6]. Furthermore, VOMS enables the adoption of the authentication and delegation mechanisms provided by the Grid Security Infrastructure (GSI). Finally, we develop a methodology for the replication of key administration services. This methodology, as well as the synchronization of replicas, depends on the target environment. The next section of this paper describes the technologies used (encryption and decryption, key shares, permission revocation, key replication and integration with the gLite data management system). Section 3 describes the testbed, the test cases used and the results in terms of the evaluation of security and performance, and Sections 4 and 5 present the conclusions and acknowledgements.
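Since the methods below build directly on this scheme, a minimal, illustrative Python sketch of Shamir's (k, N) secret sharing over a prime field may help; the prime, the secret and the parameters are toy values, not those of the implementation described in this paper.

```python
import random

P = 2**127 - 1  # a Mersenne prime, large enough for a toy 128-bit secret

def split(secret, k, n):
    """Sample a random degree k-1 polynomial with f(0) = secret; shares are (x, f(x))."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def rebuild(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return secret

shares = split(123456789, k=2, n=3)   # any 2 of the 3 shares suffice
assert rebuild(shares[:2]) == 123456789
```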
2. METHODS

2.1. Encryption and Decryption

The data and the key share locations are stored within each single encrypted file. The data is encrypted using the AES cipher provided by the Bouncy Castle Crypto package. This cipher is operated in CBC mode with 128-bit keys (192- and 256-bit keys are also available). The encrypted file is signed with a Keyed-Hashing for Message Authentication Code (HMAC), using a SHA1 hash function on the output of the AES-CBC cipher. CBC mode not only hides pattern occurrences in the plain data, but also makes it possible to complete both the data encryption/decryption and the message authentication in a single file reading.
All the AES ciphers applied to a single object use a unique key, regardless of the cipher operation mode. The encrypted files share a common structure:
• A header containing the key share locations, the initialization vector for the AES-CBC cipher encrypted with AES, and the HMAC key encrypted with AES-CBC.
• Encrypted blocks of data using AES-CBC.
• The HMAC signature.
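As an illustration of this encrypt-then-MAC layout, the following minimal Python sketch mirrors the structure above; it uses Python's cryptography package rather than the Bouncy Castle Java package used by the authors, stores the IV in the clear, and omits the key share locations for brevity.

```python
import os, hmac, hashlib
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_object(plaintext, aes_key, mac_key):
    # A real header would also record key share locations and would encrypt
    # the IV and the HMAC key, as described above; this sketch keeps them plain.
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(aes_key), modes.CBC(iv)).encryptor()
    ciphertext = enc.update(padded) + enc.finalize()
    tag = hmac.new(mac_key, ciphertext, hashlib.sha1).digest()  # HMAC-SHA1 over the cipher output
    return iv + ciphertext + tag          # header + encrypted blocks + signature

def decrypt_object(blob, aes_key, mac_key):
    iv, ciphertext, tag = blob[:16], blob[16:-20], blob[-20:]
    expected = hmac.new(mac_key, ciphertext, hashlib.sha1).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed: object corrupted or tampered")
    dec = Cipher(algorithms.AES(aes_key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```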
2.2. Key Shares Distribution

In a non-trusted environment, the privacy of data can be ensured by enforcing a secret sharing scheme, distributing key shares across trusted participants [7]. A natural way to define key sharing in a Grid environment is to use Virtual Organizations (VOs) and the Grid information services. The model presented in this article requires the object owner to decide on the trusted VOs for key sharing. This approach ensures that only trusted key servers will be used. The VOs keeping a certain share, as well as all the possible share combinations that rebuild the key, are known during the whole process. Every encrypted object contains a header with the names of the VOs keeping key shares. VOs are responsible both for publishing the key server locations using the Grid information services and for maintaining the replica structure. VOs provide the model with flexibility and portability, ensuring a reasonable security level. An object administrator can distribute key shares over as many trusted VOs as needed, forcing an authorized user to recover k shares, and ensuring that exposing an encrypted object, no matter which data server of the Grid it is located on, requires access to k completely trusted VOs to rebuild the decryption key. In such a scenario, gaining unauthorized access to the system requires compromising the security of at least k key servers held by k completely trusted VOs. A different AES decryption key is randomly generated for each object and translated into an integer. Key shares are computed using the Shamir secret sharing scheme [7]. Shares are distributed among trusted VOs and stored in a MySQL relational database linked to each VO.

2.3. Revoking Permissions

A reliable and functional permission revocation model can be provided in a Grid environment by taking into account three rules:
• Different keys should be associated with different encrypted objects.
• A copy of the Message Authentication Code (MAC) signature of the encrypted object should be kept both in the key servers holding the key shares for the object and in the encrypted object itself.
• Re-encryption of updated objects should use renewed passwords.
Under these conditions, objects will be protected from unauthorized users, even from those who have physical or administrator access to the storage device. Using a different key for each encrypted object ensures that a user who kept a decryption key after permission revocation cannot expose objects other than those to which he or she had access in the past. The integrity of the objects is ensured by cross-validating the copy of the MAC signature stored within the encrypted object against the copies stored in the key servers. Again, compromising the integrity of an object is as hard as compromising at least k key servers. Changing the password of updated objects prevents the exposure of new versions. Permission revocation is the act of updating the information of a user in the VOMS servers. Each service provider, including the key servers, will evaluate new requests using the updated credentials emitted by the VOMS servers.

2.4. Key Administration Services Replication

Replication was implemented on the basis of the MySQL database server replication capabilities [8]. Key servers are completely equivalent from the clients' point of view, so requests can be issued to any replica. Clients are required to access the key servers in a random fashion, to improve load balancing. A good way of ensuring load balancing over the replicas of a VO is to temporarily remove overloaded replicas from the information services, forcing new clients to use unloaded replicas. The gLite middleware provides low-level monitoring services as part of the information services. These services include the ability to feed a consumer with information that carries a timestamp. The client application implemented as part of the system consumes data published by the key servers, keeping up to date with producer (key server) events. In this way, an overloaded key server can transmit a signal to all listening clients indicating that a new replica should be used. Key servers receive clients' requests and validate VOMS proxy certificates against local policies. Once a request is validated, the further steps depend on the nature of the request. A master data server performs insertion and update operations, and replicas perform read operations. For read-only requests a replicated data server is randomly chosen, and the rest is handled by the master. Only one master data server is operated per VO, and all modifying operations on the databases are done by this master. Furthermore, the master data server ensures the synchronization of the replicas. This approach improves response times and load balancing when reading is the main operation, which is in fact the habitual scenario in most health applications.

2.5. Integration with the gLite Data Management System

In the gLite middleware, data functionality is provided by a set of interoperable services. The end-user application only needs to use the gLite Input/Output API in order to access its data. The Grid security services are used indirectly by this API. The client application implemented as part of the system uses the gLite I/O client libraries to access encrypted objects for reading and writing. This client application handles a set of proxy certificates that
confirm that the user is authorized by a trusted authority to access encrypted objects and decryption keys. Proxy certificates should be negotiated beforehand with the VOMS servers.
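Referring back to the second rule of the revocation model in Section 2.3, the integrity cross-validation can be pictured with a short hypothetical sketch; the function and its arguments are illustrative, not part of the actual implementation.

```python
import hmac

def integrity_ok(object_tag, server_tags, k):
    # Accept the object only if at least k key servers hold a MAC copy
    # identical to the signature embedded in the encrypted object itself.
    matches = sum(1 for t in server_tags if hmac.compare_digest(t, object_tag))
    return matches >= k
```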
3. RESULTS AND DISCUSSION

3.1. Testbed

Six desktop PC workstations were used to set up a gLite testbed shared by 3 different VOs. The clients from which the tests were performed were also commodity desktop PC workstations. Clients and servers were connected to a dedicated Fast-Ethernet network. Key servers were implemented as Web Services deployed with gLite. The list of gLite services and modules used is the following:
• gLite Security Utilities.
• gLite R-GMA Server.
• gLite VOMS Server.
• gLite Data Single Catalog (for MySQL).
• gLite I/O Server.
• gLite R-GMA Client (Java API).
• gLite I/O Client.
• gLite UI.
3.2. Test Cases

24 medical images were used in the tests. Sample data were collected from public image repositories. Images were selected considering how representative the image acquisition techniques, the resolution and the storage formats are.

Table 1. Description of the Sample Dataset

Source | Num. of Samples | Max. Sample Size (Mb) | Acquisition | Image Format
Public Health Image Library at Centers for Disease Control and Prevention (CDCP) | 8 | 14.0 | X-Ray, SEM | Tiff
DDSM: Digital Database for Screening Mammography | 4 | 26.0 | Mammography | Tiff
Osiris Medical Imaging Software (3) | 8 | 0.51 | PET, CT | PNG, GIF, DCM
DICOM Sample Image Datasets Web Site | 4 | 0.51 | MRI | DCM

(3) DICOM is the OsiriX original image format. Format conversion was done using XMedCon, an open source medical image conversion utility & library.
3.3. Validation of the Implemented Architecture

A first group of tests was completed in order to validate the implemented architecture. A comparison of an image set before and after encryption-decryption using this system proved that the resulting images are equivalent. No data corruption was observed during the cryptographic processes, and keys were successfully retrieved. The original set includes all test cases. The system response was measured in different authorization scenarios. Expired proxy certificates, as well as invalid certificates, certificates signed by non-trusted Certificate Authorities and valid certificates issued by a VO different from the key server VO, were proved to be rejected by the system. Valid certificates are additionally subject to local access policies.

3.4. Evaluation of the Security Levels and Overhead

A second group of tests was used to evaluate the security levels introduced by the models, and the overhead. The capability of the system to respond to potential security violations was evaluated. Shamir's secret sharing scheme is a threshold scheme where the secret (the decryption keys, in our case) is divided into N shares, of which just k, with k ≤ N, are needed to rebuild it.
3.5. System Tuning

A third group of tests was performed focusing on system tuning. The tests were designed to find the most efficient way of setting up key servers and replicas in order to minimize key distribution and recovery times without compromising security. An optimum key management scheme can be reached by finding a trade-off between security, reliability and usability. When using a (k, N) threshold secret sharing scheme, this trade-off can be found by tuning the values of k and N. Security is the difficulty of reaching key shares without authorization. N = 2k - 1 ensures that more than half of the key servers must participate in key rebuilding. The simplest case is k = 2 shares needed to rebuild the key (with k = 1 there is no key sharing) and N = 3 shares distributed over three different VOs. Reliability is the guarantee that authorized users can always recover decryption keys. Reliability requires key sharing and redundancy. However, the level of redundancy needed for security and for reliability is not necessarily the same. The use of N = k + b shares ensures that keys can still be recovered when no more than b key servers fail. With b = k - 1 we recover the N = 2k - 1 relation, ensuring that the failure of up to half of the key servers will not produce the loss of the secret. Usability depends on the overhead associated with the system. As shown before, the overhead is the sum of the time invested in the cryptographic processes and the time due to the communication related to key composition (or decomposition). Cryptographic processes are done at the client side, and the only way to speed up this part is to provide faster ciphers with the implementation. Communication is therefore the main issue when finding the balance between cost and usability. Communication can be expressed in terms of the shares needed to rebuild the key. In a read-mostly environment, the communication overhead is linearly dependent on k (k key servers contacted and k messages passed to rebuild the key). Optimum overhead is reached using the minimal number of shares k that does not compromise security and reliability. Therefore, the minimal combination is (k = 2, N = 3). A main issue is the system response in overloaded environments, where key servers are busy and denial of service (DoS) could produce key losses. Using the straightforward threshold of three shares with two needed for key rebuilding (k = 2, N = 3), we fixed a distribution with 3 non-replicated key servers and 3 clients continuously producing requests. The configurations are described in Table 2. Two clients request read-only operations, and the third client requests data updates (read and write operations). The effect of adding replicated key and data servers on the response time was measured and is shown in Table 3.
Table 2. Key Servers and Replicas Configurations in Tests

Test | k | N | Key Servers | Data Servers | Data Server Replicas | Total Number of Computers
3-0  | 2 | 3 | 3 | 3 | 0 | 3
4-0  | 2 | 3 | 4 | 3 | 0 | 4
5-0  | 2 | 3 | 5 | 3 | 0 | 5
6-0  | 2 | 3 | 6 | 3 | 0 | 6
3-2  | 2 | 3 | 3 | 3 | 3 | 6
The starting configuration is 3 pairs of key and data servers, and was set up using 3 computers (3-0). Every key-data server pair is operated by a different VO. Key servers and data servers of the same VO were placed on different computers, ensuring that all messages were passed over the network. The configurations 4-0, 5-0 and 6-0 add one computer each. Every new computer adds a replicated key server to one of the existing VOs, so that configuration 6-0 means 2 key servers per VO. Configuration 3-2 starts from 3-0 and, similarly to 6-0, adds 3 computers, but in this case data servers are replicated instead of key servers. In the 3-2 configuration, every single VO has one key server, one master data server and one slave data server replicating the master's write operations. The wall clock time for the completion of 100 operations in each client is reported in Table 3. All tests use the same sample image.

Table 3. Response Times in Seconds and Changes with Respect to the Starting Configuration

         | 3-0   | 4-0    | 5-0    | 6-0    | 3-2
Client 1 | 3.310 | 3.130  | 3.165  | 3.170  | 3.149
Client 2 | 4.181 | 3.966  | 3.967  | 4.222  | 3.866
Client 3 | 5.004 | 4.936  | 4.948  | 5.032  | 4.926
Average  | 4.165 | 4.011  | 4.027  | 4.141  | 3.980
Change   | –     | -3.70% | -3.32% | -0.56% | -4.43%
Observed times differ between clients according to their different workloads and request types, as shown in Table 3 and Figure 1. Client 3 performs object updates, whereas clients 1 and 2 read and decrypt objects. The response pattern was similar for all clients when the number of key servers and database replicas changed. The change associated with the transition from the starting configuration 3-0 to a new configuration can be expressed as the percentage ratio between the time difference and the time observed with the starting configuration:

Change = 100 (t_final - t_3-0) / t_3-0

For instance, for configuration 4-0 the change is 100 (4.011 - 4.165) / 4.165 = -3.70%.
Negative values mean improvements in response times, whereas positive values mean performance losses. The values can be used to measure the magnitude of these changes.

Figure 1. Response Times per Client (time in seconds for client1, client2 and client3 under configurations 3-0, 4-0, 5-0, 6-0 and 3-2).
In general, the response time is reduced as the number of key servers increases (see Figure 2). There appear to be no simple criteria for the improvement, and there is a saturation value. The configurations 4-0 and 5-0 are nearly equivalent, improving response times by around 3%. On the other hand, the improvement observed in configuration 6-0 is much less evident (0.5%); indeed, clients 2 and 3 show longer times for this configuration than with 3-0. This result shows how carefully replication must be implemented. The last computer added to the configurations has poor performance and a small amount of RAM installed, and these facts made the response times of the whole system worse. Intuitively, an increasing number of computers in the system should lead to shorter delays. In practice, however, there are further essentials that should be observed when tuning the system in order to improve response times.

Figure 2. Average Response Times (average time in seconds for each test configuration 3-0, 4-0, 5-0, 6-0 and 3-2).
An increase in the number of data server replicas also reduces the response time, as shown by the transition between configurations 3-0 and 3-2. Although key updates involve an additional delay for data server synchronization, the times observed for this configuration are the best in the experiment. On the other hand, configuration 3-2 also uses 6 computers, like 6-0, but its response times are considerably lower. In configuration 6-0, the probability of contacting the key server placed on the last computer is almost the same as the probability of contacting the slave data server placed on the same computer under configuration 3-2. The difference therefore lies in the cost of the services: the operation of a key server carries an additional penalty of approximately 0.2 seconds compared to the operation of a slave data server. This study reflects the fact that the estimation of the optimum number of key and data server replicas depends on the requirements of the environment and on the particularities of the available network and hardware. Although the configuration used (k=2, N=3) is the minimal rational threshold, some cases may need a larger number of key shares. This study represents a methodology and a starting point for future configurations, providing an idea of the behavior of the system. Response times will probably show larger differences in heavily loaded environments, for example with more clients issuing requests at the same time. Such cases will demonstrate even more clearly the importance of finding an optimal key management configuration. The test environment presented in this study is highly heterogeneous, with a large number of state variables. The intention of this paper is to simplify the scenario and to present an advantageous distribution model. However, heterogeneity is an inherent condition of genuine Grid environments, and it is on this sort of system that the presented models should be constructed and evaluated.
4. CONCLUSIONS AND FUTURE WORK

One of the most outstanding problems in Grid adoption is the accomplishment of the functional and computational requirements that applications present to the Grid. Nevertheless, the architectural and programming constraints that the Grid imposes on applications are not always well understood. The present work discusses a cost-effective implementation of the architecture for encrypted storage of medical data on a Grid proposed in [5], extended with key distribution and permission revocation models that ensure a high level of security and quality of service. The proposed encrypted object storage format enhances the security and the scalability of the architecture. Security is enhanced by means of a mechanism for guaranteeing the authenticity and the integrity of the encrypted objects using MAC signatures. This mechanism turns the architecture into a self-enforcing scheme. The storage of the key share locations inside the encrypted objects reduces the dependence of the system on
centralized services, improving the scalability and helping to avoid single points of failure. The models were validated in a real Grid environment using real authorization scenarios. The use of key servers distributed over trusted VOs is a natural way of linking the architecture to the Grid security mechanisms. On this basis, data privacy rests on the selection of trusted and reliable key servers. Future efforts should address the performance of the key servers and client applications. A low-cost mechanism for cross-linking MAC signatures should be designed. Security matters concerning Grid authorization models, for example local policy consolidation mechanisms, should continue to be studied in the future.
5. REFERENCES

[1] I. Blanquer, V. Hernandez, S. Lloyd, R. McClatchey, J. Montagnat, H. Bilofsky, T. Solomonides, I. Castro, B. Claerhout, J. Herveg, "Healthgrid White Paper", on behalf of the White Paper Collaboration Group, http://whitepaper.healthgrid.org/
[2] V. Hernandez, I. Blanquer, "The Grid as Healthcare Provision Tool", Methods of Information in Medicine, 2005, vol. 44: 144-148. ISSN 0026-1270.
[3] B. Claerhout, G. De Moor, "From Grid to HealthGrid: Introducing Privacy Protection", in Proceedings of the First European HealthGrid Conference, p. 226-233 (Jan 16th-17th, 2003). Document from Information Society Technologies, European Commission.
[4] N. Zhang, A. Rector, I. Buchan, Q. Shi, D. Kalra, J. Rogers, C. Goble, S. Walker, D. Ingram, P. Singleton, "A Linkable Identity Privacy Algorithm for HealthGrid", Stud Health Technol Inform. 2005;112:234-45.
[5] L. Seitz, J.M. Pierson, L. Brunie, "Encrypted Storage of Medical Data on a Grid", Methods of Information in Medicine 2005; 44(2):198-201.
[6] R. Alfieri, R. Cecchini, V. Ciaschini, L. dell'Agnello, Á. Frohner, A. Gianoli, K. Lörentey, F. Spataro, "VOMS: An Authorization System for Virtual Organizations", in Proceedings of the 1st European Across Grids Conference, 2003.
[7] A. Shamir, "How to Share a Secret", Communications of the ACM, volume 22, pages 612–613, 1979.
[8] I. Gilfillan, "Database Replication in MySQL", Database Journal, May 18, 2004.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Secured Distributed Service to Manage Biological Data on EGEE Grid

Christophe BLANCHET a,1, Rémi MOLLON a and Gilbert DELÉAGE a
a Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1; IFR128 BioSciences Lyon-Gerland; 7, passage du Vercors, 69007 Lyon, France
1 Corresponding author: Institut de Biologie et Chimie des Protéines (IBCP UMR 5086), 7 passage du Vercors, 69007 Lyon, France; Christophe.Blanchet@ibcp.fr

Abstract. Biological data are usually published and thus become public; they then do not need to be isolated or encrypted. In some cases, however, these data stem from patients, or are analyzed with, for instance, pharmaceutical or agronomic goals. More simply, data often have to be kept confidential before they become public, while researchers have not yet been able to publish their work or to register the data. So there are many cases where the integrity and the confidentiality of biological data have to be protected against unauthorized access. As these private data are also large datasets, such as those produced by complete genome projects, they need high-throughput computing and huge data storage to be processed. These requirements are heightened in the context of a Grid such as EGEE, where the computing and storage resources are distributed across a large-scale platform. We have developed a secured distributed service to manage biological data on the grid: the EncFile encrypted file management system. We have deployed it on the production platform of the EGEE grid project, thus providing grid users with a user-friendly component that does not require any user privileges. We have also integrated it into a bioinformatics grid portal together with encrypted representative biological resources: world-famous databases and programs.
Keywords. Biological data, grid computing, secured data, encryption
Introduction

Bioinformatics needs high-throughput computing and huge data storage to understand datasets such as those produced by complete genome projects [1][2]. But these data may be linked to patients, and used in scientific or industrial processes such as drug design and gene function identification. These use cases require a certain level of confidentiality and integrity to preserve patient privacy or patent secrecy. Already important in a local computing context such as a supercomputer or a cluster, these requirements are exacerbated in the context of a Grid such as EGEE, where the computing and storage resources are distributed across a world-wide platform. Biomedical applications are pilot applications in the EGEE project and manage a devoted virtual organization: the "biomed" VO. Biomedical science thus has specific security requirements, such as an electronic certificate system, fine-grained access to data, encrypted storage of data and anonymity. The certificate system provides biomedical entities (such as users, services or Web portals) with a secure and individual electronic certificate for
authentication and authorization management. One key quality of such a system is the capacity to renew and revoke these certificates across the whole grid. Biomedical applications also need fine-grained access (with Access Control Lists, ACLs) to the data stored on the grid: biologists and biochemists can then, for example, share data with colleagues working on the same project in other places. These data also need to be gridified with a high level of confidentiality, because they can concern patients or sensitive scientific and industrial experiments. The solution is then to encrypt the data on the Grid storage resources, but to provide authorized users with transparent, unencrypted access. In this paper we describe the security requirements of biomedical applications regarding file encryption in the EGEE project. We give some examples of biological databases and applications for protein sequence analysis that are a representative model for encrypted file use cases. We present the model we have built for encrypted-file management, and the system we have deployed and are using today on the production platform of the EGEE grid, over the LCG2 middleware.
1. Biological data and protein sequence analysis applications

Biological data and bioinformatics programs both have special formats and behaviors, which are highlighted especially when they are used on a distributed computing platform such as a grid [3]. Biological data represent very large datasets of different natures, from different sources, with heterogeneous models: protein three-dimensional structures, functional signatures, expression arrays, etc. There are currently more than 700 biomolecular databases available [4]. These worldwide databases are available on the Internet through, for example, HTTP or FTP transfers. Moreover, biological data are not static: every day, new data are published and existing ones may be updated, as with Swiss-Prot, a world-famous protein database [5]. Thus, biological databases need to be updated periodically, with updates classified as major or minor releases. But scientists using these databases need to access them in exactly the same way as before the last release [6]: under the same filename, under the same index in a database management system, and so on. Bioinformatics experiments use and produce numerous, very different methods and algorithms to analyze whole biological datasets, which are available to the community [7]. For each scientific domain of bioinformatics, there are very often several different high-quality programs available for computing the same dataset in as many ways. Most bioinformatics programs are not adapted to distributed platforms. Certainly one of the largest disadvantages is that they access data only through the local file system interface, both to get the input data and to store their results, and that these data must be unencrypted to be read. In Figure 1, a simple model, centered around the program, describes the kind of I/O of most bioinformatics programs. The data processed during the computation can be provided by the user or extracted from databanks such as those described in Table 1. The input data can be (i) one or more parameters, (ii) user data, and (iii) databanks. The program always receives input data through the standard command line and file system interfaces. The main differences between user data and databases are the size of the data and their location on the distributed platform. The user data are transferred at job submission from the user interface to the remote computing node, while databases are
available at public uniform resource locators (URLs), which can be FTP, HTTP, or specific to the distributed platform.
Figure 1. Algorithm-centered model of protein sequence analysis data and software. In this simple execution model, the user gives as input to the algorithm the suitable kinds of data and files, including large reference databases, and gets back the files containing the results of the biological analysis and/or a subset of the input database.
Among the numerous available bioinformatics programs, we have chosen several for their representativeness: FastA [8], SSearch [9] and BLAST [10]. We have taken them as models throughout this work because of their special requirements for data input and output. Indeed, they need access to a large dataset, namely protein sequence databases, as a reference to compute their analyses. They also produce large datasets as the result of this computation, which can be a subset of the input protein sequences. This sub-database can then be pipelined to other bioinformatics programs.

Table 1. Bioinformatics Resources Deployed on the EGEE Grid

Resource | Grid Descriptor
Swiss-Prot | lfn://genomics_gpsa/db/swissprot/swissprot.fasta
and BLAST indexes | lfn://genomics_gpsa/db/swissprot/swissprot.fasta.phr
 | lfn://genomics_gpsa/db/swissprot/swissprot.fasta.pin
 | lfn://genomics_gpsa/db/swissprot/swissprot.fasta.psq
TrEMBL | lfn://genomics_gpsa/db/trembl/trembl.fasta
PROSITE | lfn://genomics_gpsa/db/prosite/prosite.dat
 | lfn://genomics_gpsa/db/prosite/prosite.doc
ClustalW | ESM tag "genomics_gpsa_clustalw"
SSearch | ESM tag "genomics_gpsa_ssearch"

Examples of biological databases and bioinformatics programs we have registered and deployed onto the EGEE grid. Database files have been encrypted and registered as logical files in the replica manager system, with their own logical filenames (LFN, lfn://), and programs with a tag of the experiment software manager (ESM tag).
2. Grid computing

The Grid computing concept defines a set of information resources (computers, databases, networks, instruments, etc.) that are integrated to provide users with tools and applications that treat those resources as components within a "virtual" system [19][20][21]. A grid middleware provides the underlying mechanisms necessary to create such systems, including authentication and authorization, resource discovery, network connections, and other kinds of components.

2.1. The European EGEE grid

The Enabling Grids for E-sciencE project (EGEE [11]), funded by the European Commission, aims to build on recent advances in grid technology and to develop a service grid infrastructure. The EGEE consortium involves 70 leading institutions in 27 countries, federated in regional Grids, with currently a combined capacity of 17,000 CPUs and 5 petabytes of storage. The platform is built on the LCG-2 middleware, inherited from the EDG middleware developed by the European DataGrid Project (EDG, FP5 2001-2003). The LCG-2 middleware is based upon the Globus toolkit release 2 (GT2) and the Condor middleware. A new middleware, gLite [11], is being developed to improve the performance and the services provided by the future EGEE platform. The EGEE middleware provides grid users with a "user interface" (UI) to launch a job. The main commands to run a job are: job submission (edg-job-submit), progress report (edg-job-status) and results download (edg-job-get-output). There are several important components in the EGEE grid. The "workload management system" (WMS) is responsible for scheduling jobs on the platform; its scheduler (or "resource broker") determines where and when to compute a job, (i) using one "computing element" (CE) near one "storage element" (SE) containing the data in the case of simple jobs, or (ii) several CEs and SEs in the case of larger jobs. A computing element is a gatekeeper to a cluster of several CPUs, the worker nodes (WN), managed by a batch scheduler system. The "information system" centralizes all the parameters published by the grid components (CPUs, storage, network, etc.). Another key component of a grid, especially when you plan to run bioinformatics applications, is the "data management system" (DMS).

2.2. Distributed storage on the EGEE grid

The "data management system" (DMS) on the EGEE grid is a key service for our bioinformatics applications. Using it efficiently is synonymous with a good distribution of our protein sequence analysis applications. The main component involved in this mechanism is the storage element, which stores files on different kinds of media: disk or tape. These data may have metadata attached, with access restricted to the system or allowed to the application/user. Following this design, the data manager in EGEE has different functionalities: providing the user with commands for data registration and replication, and sharing data information with the WMS so that jobs are scheduled near the given data. On the EGEE grid, the replica manager system provides the user with data management functionalities such as data registration (lcg-cr), data replication (lcg-rep) and data deletion (lcg-del). All these commands are available through a command line
interface (CLI) or through an application programming interface (API). The data are stored as files on the storage elements. These files are identified in a logical namespace with a GUID (Grid Unique IDentifier) and an LFN (Logical File Name) that point to local occurrences on the SEs: the SFN (Storage File Name), an absolute path on the given storage element. Although there are tools to get the SFN from an LFN or a GUID, there is no automatic substitution mechanism on the application command line. A legacy bioinformatics application launched on the EGEE grid will not be able to access remote data stored on SEs, unless we download the data to the given worker node before executing the program.
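The staging workaround just mentioned can be sketched as a hypothetical worker-node wrapper; the tool name is a placeholder, and the exact lcg-cp invocation (flags and URL forms) may differ between middleware releases.

```python
import os, subprocess, tempfile

def run_on_lfn(program_argv, lfn):
    # Materialise the grid file on the worker node, then run the legacy
    # tool on the local copy; the lcg-cp arguments are indicative only.
    local = os.path.join(tempfile.mkdtemp(), "input.dat")
    subprocess.run(["lcg-cp", "--vo", "biomed", lfn, "file:" + local], check=True)
    try:
        subprocess.run(program_argv + [local], check=True)
    finally:
        os.remove(local)

# Hypothetical usage with one of the LFNs registered in Table 1:
run_on_lfn(["some_legacy_tool"], "lfn://genomics_gpsa/db/swissprot/swissprot.fasta")
```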
3. Biomedical requirements regarding file encryption

Biomedical applications are pilot applications in the EGEE project and have strong requirements regarding data security, integrity, authentication and authorization. One key point is encrypted storage, to guard against storage system failure or cracking, and against malicious system administrators. Indeed, nowadays nobody can affirm that their computer, or their computer network, is totally secure. Moreover, in a context as distributed as the grid, a lot of system administrators take part in grid management, so it would be too strong an assumption to trust all of them. But encryption needs cryptographic keys, and these also have to be stored somewhere. If they were kept on a centralized key server, the problems related to storage elements would only be moved to this server. Moreover, end users must not be able to get the cryptographic keys, even if they are authorized to access the corresponding file. Another requirement is to use the bioinformatics programs as "black boxes", which means that their source code cannot be modified. The main reasons are that the programs are too numerous, and their source code is very often not available. The consequence is that it is the encrypted file management system that has to be adapted to the applications, and not the opposite. In other words, access to encrypted gridified files must be completely transparent for biomedical applications. Users then do not need keys, so granting them access to such information is useless and could endanger the system if one of them distributed these keys to others, intentionally or not. An acceptable solution is therefore to store files in an encrypted format, to not reveal the corresponding keys to users, and to not centralize the keys on a unique server, which would be vulnerable to system failure or compromise. The last requirement is that the system must work with the current production platform of the EGEE project, the one using the LCG-2 middleware. One of the EGEE workgroups is developing an encrypted file management system that partially fulfils these biomedical requirements. However, accesses to gridified files, encrypted or not, are not transparent for applications: they have to use an API. Moreover, cryptographic keys are currently stored in a single key-store, which is thus a single point of failure of the system, and a very interesting target for an attacker. Finally, this system is available on the pre-production platform of the EGEE project, the one built upon the gLite middleware, and not on the production platform. Its advantage is that it inherits the ACL support of the gLite middleware, a functionality that is not available on the production platform for the moment.
4. EncFile, securing biological data on the Grid

The Grid context is very distributed, and so data security is a tough problem. Files are stored on a specific component, the storage elements, but also on worker nodes for computation, where system administrators can read them at that moment. To deal with this problem, we encrypt sensitive biological files on the EGEE storage elements. We use the AES algorithm (Advanced Encryption Standard, FIPS 197) with 256-bit keys. Indeed, AES has good security properties and performance: decrypting a 200 MB file takes 22 seconds on a computer with two Pentium III 1.0 GHz processors and 1 GB RAM. Each file is encrypted with a different key stored in a key service (see Figure 2). Grid services are distributed, mainly to bring fault tolerance properties to the platform, but also to require only limited trust in system administrators. We have applied this distribution principle as well and have used, to store keys, the M-of-N technique from Shamir's secret sharing algorithm [14]. We split a key into N shares, each stored on a different server. To rebuild a key, exactly M of the N shares are needed; with fewer than M shares, it is impossible to deduce even a single bit of it. In this way, we can consider that the key servers do not need to be secured any more than a server installed with good practices: an attacker needs to compromise at least M servers in order to be able to reconstruct the key, and then decrypt the encrypted data. The EncFile system is composed of the N key servers and one client (see Figure 1). Each key server is a PostgreSQL server storing one of the N shares of every key in its tables. The EncFile client performs the decryption of the file, and is the only component able to rebuild the keys and guarantee their confidentiality. It must be almost impossible for users to retrieve them.
Figure 2. Integration of the encrypted file management system in the EGEE platform (UI: user interface, CE: computing element, SE: storage element, WN: worker node, DB: database, LFN: logical file name).
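A hypothetical sketch of the client-side key rebuild may clarify the M-of-N behaviour: the client queries key servers until M shares have been gathered and only then interpolates the secret. The server API here is invented for illustration, and rebuild() stands for standard Lagrange interpolation at x = 0, as in any Shamir implementation.

```python
def fetch_key(file_guid, servers, m, rebuild):
    # rebuild: callable performing Lagrange interpolation over m shares.
    shares = []
    for server in servers:               # in practice, try servers in random order
        try:
            shares.append(server.get_share(file_guid))  # mutually authenticated call
        except ConnectionError:
            continue                     # tolerate up to N - M unreachable servers
        if len(shares) == m:
            return rebuild(shares)
    raise RuntimeError("fewer than M shares reachable; key cannot be rebuilt")
```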
The transfer of the keys between the servers and the client is very important. We secure it by using the OpenSSL library with encryption and mutual authentication. In order to determine user authorization, the client uses the user's grid certificate to authenticate. Nonetheless, to prevent a malicious person from creating a fake EncFile client (e.g. to retrieve keys), a double authentication is done, once with the specific certificate of the EncFile client, and once with the user's. It is important to note that this system enforces data confidentiality and integrity, but it does not protect files from a malicious attacker who succeeds in erasing them. For that, it is necessary to implement a replication mechanism, which ensures that files are always replicated on a minimum number of storage elements.
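As an illustration only, a mutually authenticated, encrypted connection of the kind described here could look as follows with Python's ssl module; the host name, port, file paths and one-line request protocol are all hypothetical, and the real EncFile client uses the OpenSSL library directly.

```python
import socket, ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_verify_locations("ca-bundle.pem")                 # trusted CAs (e.g. grid CAs)
ctx.load_cert_chain("client-cert.pem", "client-key.pem")   # the client authenticates too

with socket.create_connection(("keyserver.example.org", 8443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="keyserver.example.org") as tls:
        tls.sendall(b"GET_SHARE file-guid-1234\n")          # toy request protocol
        share = tls.recv(4096)
```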
Figure 3. Architecture of the EncFile client integrated into Perroquet. It catches the local I/O calls and forwards them over the network to the remote storage element. Several transfer protocols are available, such as FTP, HTTP and especially GSI-FTP, which is authenticated, encrypted and fully compatible with the EGEE grid middleware, components and services. Perroquet recognizes EGEE LFNs, like lfn://genomics_gpsa/db/Swissprot.fasta, and encrypts and decrypts data "on the fly".
As explained previously, most bioinformatics programs are only able to access their data through the local file system interface, and to work only on plain data. To answer these two strong issues, we have combined the EncFile client and the Parrot software [15]. This EncFile client (called Perroquet in Figure 2), which acts as a launcher for applications, catches all their standard input/output calls and replaces them with equivalent remote calls handling several remote file protocols. Perroquet is based on Parrot for catching the I/O calls, but we have modified it to make it compliant with the EGEE production platform and able to work with the logical file names (see Table 1) of our biological resources on the EGEE grid. Currently the supported protocols are http, ftp, gsiftp and lfn. Moreover, to encrypt and decrypt files, we have integrated an EncFile client into it. This enables on-the-fly decryption, so decrypted file copies are not needed. This has two main consequences: (i) a higher security level, because decrypted file copies could endanger the data; (ii) better performance, because files are not read twice, once to copy locally and once to decrypt. Thus, the EncFile client permits any application to transparently read and write encrypted (or not) remote files as if they were local and in plaintext. The EncFile system is used to secure sensitive user data on the EGEE resources for use by the selected protein sequence analysis programs FastA, SSearch and BLAST. Programs running on the grid worker nodes can transparently store files in the replica management system of EGEE (RMS) with EncFile, and access them, as well as other remote files (see Figure 3).
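The on-the-fly decryption idea can be pictured with a conceptual sketch (the real client is built on Parrot and is not a Python API): a file-like wrapper decrypts each block as it is read, which is what CBC makes possible in a single pass. Padding and MAC verification are omitted for brevity.

```python
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

class DecryptingReader:
    """Decrypt an AES-CBC stream as it is read; no plaintext copy on disk."""
    def __init__(self, raw, aes_key, iv):
        self._raw = raw                                   # any binary file-like object
        self._dec = Cipher(algorithms.AES(aes_key), modes.CBC(iv)).decryptor()
    def read(self, n=4096):
        chunk = self._raw.read(n)
        # Decrypt each chunk as it arrives; finalize once the source is drained.
        return self._dec.update(chunk) if chunk else self._dec.finalize()
```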
Figure 4. Download time on the EGEE platform of a 500,000-sequence file (around 200 Mbytes), comparing the use of the lcg-cp and Perroquet programs with plain and encrypted files. In a), the file is downloaded from a remote storage element (not the near-SE) to a given worker node. In b), the file is downloaded from the near-SE of the given worker node and decrypted on the fly.
We have run some benchmarks with the selected biological data and protein sequence algorithms to study the effect of the cryptographic mechanism on file access. The performance of the EncFile system was quite good (see Figure 4). Indeed, despite the cryptographic overhead (the cost of manipulating encrypted data, i.e. the decryption overhead, was estimated over large files of 200 MB), the jobs using the EncFile client to access encrypted files are faster than the same jobs using the middleware commands and plain-text files. And when we use Perroquet, the time difference between jobs with encrypted files and those with plain-text files is very close to zero. Comparing a) and b) in Figure 4 shows that getting and decrypting the 200 MB file takes about 3 times longer when the file is fetched from an arbitrary storage element instead of the near storage element, which underlines the importance of an efficient replication mechanism for the encrypted biological data that will be used by bioinformatics programs on the EGEE grid.
5. Integration of secured biological data into a bioinformatics grid portal

The GPS@ grid portal (Grid Protein Sequence Analysis, http://gpsa.ibcp.fr) is an integrated bioinformatics portal devoted to protein sequence analysis on the EGEE grid [1][2]. This grid portal is the grid port of the Network Protein Sequence Analysis portal (NPS@, [7]). The GPS@ portal provides the biologist with a user-friendly interface to the EGEE grid resources, both computing and storage. GPS@ acts as a high-level grid user interface, hiding from the biologist the mechanisms of the EGEE workload management system. The steps involved in the execution of bioinformatics jobs on the grid infrastructure are numerous and complex. A grid user can submit a job through the command line interface on an EGEE user interface; we have integrated all the calls to the grid middleware, including those getting the status of the submitted job, so the GPS@ user only has to submit. The bioinformatics algorithms and biological databases have been distributed and registered on the EGEE grid, and GPS@ runs its own EGEE interface to the grid. The given biological databases have been distributed on EGEE with the lcg commands: lcg-cr to register a database file and to copy it onto the grid. Although the replica management system has no file tree, we have decided to create one by giving logical filenames according to the model lfn://genomics_gpsa/db/<database>/<file> (see Table 1). These LFNs have been replicated with the lcg-rep command on several sites, without applying any particular replication model. Bioinformatics programs have also been integrated into the grid services (see Table 2) with the same goals, but in a slightly different way. We have used the experiment software management service (ESM), which permits distributing packages to remote sites of the grid with the agreement of the site administrator. We have also defined a namespace to distinguish our bioinformatics applications from those of other fields. Our ESM tags follow the model genomics_gpsa_<algorithm>.
Table 2. Example of Bioinformatics Algorithms Gridified into the GPS@ Portal

Algorithm | Class | Input databank | Gridified
BLAST | Similarity | Sequence | EGEE
FASTA | Similarity | Sequence | EGEE
SSearch | Similarity | Sequence | EGEE
ClustalW | MSA | no | EGEE
Multalin | MSA | no | EGEE
PattInProt | Pattern/Profile | Sequence, Pattern, profile | EGEE/DIET
Pfsearch/pfscan | Profiles | Sequence, Pattern, profile | EGEE
GOR4 | PSSP | no | EGEE
SIMPA96 | PSSP | no | EGEE
SOPMA | PSSP | no | EGEE

(PSSP: Protein secondary structure prediction; MSA: Multiple Sequence Alignment)
Both the deployed LFNs and the ESM tags play a key role in the scheduling of our future jobs. Indeed, jobs submitted with an LFN and/or an ESM tag in the job description file will be sent according to a match-making between these symbols and the free sites hosting a physical replica of them. All the described biological databases, even though they are public and therefore not confidential, have been gridified on the EGEE grid in an encrypted form for the purpose of the demonstration. We have then applied to these encrypted biological databases all the gridified algorithms available on the GPS@ portal (see Table 2). All these world-famous legacy programs, such as BLAST, ClustalW or pfsearch, are only capable of local file access, on plaintext, non-encrypted files. With the help of the EncFile system, the Perroquet client has provided these bioinformatics jobs with remote and encrypted data, as if they were local and unencrypted.

6. Conclusion

We have deployed several representative biological datasets in an encrypted state onto the worldwide EGEE grid. These resources have been registered with canonical logical names, and are thus available to all biologists and bioinformaticians participating in the EGEE "biomed" virtual organization. We have developed the EncFile encrypted file management system and deployed it on the production platform of the EGEE project, thus providing grid users with a user-friendly component that does not require any user privileges. As a demonstration, we have integrated this secured data service into the GPS@ Web portal on the EGEE production platform. Users of the Web portal have been able to submit standard bioinformatics analyses with world-famous legacy applications on encrypted biological data in a transparent way. Future work will address the adaptation of our EncFile system to other distributed systems, as EncFile is not tied to the EGEE grid components. This will bring access to other biological resources stored in different storage systems, and to other bioinformatics methods integrated into other distributed platforms.
7. Acknowledgements
This work was partially supported by the European Union (EGEE project, ref. INFSO508833). The authors thank Douglas Thain for interesting discussions about the Parrot tool.
References
[1] Jacq, N., Blanchet, C., Combet, C., Cornillot, E., Duret, L., Kurata, K., Nakamura, H., Silvestre, T., Breton, V.: Grid as a bioinformatics tool. Parallel Computing, special issue: High-performance parallel bio-computing, Vol. 30 (2004).
[2] Breton, V., Blanchet, C., Legré, Y., Maigne, L., Montagnat, J.: Grid Technology for Biomedical Applications. In: M. Daydé et al. (eds.): VECPAR 2004, Lecture Notes in Computer Science 3402 (2005) 204–218.
[3] Desprez, F., Vernois, A., Blanchet, C.: Simultaneous Scheduling of Replication and Computation for Bioinformatic Applications on the Grid. ISBMDA 2005, Lecture Notes in Computer Science 3745 (2005) 262–273.
[4] Galperin, M.Y.: The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research 33 (2005).
[5] National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health.
[6] Bairoch, A., Apweiler, R.: The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27 (1999) 49–54.
[7] Combet, C., Blanchet, C., Geourjon, C., Deléage, G.: NPS@: Network Protein Sequence Analysis. TIBS 25 (2000) 147–150.
[8] Perriere, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L., Deleage, G.: Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Res. 31 (2003) 3393–3399.
[9] Pearson, W.R.: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. PNAS 85 (1988) 2444–2448.
[10] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147 (1981) 195–197.
[11] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215 (1990) 403–410.
[12] Enabling Grids for E-sciencE (EGEE): www.eu-egee.org
[13] Foster, I., Kesselman, C. (eds.): The Grid 2: Blueprint for a New Computing Infrastructure (2004).
[14] Vicat-Blanc Primet, P., d'Anfray, P., Blanchet, C., Chanussot, F.: e-Toile: High Performance Grid Middleware. Proceedings of Cluster'2003 (2003).
[15] Shamir, A.: How to share a secret. Communications of the ACM 22(11) (1979) 612–613.
[16] Thain, D., Livny, M.: Parrot: an application environment for data-intensive computing. Scalable Computing: Practice and Experience 6 (2005) 9–18.
[17] Desmedt, Y., Jajodia, S.: Redistributing secret shares to new access structures and its applications. Technical Report ISSE TR-97-01, George Mason University, Fairfax, VA (1997).
Part III
Bioinformatics on the Grid
Demonstration of In Silico Docking at a Large Scale on Grid Infrastructure
Nicolas JACQ a,b, Jean SALZEMANN a, Yannick LEGRE a, Matthieu REICHSTADT a, Florence JACQ a, Marc ZIMMERMANN c, Astrid MAAß d, Mahendrakar SRIDHAR c, Kasam VINOD-KUSAM c, Horst SCHWICHTENBERG d, Martin HOFMANN c, Vincent BRETON a
a Laboratoire de Physique Corpusculaire, Université Blaise Pascal/IN2P3-CNRS UMR 6533, France
b Communication & Systèmes, CS-SI, France
c Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Germany
d Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Simulation Engineering, Germany
Abstract. WISDOM stands for World-wide In Silico Docking On Malaria. As a first step toward enabling the in silico drug discovery pipeline on a grid infrastructure, this CPU-consuming application, which generates large data flows, was deployed successfully on EGEE, the largest grid infrastructure in the world, during the summer of 2005: 46 million docking scores were computed in 6 weeks. The proposed demonstration presents the submission of in silico docking jobs at a large scale on the grid. The demonstration will use the new middleware stack gLite developed within the EGEE project.
Keywords: Large scale deployment; grid infrastructure; in silico docking; drug discovery; malaria
Malaria is a dreadful disease affecting 300 million people and killing 1.5 million every year [1]. Drug resistance has emerged for all classes of antimalarials except artemisinins, which illustrates the urgent need for new drugs against neglected diseases. The World Health Organization has declared its intention to support the development of new drugs against neglected diseases, starting from hits proposed by fundamental researchers. Advances in combinatorial chemistry have paved the way for synthesizing large numbers of diverse chemical compounds. There are thus millions of chemical compounds available in laboratories and in 2D and 3D electronic databases, but it is nearly impossible, and very expensive, to screen such a high number of compounds experimentally by high throughput screening (HTS). Besides the high costs, the hit rate in HTS is quite low, about 10 to 100 per 100,000 compounds when screened on targets such as enzymes [2].
An alternative is high throughput virtual screening by molecular docking, a technique which can screen millions of compounds rapidly, reliably and cost-effectively. Screening millions of chemical compounds in silico is a complex process: depending on its structural complexity, each compound can take from a few minutes to hours to screen on a standard PC, so screening all compounds in a single database can take years. Computation time can be reduced very significantly with a large grid gathering thousands of computers [3,4].
WISDOM (World-wide In Silico Docking On Malaria) is a European initiative to enable the in silico drug discovery pipeline on a grid infrastructure. Initiated and implemented by the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI) in Germany and the Corpuscular Physics Laboratory (CNRS/IN2P3) of Clermont-Ferrand in France, WISDOM has deployed a large scale docking experiment on the EGEE [5] infrastructure. Three goals motivated this first experiment. The biological goal was to propose new inhibitors for a family of proteins produced by Plasmodium falciparum. The biomedical informatics goal was the deployment of in silico virtual docking on a grid infrastructure. The grid goal was the deployment of a CPU-consuming application generating large data flows, in order to test the grid operation and services. Relevant information can be found at http://wisdom.eu-egee.fr and http://public.eu-egee.org/files/battles-malaria-grid-wisdom.pdf.
The first large scale docking experiment ran on the EGEE grid production service from 11 July 2005 until 19 August 2005. It produced over 46 million docked ligands, the equivalent of 80 years of computation on a single PC, in about 6 weeks; in silico docking is usually carried out on classical computer clusters, resulting in around 100,000 docked ligands. This type of scientific challenge would not be possible without the grid infrastructure: 1700 computers were used simultaneously in 15 countries around the world. WISDOM demonstrated how grid computing can help drug discovery research by speeding up the whole process and reducing the cost of developing new drugs to treat diseases such as malaria. The sheer amount of data generated indicates the potential benefits of grid computing for drug discovery and, indeed, other life science applications. Commercial software with a server license was successfully deployed on more than 1000 machines at the same time.
First docking results show that 10% of the compounds of the database studied may be hits. Top scoring compounds possess basic chemical groups such as thiourea, guanidino and aminoacrolein core structures. The identified compounds are non-peptidic, low molecular weight compounds.
Future plans for the WISDOM initiative are, first, to process the hits again with molecular dynamics simulations in a grid environment. A second data challenge, planned for the fall of 2006, is also under preparation to improve the quality of service and the quality of usage of the data challenge process on gLite.
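As a rough plausibility check of the throughput figures quoted above (ours, not the authors'), assuming the pool of 1700 machines was constant over the whole run:

```python
# Figures quoted in the text.
cpu_years = 80.0    # single-PC equivalent of the computation
machines = 1700     # computers used simultaneously
wall_weeks = 6.0    # observed duration of the data challenge

ideal_weeks = cpu_years * 52.0 / machines   # perfectly parallel, no overhead
efficiency = ideal_weeks / wall_weeks

print(f"ideal duration : {ideal_weeks:.1f} weeks")  # ~2.4 weeks
print(f"grid efficiency: {efficiency:.0%}")         # ~40%; the rest is lost to
                                                    # scheduling, failures and data movement
```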
In silico docking at a large scale is a first step toward enabling the in silico drug discovery pipeline on a grid infrastructure against neglected diseases. Key issues in the pharmaceutical process include cost and time reduction in drug discovery and development, security and data protection, fault-tolerant and robust services and infrastructure, and transparent, easy-to-use interfaces. With the help of the grid, such large-scale in silico experimentation is possible.
The aim of the demonstration is to show the submission of in silico docking jobs at a large scale on the grid using the Taverna environment [6]. The user will prepare the jobs with a protein target, a compound database, docking software and a set of parameters, submit them to the grid, check their status and receive their output. Demonstration conditions will be similar to a large scale submission of jobs, to reduce the time necessary for the in silico docking process. The demonstration will use the new middleware stack gLite [7], Lightweight Middleware for Grid Computing, developed within the EGEE project.
References
[1] J. Wiesner, R. Ortmann, H. Jomaa, M. Schlitzer: New antimalarial drugs. Angew. Chem. Int. Ed. 42 (2003) 5274–5293.
[2] R.W. Spencer: High-throughput virtual screening of historic collections on the file size, biological targets, and file diversity. Biotechnol. Bioeng. 61 (1998) 61–67.
[3] A. Chien, I. Foster, D. Goddette: Grid technologies empowering drug discovery. Drug Discovery Today 7 Suppl 20 (2002) 176–180.
[4] R. Buyya, K. Branson, J. Giddy, D. Abramson: The Virtual Laboratory: a toolset to enable distributed molecular modelling for drug design on the World-Wide Grid. Concurrency Computat.: Pract. Exper. 15 (2003) 1–25.
[5] F. Gagliardi, B. Jones, F. Grey, M.E. Bégin, M. Heikkurinen: Building an infrastructure for scientific Grid computing: status and goals of the EGEE project. Philosophical Transactions: Mathematical, Physical and Engineering Sciences 363 (2005) 1729–1742; EGEE homepage: http://public.eu-egee.org
[6] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, P. Li: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17) (2004) 3045–3054, doi:10.1093/bioinformatics/bth361.
[7] gLite homepage: http://glite.web.cern.ch/glite/
A Gridified Protein Structure Prediction System “Rokky-G” and its Implementation Issues
Shingo MASUDA a, Minoru IKEBE a, Kazutoshi FUJIKAWA a, Hideki SUNAHARA a
a Graduate School of Information Science, Nara Institute of Science and Technology
Abstract. In recent years, simulation using computer systems has been of increasing importance in the life sciences. We have developed a system called “Rokky-G” that facilitates a protein structure prediction strategy called “Rokky” on Grid systems. Rokky-G provides the framework of protein structure prediction on the Grid. In this paper we discuss the architecture of Rokky-G and implementation issues identified in order to obtain highly reliable results. Keywords. Grid Computing, Protein structure prediction, Trial-and-error, Job Priority, Resource allocation
1. Introduction
Protein structure prediction using large computer systems has been of increasing importance in the life sciences in recent years. Protein structure prediction is intended to determine the three-dimensional structure of a protein (known as the “tertiary structure”) from a given amino acid sequence (known as the “primary structure”). The predicted structure is very useful for other life science and medical fields because a protein’s function is closely related to its three-dimensional structure. However, there is no unique prediction algorithm which can find the correct structure for every protein, although several algorithms have been developed. Developing an effective prediction method is therefore a matter of high priority for researchers in the life sciences and medical fields.
Takada et al. have been developing an effective protein structure prediction system called “Rokky” [4]. Rokky is a web-based, fully-automated server that can predict structure from a given amino acid sequence. Although Rokky can provide good prediction results, there are two issues which may be addressed to improve the results. One is that human intervention may be incorporated into the prediction procedure, because it can help with prediction failures. The other is that the current Rokky system only runs on a single computer cluster system. Naturally, we expect that Rokky would provide better prediction results if it could benefit from a Grid environment. In order to address these issues, we have been developing “Rokky-G”, a Grid-oriented protein structure prediction system. With Rokky-G, all simulation programs for structure prediction can be handled as jobs in Grid environments [5]. In developing Rokky-G, we have three design objectives: one is to minimize modification of the original Rokky system.
The second is to enable the straightforward addition of other programs to Rokky-G, and the third is to improve the reliability of results. In this paper, we discuss the implementation methodology of Rokky-G intended to satisfy the above design goals. The rest of the paper is organized as follows. Section 2 describes the original Rokky system. Section 3 describes the architecture of Rokky-G. Section 4 presents and evaluates a prototype implementation of Rokky-G, and finally Section 5 describes related work and concludes the paper.
2. Rokky and its protein structure prediction algorithms
Rokky is a web-based, fully-automated server that can predict protein structure from a given amino acid sequence. Rokky’s performance was benchmarked in a recent world-wide blind test, CASP6 [8], which revealed Rokky to be the second best prediction server among over 50 servers for targets without a template protein. A distinctive feature of Rokky is that the system integrates standard bioinformatics tools and servers with a fragment assembly simulator called “SimFold” [6, 7]. Rokky consists of three simulation programs and several scripts, with a master script which invokes the other scripts and simulation programs. The three simulation programs are PSI-BLAST, 3D-Jury and SimFold, which handle targets of the CM (comparative modelling), FR (fold recognition) and NF (new fold) classes respectively. Rokky performs the following steps automatically. First, Rokky executes PSI-BLAST [1] and produces some results. Next, Rokky invokes 3D-Jury [2]. When results are returned from 3D-Jury, Rokky computes the results of comparative modelling and decides whether or not SimFold should be run. If there are unsolved portions in the target sequence, Rokky executes SimFold. Finally, Rokky combines the output of CM, FR and NF in order to generate the final results.
In Rokky, a user can improve the reliability of results using trial-and-error. SimFold adopts a Monte Carlo/simulated annealing method; in the Monte Carlo method, multiple simulations with different parameters can be executed independently and simultaneously. To get reliable results within a limited time, a user can check the intermediate results during simulation execution and change the parameters in Rokky dynamically.
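The control flow just described can be summarised in a short sketch; the function names are hypothetical stand-ins for the actual tool drivers, but the sequencing follows the steps of this section.

```python
def rokky_predict(sequence, psi_blast, jury_3d, simfold, combine):
    """Sketch of Rokky's automated pipeline; the tool drivers are passed in
    as callables because the real programs are external to this sketch."""
    cm = psi_blast(sequence)            # comparative modelling (CM) candidates
    fr = jury_3d(sequence)              # fold recognition (FR) via 3D-Jury
    models, unsolved = combine(cm, fr)  # decide whether SimFold must be run
    # New fold (NF): fragment-assembly Monte Carlo / simulated annealing on
    # the portions of the target sequence with no usable template.
    nf = [simfold(region) for region in unsolved]
    return models + nf                  # final results merge CM, FR and NF
```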
3. Architecture of Rokky-G
3.1. Design goals
In developing Rokky-G, we have the following three design objectives:
• Minimize modification of the original Rokky programs
• Enable the straightforward addition of other programs
• Enable the user to improve the reliability of results
The first objective is intended to reduce the cost of gridifying the original programs. There are several simulation programs and scripts in Rokky. It is desirable that the original simulation programs and scripts can simply be treated as jobs or tasks in Grid environments.
No complete algorithm has been established for protein structure prediction, so it is sometimes desirable to use other simulation programs to improve the reliability of results. In fact, Rokky previously used “FUGUE” [3] as an FR program. As described in the previous section, executing simulation programs in a trial-and-error manner can often help to improve the reliability of results. Trial-and-error execution of simulation programs should therefore be made straightforward.
3.2. Architecture
In Rokky, several simulation programs and scripts are invoked by the main script, and some parameters and settings are embedded in these scripts. In Grid environments, the computational resources available are not determined in advance; however, simulation programs and scripts must be installed on the computational nodes beforehand, and this makes changing the parameters embedded in scripts problematic. Furthermore, Rokky’s main script depends on the configuration of other programs and computational resources. Considering this problem, we gridify Rokky as follows. First, we develop two middleware components. One is a workflow management system called “Workman”, which replaces Rokky’s main script and provides the functionality to automate the sequential execution of several programs. The other is a job management system called “Jobman”, which enables dynamic computational resource allocation, effective job execution, and more reliable results within a given time.
Generally, gridified programs are reconstructed as services working in a coordinated manner with each other. A service consists of several programs (including some middleware) and computer resources (such as cluster computers). It provides abstract methods that may be called by users or other services. Workman and Jobman are services. All executions are performed by invoking services through their abstract methods. Program control is achieved through the invocation of services, as shown in Figure 1, and a workflow consists of a number of services and their relationships.
Figure 1: A service. Requests to the service arrive at its abstract methods, which run the wrapped programs (Program1, Program2) on the service's computer resources.
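A minimal sketch of this service abstraction, with hypothetical names: a service bundles programs and computer resources behind an abstract method that users or other services (such as Workman and Jobman) may invoke.

```python
class Service:
    """A gridified unit: programs plus computer resources, exposed through
    abstract methods (here a single `run`) that users or other services
    such as Workman and Jobman may invoke."""

    def __init__(self, name, programs):
        self.name = name
        self.programs = programs  # e.g. original Rokky scripts, wrapped unchanged

    def run(self, inputs):
        """Abstract method: feed the inputs through the wrapped programs."""
        result = inputs
        for program in self.programs:
            result = program(result)  # each program consumes the previous output
        return result

# Example: a service wrapping two trivial "programs".
sample = Service("sample", [str.upper, lambda s: s + " [analysed]"])
print(sample.run("mktayiakqr"))  # -> "MKTAYIAKQR [analysed]"
```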
When we gridify Rokky, we construct a service for each algorithm, including its scripts and programs, because it is rare that scientists wish to execute only part of an algorithm. We construct a Rokky-Main service with functionality similar to that of Rokky’s main script. However, the Rokky-Main service doesn’t depend on the configuration of the executing environment or of other services. Whereas in the original Rokky system the computer resource allocation, the location of input/output files and the name of the program to execute are determined statically before execution of a complete prediction sequence, the Rokky-Main service can control these parameters even while the prediction sequence is under execution. The functions of the Rokky-Main service are limited to the following:
• Deciding whether or not it is necessary to execute SimFold
• Combining the output of CM, FR and NF to generate the final results
The following figure shows the architecture of Rokky-G.
Figure 2: The architecture of Rokky-G. A user interface (Workman, Web) sends control and workflow definitions to the Workman execution server and displays status and job information; the execution server consults a service registry and uses Jobman to run the Rokky-G Main computing service and the computing services for BLAST, 3D-Jury and SimFold.
Using Workman, the user can easily define the order of program execution and the relationships among programs. This definition is called the workflow. Workman has the following two components:
1. A workflow design tool
2. A workflow execution server
The User Interface in Figure 2 incorporates the workflow design tool. All jobs are invoked by Jobman, and the workflow execution server uses Jobman to execute programs.
3.3. The service for parallel jobs
In this section, we discuss the service for parallel jobs. In Rokky, SimFold uses a type of Monte Carlo/simulated annealing that requires many parallel jobs.
The Monte Carlo method often requires many execution patterns with different initial values, often generated randomly. Each execution can usually be an independent job, and these jobs can be performed on multiple computers concurrently. When SimFold is executed there are a large number of jobs of this kind. The execution of SimFold is, however, problematic, because the number of jobs necessary to obtain reliable results is indeterminate. The number of jobs required often differs according to the domain, and it is necessary to assign more computational resources to executions which have a large number of jobs running. The best granularity for parallel job services is discussed below. The requirements are as follows:
• Utilize all computer resources
• Enable dynamic computational resource allocation
• Facilitate straightforward service implementation
We can consider services at three levels of granularity. The list below shows what would be performed for a single invocation at each service level.
<1>. Perform executions with all sets of initial values.
<2>. Perform executions with some sets of initial values (which may be specified upon invocation).
<3>. Perform execution with only one set of initial values.
Choice <3> is most appropriate for our requirements.
Figure 3: Services for parallel jobs. With a type <1> service, one invocation performs all executions on a single SimFold service; with types <2> and <3>, a Master service issues many invocations, so job requests from domain 1 and domain 2 can be spread across several SimFold services.
Consider SimFold as an example. The service type with the coarsest granularity is <1>. When we construct SimFold services as type <1>, all SimFold jobs are executed for each invocation, and all processes are usually executed on one cluster system. Frequently, however, computational resources are spread over many bases and we cannot make effective use of them. If SimFold services are constructed as type <2> or <3>, a single invocation of a service doesn’t perform all the jobs corresponding to the initial values; we can therefore distribute jobs over multiple services.
Thus, we can utilize multiple computer systems spread over many bases. There are also usually cases where several domains must be executed concurrently. A type <1> service, however, cannot perform jobs in multiple domains concurrently, because execution of just one domain fills the computational resources of a service with jobs. This situation is shown in Figure 3.
When we construct services of type <2> or <3>, we must construct a “Master” service. The Master service is required to make the lists of initial values and to invoke the SimFold services multiple times automatically. Additionally, the Master service collects all output data and compiles the ultimate result. Of these two types of SimFold service, type <3> is the better choice because the service can be constructed straightforwardly. In fact, type <2> services are a super-set of type <3> services: the multiple-execution functionality of type <2> can be obtained by invoking a type <3> service several times. Whether using type <2> or <3>, the Master service must be able to invoke the SimFold services many times in order to utilize services on multiple bases; if we selected type <2>, we would have to implement similar functions in both the SimFold services and the Master service. In protein structure prediction, it is necessary to perform executions in multiple domains simultaneously. For this reason, the Master service needs the functionality to decide where to assign service executions for each domain. We implement this function in Jobman: the Master service uses Jobman to invoke SimFold services, and implementing the Master service is therefore straightforward. As mentioned above, in constructing a service for parallel program execution we should build two paired services: a calculation service, which executes one process per invocation, and a Master service, which coordinates the calculation service.
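The paired services can be sketched as follows. The names and the energy-based scoring are hypothetical, but the division of labour (one set of initial values per calculation-service invocation, with a Master service fanning out invocations and compiling the result) follows the type <3> design chosen above.

```python
import random

def simfold_once(domain, seed):
    """Type <3> calculation service: one set of initial values per invocation."""
    rng = random.Random(seed)                       # stands in for a real MC/SA run
    return {"domain": domain, "seed": seed, "energy": rng.uniform(-100.0, 0.0)}

def master_service(domain, n_runs, submit):
    """Master service: builds the list of initial values, submits one job per
    value (through Jobman in the real system), and compiles the final result."""
    seeds = [random.randrange(2**31) for _ in range(n_runs)]
    results = [submit(simfold_once, domain, seed) for seed in seeds]
    return min(results, key=lambda r: r["energy"])  # keep the best-scoring run

# Direct, non-grid stand-in for Jobman's job submission:
best = master_service("domain-1", n_runs=8, submit=lambda f, *args: f(*args))
print(best)
```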
4. A prototype implementation
In this section, we explain a prototype implementation of the system and the operation of Jobman and Workman.
4.1. Jobman: the job management system
We developed a prototype version of Jobman with the following functions:
(1) Deciding the computer resource allocation for a job group according to its priority
(2) Changing the computational resource allocation dynamically according to user decisions and/or process reports
Both (1) and (2) regulate the assignment of jobs to services. Jobman knows the number of jobs which can be executed simultaneously on each service. The interactions between Jobman, the Master service and the calculation services for parallel execution are described below.
In order to handle computer resource allocation, Jobman makes groups consisting of multiple jobs. Priorities are assigned to a group and to all the jobs participating in the group. All services are invoked by Jobman. We now explain the operation of Jobman using SimFold as an example. As described in the previous section, the Master service invokes SimFold services through Jobman. First, the Master service sends Jobman a request for the construction of a group. When the Master service sends a job request to Jobman, Jobman doesn’t directly invoke the SimFold services; the Master service must send a “Run group” request to start the jobs. The Master service sends the job requests for all domains, and information about their default priorities, before sending “Run group” requests. Jobman decides the number of jobs which execute simultaneously in each group using this priority information.
For example, suppose there are two SimFold services, named “Sa” and “Sb”, and each service can execute 30 jobs at the same time. Jobman receives requests for two groups, named (Ga) and (Gb), each with 80 jobs. When the priority of each group is the same, Jobman sets the number of concurrent jobs to 30 for each group. In many cases the SimFold services perform differently, so Jobman assigns 15 jobs of (Ga) and 15 of (Gb) to each service to prevent a large inequality from occurring. If the priority of (Ga) is twice that of (Gb), then Jobman sets the number of concurrent jobs in (Ga) to 40, with 20 jobs assigned to each service; the number of concurrent jobs in (Gb) is set to 20, with 10 jobs assigned to each service. In this example, if Sa and Sb have the same computational performance and each job has the same execution time-frame, 40 jobs of (Gb) will not yet have been executed at the point when all jobs of (Ga) finish executing. Jobman sends the decided number of “Run Job” requests to the calculation service, and when Jobman observes that a job has completed, it sends a request for the next job. Since job requests are sent step by step, the SimFold services need pay no regard to job priorities and only have to execute the jobs they are sent.
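The allocation rule of this worked example can be written down directly. The sketch below is ours, not Jobman's actual code, but it reproduces the numbers quoted above.

```python
def allocate(group_priorities, service_slots):
    """Split the total service capacity among job groups in proportion to
    their priorities, then spread each group's share evenly across the
    services (as in the Sa/Sb example above)."""
    total_slots = sum(service_slots.values())
    total_priority = sum(group_priorities.values())
    plan = {}
    for group, prio in group_priorities.items():
        concurrent = int(total_slots * prio / total_priority)  # jobs running at once
        per_service = concurrent // len(service_slots)
        plan[group] = {svc: per_service for svc in service_slots}
    return plan

# Two services with 30 slots each; (Ga) has twice the priority of (Gb).
print(allocate({"Ga": 2, "Gb": 1}, {"Sa": 30, "Sb": 30}))
# -> {'Ga': {'Sa': 20, 'Sb': 20}, 'Gb': {'Sa': 10, 'Sb': 10}}
```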
4.2. Workman: the workflow management system
In this section we describe the workflow management system, Workman. We developed a Workman prototype with the following features:
• Straightforward performance of simulations in a trial-and-error manner by the user
• Reduced total execution time under the trial-and-error method
Workman organizes cooperation among multiple services and executes services in a user-defined sequence. The function of Workman can be explained with a simple example. In the workflow system, we call a service that has been invoked a piece of “work”, and the whole process consisting of several pieces of work is called the “workflow”. Making the workflow is equivalent to defining and ordering the services to be executed. We now consider a case where we want to execute two simulation services. First, we define which services to execute and the execution order. We further define the relationships among input/output data. In this example, the two services have the following definitions, and we input the data shown in Table 1 into the workflow system using a dedicated user interface.

Service A    Name: Protein-CM-sample    Input data: Filename    Output data 1: Filename    Output data 2: Filename
Service B    Name: Protein-NF-sample    Input data: Filename    Output data: Filename

Table 1: Example service definitions.
Picture 1: Workflow user interface
First, we define the work by feeding in the name of each service, and then define the execution order. In the Workman user interface, a single box indicates a piece of work and an arrow shows the order of execution; for example, “A->B” means “execute work B after work A”. We have to define the input data of Service A, but we don’t have to define the output data, because Workman can create the file name automatically. Next, we define the data relationships: if we wish to use Output data 2 from Service A as input data, we define these data with a variable of the same name. At this point Workman may begin the workflow.
After Workman receives a “Start” message, it invokes the “Protein-CM-sample” service through Jobman. Jobman sends a “Run” message, along with the definition of the input data, to Service A. Service A receives a file defining its input data and starts calculating. When Service A finishes calculating, it sends a “Finished” message and the name of its output data to Jobman, which passes these messages on to Workman. Workman copies the output data to user storage. Next, Workman invokes Service B through Jobman; in this case, Output data 2 is used as the input data of Service B, and this file name is generated automatically. Workman thus executes many services automatically.
We now explain how Workman supports the changing of parameters. When a user wants to change parameters while a workflow is running, they must take the following steps:
1. Select the work whose parameters are to be changed from the running workflow, using Workman’s user interface
2. Set the new parameters
3. Instruct Workman to “Restart workflow”
When the user has completed these actions, the workflow is restarted. The advantage of using Workman is that we do not have to consider which services should be restarted or which data should be reused: Workman divides the running pieces of work into those which should be stopped and those which should continue, copies the data which are reusable, and automatically sets the input file name for the first piece of work restarted.
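A sketch of the Service A/Service B example as a workflow definition. The Python stand-ins for Workman and the two services are hypothetical, but the binding of Output data 2 of Service A to the input of Service B mirrors the definitions in Table 1.

```python
def run_workflow(works, storage):
    """Execute the pieces of work in order; output file names are generated by
    the services and copied to user storage, as Workman does (simplified)."""
    for work in works:
        inputs = {k: storage[v] for k, v in work["bind"].items()}  # resolve data links
        outputs = work["service"](inputs)                          # via Jobman in reality
        storage.update(outputs)

def protein_cm_sample(inputs):
    return {"out1": f"cm({inputs['infile']})", "out2": f"cm2({inputs['infile']})"}

def protein_nf_sample(inputs):
    return {"out": f"nf({inputs['infile']})"}

workflow = [
    {"service": protein_cm_sample, "bind": {"infile": "user_input"}},
    # Output data 2 of Service A becomes the input data of Service B:
    {"service": protein_nf_sample, "bind": {"infile": "out2"}},
]
storage = {"user_input": "target.fasta"}
run_workflow(workflow, storage)
print(storage)  # includes 'out': 'nf(cm2(target.fasta))'
```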
5. Conclusion
Several projects and research efforts have focused on the use of Grid systems for bioinformatics. In the myGrid [9, 10] project, a UK e-Science project funded by the EPSRC and involving five UK universities, middleware and tools have been developed for use in the biosciences. Folding@home [11] is a distributed computing project making use of a large number of idle computers around the world; anyone can easily join by downloading and installing a single program, and can then help research into understanding protein folding, mis-folding and related diseases. These projects facilitate the study of proteins with their own distinctive technologies. In our project, we support a trial-and-error method using an original workflow system; furthermore, by using our job management system, scientists can utilize multiple computational resources spread over many bases.
In this paper, we have shown a methodology for gridifying a system for protein structure prediction. We believe that this methodology is applicable to other scientific programs. In addition, we have identified important factors that improve the reliability of results. Our system works effectively, and we have shown that we can construct a system supporting the straightforward acquisition of reliable results.
References
[1] BLAST: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
[2] Ginalski, K., Elofsson, A., Fischer, D., Rychlewski, L.: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 19(8) (2003) 1015–1018.
[3] FUGUE: http://www-cryst.bioc.cam.ac.uk/fugue/
[4] Rokky: http://www.proteinsilico.org/rokky/
[5] Fujikawa, K., Jin, W., Park, S.-J., Furuta, T., Takada, S., Arikawa, H., Date, S., Shimojo, S.: Applying a Grid Technology to Protein Structure Predictor “ROKKY”. Studies in Health Technology and Informatics 112, pp. 27–36.
[6] Fujitsuka, Y., Takada, S., Luthey-Schulten, Z.A., Wolynes, P.G.: Optimizing Physical Energy Functions for Protein Folding. Proteins: Structure, Function, and Bioinformatics 54 (2004) 88–103.
[7] Takada, S.: Protein Folding Simulation With Solvent-Induced Force Field: Folding Pathway Ensemble of Three-Helix-Bundle Proteins. Proteins: Structure, Function, and Genetics 42 (2001) 85–98.
[8] 6th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP6): http://predictioncenter.org/casp6/Casp6.html
[9] Stevens, R., Robinson, A., Goble, C.A.: myGrid: Personalised Bioinformatics on the Information Grid. In: Proceedings of the 11th International Conference on Intelligent Systems in Molecular Biology, 29 June–3 July 2003, Brisbane, Australia; Bioinformatics 19 Suppl. 1 (2003) i302–i304.
[10] myGrid: http://www.mygrid.org.uk/
[11] Folding@home: http://folding.stanford.edu/
Sealife: A Semantic Grid Browser for the Life Sciences Applied to the Study of Infectious Diseases
Michael Schroeder a,1, Albert Burger b, Patty Kostkova c, Robert Stevens d, Bianca Habermann e, and Rose Dieng-Kuntz f
a TU Dresden, Germany
b Heriot-Watt University, Edinburgh, UK
c City University, London, UK
d University of Manchester, UK
e Scionics, Dresden, Germany
f INRIA, Sophia-Antipolis, France
Abstract. The objective of Sealife is the conception and realisation of a semantic Grid browser for the life sciences, which will link the existing Web to the currently emerging eScience infrastructure. The SeaLife Browser will allow users to automatically link a host of Web servers and Web/Grid services to the Web content they are visiting. This will be accomplished using eScience's growing number of Web/Grid services and its XML-based standards and ontologies. The browser will identify terms in the pages being browsed using the background knowledge held in ontologies. Through the use of Semantic Hyperlinks, which link identified ontology terms to servers and services, the SeaLife Browser will offer a new dimension of context-based information integration. In this paper, we give an overview of the different components of the browser and their interplay. The SeaLife Browser will be demonstrated within three application scenarios in evidence-based medicine, literature & patent mining, and molecular biology, all relating to the study of infectious diseases. The three applications vertically integrate the molecule/cell, the tissue/organ and the patient/population levels by covering the analysis of high-throughput screening data for endocytosis (the molecular entry pathway into the cell), the expression of proteins in the spatial context of tissues and organs, and a high-level library on infectious diseases designed for clinicians and their patients. For more information see http://www.biotec.tu-dresden.de/sealife
Keywords. Grid computing, bioinformatics, ehealth, semantic web, text-mining, ontologies
1. Introduction
Currently, much effort is being spent on creating a new computational and data infrastructure to facilitate eScience: the cooperation of geographically distributed organisations which transparently integrate their computational and data resources at a structural and semantic level.
1 Correspondence to: Michael Schroeder, Biotec, TU Dresden, ms@biotec.tu-dresden.de, +49 351 46340060
Progress has been made with standards for grid computing and semantic representations for life science data, with many projects creating a host of grid-enabled services for the life sciences. How can the researcher in the lab benefit from this new infrastructure for science? A technology is needed to transparently bring such services to the desks of scientists.
The Web started with a browser and a handful of Web pages. The vision of eScience, with an underlying Grid and Semantic Web, will only take off with the development of a Semantic Grid browser that gives a user easy access to Grid and eScience resources. The Sealife project is filling this gap by developing such a semantic grid browser. These browsers will operate on top of the existing Web, but they introduce an additional semantic level, thus implementing a Semantic Web. Using ontologies as background knowledge, the browsers can automatically identify entities such as protein and gene names, molecular processes, diseases, types of tissue, etc., and the relationships between them, in any Web document; they collect these entities and then apply further analyses to them using applicable Web and Grid services.
If the user points the mouse at a Semantic Hyperlink, the SeaLife Browser offers a definition of the encountered term, the application of services relevant to the term, and the option to add the term to a shopping cart. After browsing through various pages and adding various terms to the shopping cart, the user decides to check out. The SeaLife Browser presents the contents of the shopping cart, including the list of items collected, the types of the identified terms, and the sources where they were collected by the user. The SeaLife Browser then offers to apply additional services considering combinations of terms. For example, if the user has collected a set of proteins, the browser will offer to compare the proteins' sequences against each other, to create a multiple sequence alignment, or to query the literature for co-occurrences of the proteins. The user can save the current state of the shopping cart and return at a later stage to continue the semantic exploration. To summarise, the SeaLife Browser links the existing Web to the new eScience grid infrastructure, paving the way for a future generation Web for the life sciences.
2. Case studies
To illustrate the power of this vision, consider the following applications in the context of infectious diseases. The applications vertically integrate the molecular/cell, tissue/organ, and patient/population layers, covering everything from high-level information stemming from the national library of infectious diseases down to detailed studies of high-throughput screening data for endocytosis, the entry pathway into the cell.
• Evidence-based medicine: Consider a clinician who consults the national electronic library of infections to get curated and trusted information on infections. The user visits the site and finds an interesting page on hepatitis and its treatment: “Ribavirin with or without alpha interferon for chronic hepatitis C”. Using its background knowledge, the SeaLife Browser identifies hepatitis as a disease and interferon as a cytokine and immunologic factor. With this knowledge the browser automatically offers the user the ability to query Ensembl, in order to learn more about the genetics related to hepatitis, and the Protein Databank, to look at
structures of cytokines. The browser also offers the opportunity to explore the literature further. Via the ontology the browser can either refine searches, looking for interferon type I, for example, or generalise and search for liver diseases, etc.
• Literature and patent mining: Getting a quick overview of a field is vital for companies to stay ahead of their competitors. Browsing a patent database, a researcher comes across the patent entitled “An improved infant formula is described which includes a phospholipid supplement in order to more closely resemble the composition of human milk.”. The SeaLife Browser identifies the term “phospholipid metabolism” and offers the following definition to the user: "The chemical reactions and physical changes involving phospholipids, any lipid containing phosphoric acid as a mono- or diester.". It also identifies human in its taxonomy. The user decides that this is relevant and wishes to learn more about phospholipids. The SeaLife Browser automatically offers the service of showing all human proteins in the UniProt database which are involved in phospholipid metabolism.
As another example, consider a biologist who wishes to know which enzymes are inhibited by levamisole. The researcher visits a traditional literature search engine such as PubMed to find relevant literature. PubMed returns over 100 articles for the query levamisole inhibitor. While the first articles already mention the enzyme alkaline phosphatase, there is only one article, ranked very low, which mentions phosphofructokinase; it is unlikely that the user would find this article. With the SeaLife Browser, the situation is different. With its background knowledge the browser identifies terms such as phosphofructokinase or alkaline phosphatase in the PubMed result page, and with the ontology it can infer that both terms are enzymes. It can then offer the user literature services which categorise the abstracts by enzyme activity, thus giving a direct overview of all the results and more directly answering the researcher's question.
• Molecular biology: Consider a biologist who encounters the statement “Rabaptin 5 interacts with the small GTPase Rab5 and is an essential component of the fusion machinery for targeting endocytic vesicles to early endosomes”. The SeaLife Browser identifies “Rabaptin-5” and “Rab5” as protein names, “endocytosis” as a biological process, and “early endosome” as a cellular component. When the user moves the mouse over "Rab5", the SeaLife Browser offers to search sequence databases for Rab5 proteins. At the same time, it offers to move the protein sequence of Rab5 to a shopping cart. After browsing for some time, the user decides to visit his/her shopping cart and takes a look at the proteins he/she has collected in the web session. The SeaLife Browser now offers to perform a series of services on the protein sequences in the cart. One simple analysis is a domain search of the collected sequences; for Rab5, this results in the identification of a GTPase domain. A multiple sequence alignment is displayed from the domain database, which gives the user an idea about the conserved residues of the respective domain. In cases where a known three-dimensional structure exists for the domain, the user invokes a molecular display tool to visualize the possible fold of his/her protein.
Querying several databases for related information about his/her protein, such as the Online Mendelian Inheritance in Man (OMIM), expression databases like SAGE, or the protein structure database (PDB), gives the user information about
links to diseases, expression levels of his/her protein of interest in several tissues, and available structural data, all at one mouse-click. More sophisticated analysis tools allow the user to perform a sequence database search with his/her protein sequence, in order to retrieve sequences possibly related to his/her protein of interest, or to perform fold recognition with the collected items in the shopping cart, from which he/she could get an idea about the function of the proteins.
3. Aims and Objectives
In these scenarios, there is an obvious reliance on the computer having some notion of domain semantics: what is the relationship between symbols in the language of biomedicine? To achieve the above vision, the following semantic problems need to be solved:
• Ontologies: Design and integration of ontologies and associated infrastructure which can serve as background knowledge for a Semantic Grid browser geared towards life science applications, ranging from the molecular level to the person level.
• Concept Mapping: Bridging the gap between the free text on the current Web and the ontology-based mark-up for the Semantic Web and Grid, by developing automated mark-up modules for free text based on text-mining and natural language processing technologies.
• Service Composition: Bridging the gap between the ontologies of the Semantic Web and the services of the Grid, by linking suitable ontology mark-up to applicable services and by supporting the interactive creation of such mappings for complex services.
3.1. State of the art
Current work in the Semantic Web has concentrated upon the development of languages and infrastructure; few real Semantic Web applications have been made to date. Biology, however, is already well placed to create a Semantic Web for the life sciences, with its large Web presence and growing use of ontologies. There is as yet no transparent, user-facing, easy browser for a Semantic Web or Grid. Stein's vision of a bioinformatics nation1, to bring together the distributed and heterogeneous resources of the bioinformatics community, will rely on such infrastructures as suggested by the Semantic Web and Grid.
In order to deliver a SeaLife Browser for biologists, target data and services are needed, and bioinformatics is already well placed to provide them. Many Web and Grid services are now available, some delivering data formatted according to XML schema descriptions. Such efforts can be seen in myGrid2, HealthGrid3 and the Biomedical Informatics Research Network (BIRN)4. These projects, and others, bring together virtual organisations of computers, data, programmes, instruments and users to collaborate to perform in silico analyses and health care; that is, they form Grids in the bio-health domain. The bioinformatics domain, in particular, is already deploying much of the necessary infrastructure for these projects.
Ontologies are already widely used in describing and analysing biological data. Foremost of these is the Gene Ontology5, together with others in the Open Bio-Ontologies consortium. These provide a common language for describing, amongst other features, the molecular function, biological processes and location of gene products; sequence features; and descriptions of microarray and proteomic experiments. This means large bodies of semantically marked-up data already exist that could be explored by a SeaLife Browser. A growing number of these bio-ontologies are in the OWL format, and the OBO formats can be represented in OWL, suggesting that this markup is in a form accessible to the proposed SeaLife Browser.
As well as data, many Web and Grid services mean that there is now programmatic access to many bioinformatics tools. For example, the Bio-GRID part of EuroGrid6 developed an access portal for biomolecular modelling resources. The Semantic Grid relies on standards, as the original Web relied on HTML and HTTP. The UniGrids project7 developed standard access mechanisms over both Globus and Unicore that are compliant with the Open Grid Services Architecture (OGSA). Many of the services in the projects above use UniGrids output to support computationally intensive services across a Grid of computational resources such as DEISA (Distributed European Infrastructure for Supercomputing Applications)8. DEISA is a consortium of leading national supercomputing centres in Europe that intends to jointly build and operate a distributed terascale supercomputing facility. It is such networks of computational power that will support computationally intensive analyses over eScience and eHealth data.
As well as Grid services, projects such as myGrid use Web services, which are envisaged to come together with Grid services to provide a unified access style. There are already well over 2,000 Web services available in bioinformatics, including sequence searches with BLAST (also available through EuroGrid), the major databases such as the Ensembl database for genetic and disease information, MSD for protein structures, PubMed for biomedical literature, KEGG for metabolic pathways, InterPro for sequence profiles, the EMBOSS suite and many others. Together with Grid services, these offer programmatic access to bioinformatics tools never seen before.
Once available, these thousands of services need to be discoverable and deployable by humans as well as computers. There is a large effort to develop frameworks for semantically describing services through ontologies. Foremost amongst these is the Web Services Modelling Ontology (WSMO) and its markup form, WSML. This creates a template to describe the inputs, outputs, pre- and post-conditions and tasks supported by Web and Grid services. This kind of markup, extended to the biological domain, will enable a SeaLife Browser, and other applications, to discover appropriate tools from semantic markup in a page.
1 Lincoln Stein: Creating a bioinformatics nation: a web-services model will allow biological data to be fully exploited. Nature 417 (2002) 119.
2 http://www.mygrid.org.uk
3 http://www.healthgrid.org
4 http://www.nbirn.net
5 http://www.geneontology.org
6 http://www.eurogrid.org
7 http://www.unigrids.org
8 http://www.deisa.org
9 http://www.biomoby.org
Projects such as myGrid and bioMOBY9 have already explored the use of semantic markup to aid the discovery and composition of Web services in bioinformatics.
Intimately linked to the role of ontologies in Semantic Grids is text-mining. Pages and services need to be marked up with semantic descriptions provided by the background knowledge held in ontologies. In addition, the knowledge captured by ontologies needs to be collected from data resources as well as from human experts. These are the twin roles of text-mining within Semantic Grids. Through techniques such as stemming and part-of-speech tagging, coupled with co-location, text-mining can deliver the terms, their synonyms and the relationships that need to be used within ontologies. Once deployed, a SeaLife Browser needs to identify places of interest within pages that have not yet been marked up, and this will be achieved with text-mining techniques. Finally, the semantic markup of pages will be driven by text-mining tools, since too much data exist for human curation to be relied upon.
Text-mining is already a widely used technique within bioinformatics. Entity recognition, to create dictionaries of gene and gene product names, is well explored. For example, the BioMinT10 project aims to develop a generic text-mining tool that (1) interprets diverse types of query, (2) retrieves relevant documents from the biological literature, (3) extracts the required information, and (4) outputs the result as a database slot filler or as a structured report. The E-bioSci platform11 offers retrieval of full texts and facts from the vast natural-language knowledge base formed by the collected biomedical literature and databases, from sequences to images. Such tools will be an invaluable component for generating both ontologies and markup using those ontologies. Such efforts are already underway in projects such as MMTx12, which maps terms in documents to the UMLS metathesaurus. This provides, in essence, a nascent Semantic Web, but without the browsing tools to exploit the marked-up documents.
Text-mining can also serve to annotate contents in formats such as the Resource Description Framework (RDF). Haystack13 is such an RDF browser, which allows users to view RDF stores and to personalise data, by placing links where they feel the need and configuring the user interface, through links, buttons and actions, to do the job they wish in a particular context. It attempts to enable users to work with information, not applications. Instead of having barriers between, for instance, calendar, email, browser and word processor, metadata allows information to be used in any suitable context of work.
These technologies come together in several bioinformatics projects. The myGrid project has developed a set of middleware components that support the e-Scientist in performing and managing in silico experiments in biology. Web and Grid services provide access to distributed resources, while workflow techniques enable the orchestration of these resources to perform experiments. The myGrid middleware is a toolkit of core components for forming, executing, managing and sharing discovery experiments. The components are intended to be adopted in a “pick and mix” way by developers and tool builders to produce end-user applications.
This state of the art reveals that biomedical informatics is already making great use of Semantic Grid and Web infrastructure and technology.
10 http://www.biomint.org
11 http://www.e-biosci.org
12 http://mmtx.nlm.nih.gov/docs.shtml
13 http://haystack.lcs.mit.edu/
Many projects have concentrated on data generation, data description and other aspects of the domain. A few have brought elements of this infrastructure together in applications. At no point, however, is there the equivalent of a browser that may be used to look at arbitrary resources and exploit their semantic content to perform eScience. Thus Sealife takes the next step towards implementing the vision of eScience for the life sciences.
4. The Sealife Components and their Interplay
As mentioned in Section 3, to implement the vision of Sealife, three problems need to be solved: ontology design and evolution, concept mapping to link the Web to ontology terms, and service composition to apply relevant services.
4.1. Ontologies
At heart, an ontology is a structured set of vocabulary terms and their definitions that captures a community's understanding of its domain. The idea is to create a shared understanding of the symbols (terms) used to communicate in that domain. Thus, the Gene Ontology creates an agreed set of vocabulary terms for describing the major attributes of gene products. However, an ontology is not only a facilitator for human communication: by capturing this knowledge in a knowledge representation language with strict semantics, it becomes possible for machines to manipulate these symbols through the semantics of the language.
The Web Ontology Language (OWL) is the World Wide Web Consortium's recommendation for representing ontologies for the Semantic Web. OWL has a strict semantics, and its description logic version (OWL-DL) can be used for reasoning over the ontology and its instances. Many bio-ontologies, however, are represented in a simpler language that describes a directed acyclic graph (DAG). This allows only minimal machine usage, but it is directly transformable to OWL. The large number of ontologies in this form (all those in the Open Biomedical Ontologies collection) offer a potentially vast body of background knowledge for the SeaLife Browser. Medical ontologies are available in a variety of representations: some are open, and some of these can be mapped into OWL with ease; others, such as the Medical Subject Headings (MeSH), are a simple thesaurus designed for information retrieval and are not readily transformable to OWL automatically. Nevertheless, a large amount of biomedical ontology is already extant for the SeaLife Browser.
Protégé is the most widely used ontology development environment. Its OWL plugin offers a GUI-style interface for building and using OWL ontologies, and Protégé's wide range of plugins makes it a rich environment. SWOOP offers a much lighter development environment, but has considerable debugging facilities. Outside the OWL world, DAGEdit and OBO-Edit are the most widely used tools in bio-ontologies: the former produces the DAG format of the OBO collection, while OBO-Edit, a later development than DAGEdit, offers a richer environment with more modelling constructs.
Protégé, being a more robust and wide-ranging environment than the others, captures more of the principles for building ontologies. These can be split into two broad areas: those that represent a software engineering approach, and those that embody philosophical principles of ontology. The first are guidelines covering requirements/scope, knowledge elicitation, design, conceptualisation, encoding, testing/evaluation and publication; these phases map onto a typical software engineering process, and many tools and Protégé plugins exist for these stages. The philosophical aspects of ontology building represent the debate on what an ontology can and should represent, styles of building, the writing of definitions, etc.
One development principle not yet mentioned is that of re-using ontologies. As already noted, many ontologies exist in biomedicine. Once transformed to a common representation, and thus a common language semantics, they must be either merged into one or mapped to one another, because ontologies can overlap, and these overlaps must be recognised and accommodated. A number of such integration efforts exist within biomedical ontologies. One example is Xspan14, which uses a cross-species ontology of anatomy from embryo stages to adult form. The terms from the various species have to be mapped, and Xspan has developed the COBrA tool to facilitate this mapping.
14 http://www.xspan.org
knowledge elicitation; design/conceptualisation; encoding; testing/evaluation; and publication. These phases map onto a typical software engineering process, and many tools and Protégé plugins exist for these stages. Philosophical aspects of ontology building concern the debate on what an ontology can and should represent, styles of building, the writing of definitions, and so on.

One development principle not mentioned is that of re-using ontologies. As already noted, many ontologies exist in biomedicine. Once transformed to a common representation, and thus a common language semantics, they must be either merged into one or mapped to one another. This is because ontologies can overlap, and these overlaps must be recognised and accommodated. A number of such integration efforts exist within biomedical ontologies. One example is Xspan,14 which uses a cross-species ontology of anatomy from embryo stages to adult form. The terms from the various species have to be mapped, and Xspan has developed the COBrA tool to facilitate this mapping.

4.2. Text-mining

The concepts of the ontologies have to be linked to text in web pages. This task is far from trivial, as the concepts occur in wide variation. The following problems need to be addressed (see the sketch after this list):

• Information content of words: Consider the term alkaline phosphatase activity from the GeneOntology. A query on the literature database PubMed for alkaline phosphatase yields more than twice as many results as alkaline phosphatase activity, and more than ten times as many as "alkaline phosphatase activity". This is particularly striking as the word activity is not very informative: nearly one third of GeneOntology terms end in activity.
• Insertions and deletions of words: An ontology term may consist of several words, which are separated by inserted words in free text. For example, the text ...at a higher rate than freshly isolated monocytes upon activation... should match the GeneOntology concept monocyte activation, and the text ...large family of transcription factors that bind to... should match the term transcription factor binding.
• Stemming: Words such as binding and binds have to be reduced to the stem bind.
• Sentence splitting: Text-mining has to identify sentences as units. This is not trivial: a dot separates two sentences, but it also occurs in abbreviations such as ca., etc. and C. elegans.
• Special characters: Ontology terms often contain special characters such as slashes, commas, brackets and dashes, which have to be treated appropriately. For example, in the term chromatin assembly/disassembly the slash acts as a delimiter between two tokens, while in Arp2/3 complex it does not.
• Ambiguous concepts: Sometimes ontology concepts are not formulated unambiguously. For example, the term small-molecule carrier or transporter has to match both small-molecule carrier and small-molecule transporter.

Sealife's text-mining module addresses these problems and thus maps concepts to text in the web pages.

14 http://www.xspan.org
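The following is a minimal sketch, in plain Python, of two of the matching problems just listed: crude suffix stemming and tolerance of inserted words between the tokens of an ontology term. It is written for this illustration only; Sealife's actual module is of course far more sophisticated.

```python
def stem(word: str) -> str:
    """Very crude stemmer: 'binding' and 'binds' both reduce to 'bind'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def matches(term: str, sentence: str, max_gap: int = 3) -> bool:
    """True if the term's stems occur in order in the sentence, allowing
    up to max_gap inserted words between consecutive term tokens."""
    term_stems = [stem(t.lower()) for t in term.split()]
    words = [stem(w.lower().strip(".,;()")) for w in sentence.split()]
    start, window = 0, len(words)        # first token may occur anywhere
    for ts in term_stems:
        for i in range(start, min(start + window, len(words))):
            if words[i] == ts:
                start = i + 1
                break
        else:
            return False
        window = max_gap + 1             # later tokens: small gap only
    return True

print(matches("monocyte activation",
              "at a higher rate than freshly isolated monocytes upon activation"))
```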
4.3. Service Composition

Once terms have been identified in the SeaLife Browser, they are linked to other resources. A user can, for instance, put a sequence into their Sealife cart. This could be submitted to a service, or a series of services, to perform an analysis. In many cases more than one service will be used, so the following issues have to be addressed:

• Services will have to be discovered. Many thousands of services now exist. Currently these are described only by their names, which are not necessarily informative. Efforts to describe these services semantically will reduce this barrier for both people and machines. What should be described? The following are some axes of description:
∗ Input;
∗ Output;
∗ Task performed by the service;
∗ Service name;
∗ Algorithm used;
∗ etc.
• Once discovered, how are the services to be composed? Here the following issues arise (a sketch follows this list):
∗ In many cases, bioinformatics services are implicitly typed: a service takes a string as input and gives a string as output. There is often much structure within one of these strings (for instance, a Uniprot record). Services are needed to impose a type locally on these strings in order to compose them.
∗ A minority of services take input and produce output as structured XML documents. Again, a variety of XML schemas exist, so typing services are still needed; nevertheless, the XML syntax of such input/output documents makes the process easier.
∗ A variety of typical type operations, such as access and coercion, are needed in order to compose services.

An open system such as myGrid faces more of these problems than a closed system. In a closed system it is easier to impose a type system, but this places a barrier before third-party services joining the system. The SeaLife Browser will of necessity be open, so poorly typed services will be endemic. Composition of services will therefore be part of the SeaLife Browser solution.
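The typing problem can be made concrete with a small sketch: two stand-in services that exchange flat strings are composed by locally imposing a type on the intermediate value. All service functions below are invented placeholders, not real bioinformatics APIs.

```python
from dataclasses import dataclass

@dataclass
class UniprotRecord:          # the locally imposed type
    accession: str
    sequence: str

def fetch_record(accession: str) -> str:
    """Stand-in for a 'stringly typed' service returning a flat record."""
    return f"AC P12345\nSQ MKTAYIAKQR".replace("P12345", accession)

def parse_record(raw: str) -> UniprotRecord:
    """Shim service that imposes structure on the string output."""
    fields = dict(line.split(" ", 1) for line in raw.splitlines())
    return UniprotRecord(accession=fields["AC"], sequence=fields["SQ"])

def run_blast(sequence: str) -> str:
    """Stand-in for the next service, which only needs the sequence."""
    return f"pretend-BLAST-report-for:{sequence}"

# Composition only works once the parsing shim sits between the services.
record = parse_record(fetch_record("P12345"))
print(run_blast(record.sequence))
```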
5. Existing Prototypes

To realise the Sealife browser, the following key components already exist.

5.1. Ontology Editors, Evolution and Design

In creating the ontology background knowledge for the SeaLife Browser, it will be necessary to transform ontologies into the OWL format. The Gene Ontology Next Generation
(GONG) project15 developed a methodology for migrating from the simple DAG used by GO to the rich descriptions that are possible in OWL. After transforming the DAG to OWL (a simple mapping), the source ontology is already usable in the SeaLife Browser. It is possible, however, to make the source ontology even more useful by migrating it towards a property-based description. Many of the OBO ontologies have much of their definition implicit in the class name or term. For instance, "glucose biosynthesis" is a chemical term followed by a process name. These can be made explicit in OWL, and in GONG the mapping is done using regular expressions to match term styles and then generate the implicit OWL definition. When combined with appropriate supporting ontologies, OWL and a reasoner are able to find many implicit subsumption relationships (on average, one in ten classes had a missing subsumption relationship). GONG is now available as a Protégé plugin and will form a component of the SeaLife Browser infrastructure.

One use of the richer GeneOntology resulting from GONG was to guide annotation. By inter-linking the three ontologies of function, process and location, it is possible for a machine to know, for instance, which molecular functions are involved in a particular biological process. By reasoning over the relationships between GeneOntology classes, it is possible to narrow the range of other ontological terms to those that are sensible to use (for instance, in annotation). A similar process can clearly be used in the SeaLife Browser: rich interlinked ontologies, together with a reasoner, can guide a user through the web of science delivered by the SeaLife Browser.

5.2. GoPubMed

GoPubMed16 is an ontology-based literature search engine which has indexed over 15,000,000 PubMed abstracts with GeneOntology terms. This system will be a key component underlying Sealife's text-mining module. GoPubMed, as shown in Fig. 1, allows users to explore PubMed search results with the GeneOntology. GoPubMed submits a user's query to PubMed, retrieves the recommended articles, extracts GeneOntology terms from these abstracts, and then displays the part of the GeneOntology covering the extracted terms. This induced ontology can then be used to display the articles from the result set that mention a specific GeneOntology term, including its synonyms or children.

With this approach, GoPubMed goes beyond classical search and allows users to answer questions. Consider the following example. A researcher wants to know which enzymes are inhibited by levamisole. A keyword search for levamisole inhibitor produces well over 100 hits in PubMed. With GoPubMed, these hits can systematically be explored for enzyme activities. As shown in Fig. 1, the user can click on molecular function and then catalytic activity, which reveals that the result set contains cyclases, transferases, isomerases, hydrolases, lyases, small protein conjugating enzyme activity, and oxidoreductases. Following the most frequently mentioned enzymes, the user learns that many papers mention alkaline phosphatase. The user can also find less obvious facts, such as a single paper on phosphofructokinase activity listed among the transferases, which indeed confirms that levamisole inhibits tumor phosphofructokinase.

15 http://gong.man.ac.uk
16 http://www.gopubmed.org
Figure 1. User interface of GoPubMed. On the left, the part of the GeneOntology relevant to the query is shown; on the right, the abstracts for a selected GeneOntology term. Clicking on a term in the tree displays the papers that have been annotated with that term.
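The drill-down behaviour just described can be caricatured in a few lines: given GO terms extracted from abstracts, group the abstracts under each term so that a user can move from a term to the papers mentioning it. The term extraction below is naive substring matching and the data are invented; GoPubMed's real pipeline is far richer.

```python
from collections import defaultdict

go_terms = ["catalytic activity", "alkaline phosphatase activity",
            "phosphofructokinase activity"]

abstracts = {
    "PMID:111": "levamisole is an inhibitor of alkaline phosphatase activity",
    "PMID:222": "levamisole inhibits tumor phosphofructokinase activity",
}

# Group papers under every GO term they mention (the "induced ontology").
papers_by_term = defaultdict(list)
for pmid, text in abstracts.items():
    for term in go_terms:
        if term in text.lower():
            papers_by_term[term].append(pmid)

# Display only the part of the ontology covered by the result set.
for term, pmids in sorted(papers_by_term.items()):
    print(f"{term}: {', '.join(pmids)}")
```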
5.3. myGrid

myGrid offers a range of Grid-enabled services in a pick-and-mix style. An application builder can take a variety of these services and use them within their application. The SeaLife Browser will be such an application and could use the following services (a sketch of the wrapping idea follows this list):

• A workflow enactor, Freefluo, can take a workflow described in the XScufl language and run it against external, distributed services.
• myGrid is capable of using any third-party Web Service. A registry of services can be used like a library catalogue to find and then retrieve Web Services.
• SOAPLab is an application for automatically wrapping command-line applications as Web Services. In a discipline such as bioinformatics, with many legacy programmes and the use of the command line as a rapid development route, such an application is vital.
• A notification service, with a subscription mechanism, can be used to notify users and applications of changes in the status of services.
• A provenance service records the web of science generated from workflows. This is itself a Semantic Web that the user can browse via the SeaLife Browser.

These services and more can offer support for the activity envisaged within the SeaLife Browser.
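The idea behind SOAPLab-style wrapping can be sketched generically: expose a legacy command-line program as a callable function, so callers never see the command line. This uses only the Python standard library and is not SOAPLab's actual mechanism.

```python
import subprocess

def wrap_command(executable, *fixed_args):
    """Return a function that runs the command on stdin data and
    returns its stdout, hiding the command line from callers."""
    def service(input_text: str) -> str:
        result = subprocess.run(
            [executable, *fixed_args],
            input=input_text, capture_output=True, text=True, check=True)
        return result.stdout
    return service

# Hypothetical usage: a standard Unix tool wrapped as a 'service'.
sort_service = wrap_command("sort", "-u")
print(sort_service("b\na\nb\n"))   # -> "a\nb\n"
```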
6. Conclusion

The SeaLife Browser will make eScience's web servers and services available to bench scientists by using text-mining to identify ontology terms in free text and by linking the ontology terms to applicable services. The SeaLife Browser thus introduces
the novel concept of semantic hyperlinks, which are generated on the fly and use the browser's background knowledge to link web pages dynamically to relevant services. The key technical challenges of the system are the design of ontologies, text-mining for concept mapping, and service composition. For all three aspects there are existing systems and results, such as the ontology migration tool GONG, the ontology-based literature search engine GoPubMed, and the bioinformatics grid system myGrid. These will form the backdrop for the realisation of the SeaLife Browser, which will be applied to the study of infectious diseases, ranging from the patient and clinician, exemplified by the National electronic Library of Infectious Diseases,17 to molecular biologists studying endocytosis.
Acknowledgements Funding by the EU project FP6-2006-IST-027269 is kindly acknowledged.
17 http://www.neli.org.uk
Advancing the Russian ChemBioGrid by Bringing Data Management Tools into a Collaborative Environment

Alexey Zhuchkov a, Nikolay Tverdokhlebov b,1, Alexander Kravchenko c
a Telecommunication Centre "UMOS", Russia
b Institute of Chemical Physics, Russia
c Moscow State Technical University of Electronics and Mathematics, Russia
Abstract. Virtual organizations of researchers need effective tools to work collaboratively with huge sets of heterogeneous data distributed over a HealthGrid. This paper describes a mechanism for supporting Digital Libraries in a high-performance computing environment based on Grid technology. The proposed approach makes it possible to assemble heterogeneous data from distributed sources into integrated virtual collections by using OGSA-DAI. The core of the concept is a Repository of Meta-Descriptions: sets of metadata which define personal and collaborative virtual collections on the basis of virtualized information resources. The Repository is kept in the native XML database Sedna and is maintained by Grid Data Services.

Keywords. HealthGrid, OGSA-DAI, metadata, virtual collections, digital libraries
Introduction

Total computerization has transformed data gathering into an industrial process in many sciences. Pharmacology, biotechnology and the chemistry of natural polymers (enzymes, proteins, etc.) supply terabytes of data. Scientific and clinical results populate corporate and public Web data warehouses immediately after an investigation ends, or even flow directly from automatic devices that might better be called data factories (mass spectrometers, DNA sequencers, etc.). Grid technology already provides many successful solutions for massive computational data processing. Searching for data in cyberspace, however, has become far from straightforward, and there is an urgent need for a new generation of tools that operate with data and data warehouses largely automatically [1]. Another side of the problem originates in the necessity of collaborative work on biomedical experimental data. The situation, and the social value of the problem, require a new generation of information technology that gives Virtual Organizations (VO) of experts adequate abilities to handle huge data sets collaboratively. The following problems should be solved, and we believe they can be solved:

1 Corresponding author: Nikolay Tverdokhlebov, Institute of Chemical Physics, Kosygina 4, Moscow, 119991, Russia; E-mail: rgrid@umos.ru.
• There is too much data to be analyzed and processed.
• Data are heterogeneous and distributed.
• Biomedical data are very sensitive to unauthorized access.
• Virtual organizations need to share an integrated information space.
• Successful research requires establishing sets of logical links among heterogeneous data.
• Linked data should be organized into virtual collections, while the latter should be assembled into digital libraries.

Fortunately, an appropriate information technology already exists: the Grid and its data-oriented subfield OGSA-DAI [2]. This middleware supports the exposure of data resources, such as relational or XML databases, onto the Grid. The OGSA-DAI software includes components for querying, transforming and delivering data in different ways, using Globus Grid middleware as a base. Hence, Globus and OGSA-DAI middleware provide the basic functionality for collaborative work with distributed heterogeneous information resources: user authentication and authorization, data allocation in distributed warehouses, and data delivery by queries.

Previous research and software development provide Ontology Grid-services as effective tools for linking distributed biomedical data [3]. On this base we can now build services that provide higher-level functionality, namely assembling heterogeneous distributed data into personal and collaborative Virtual Collections (VC) and then into Digital Libraries (DL), in the framework of the BiblioGrid project (www.rgrid.ru/bibliogrid). The BiblioGrid project aims to develop a set of methods and tools which provide VOs of experts with opportunities to manipulate Information Objects (IO) by means of a set of services. VCs, as well as DLs, provide opportunities to accumulate personal and corporate knowledge as a result of collaborative work in biomedical research. Thus, Grid and OGSA-DAI provide the operational environment, while the VC and DL services support ontology construction, metadata production and intelligent processing of queries to distributed information resources.
1. Information Objects, Virtual Collections and Digital Libraries

VOs are considered here as dynamic unions of users, resources and services. The VO clearly defines policies of safety, resource access and mutual obligations [4]. Any user must be a member of some VO in order to access Grid resources, and BiblioGrid follows this approach completely. Users must hold certificates issued by a certification centre that is, in turn, recognised by the resource centres. A user can belong to several different VOs and hence may hold several different certificates.

IO, VC and the Repository of Meta-Descriptions (RMD) are the core concepts of BiblioGrid. An IO aggregates content data, its metadata and the activities which provide access to the data resource. The content data of an IO should be allocated in an OGSA-DAI data resource (a relational or XML database, a Web site, a file, etc.), whilst the corresponding set of metadata, in extended METS format [5], is stored in the RMD as an XML file. The activities of the IO are also stored in the RMD. This means that the RMD stores XML descriptions of the parameters of the activities, whilst the activities themselves are an intrinsic part of the Grid Data Service (GDS), because they are in fact Java classes which implement the interfaces of these activities. Hence, the XML description of an IO activity can be considered a sort of metadata. The main requirements on the IO structure are that it should be flexible and expandable. There is a special sort of
metadata – system metadata – whilst descriptive and structural metadata are considered parts of the content. Note that in the current implementation of BiblioGrid an IO is tied to exactly one real data resource.

The RMD is an XML database which stores the meta-descriptions of IOs. An advantage of this solution is that only the standard tools of OGSA-DAI are needed. Besides, it means that the RMD can be distributed, and any accessible resources of the Grid segment can be used to store a part of the RMD. Additionally, this solution makes it possible to involve miscellaneous non-Grid information resources (e.g., databases outside the Grid segment) in VCs, and also simplifies interoperability between Grid segments.

A VC is a set of IOs that are logically tied. To build a personal VC, the user constructs a subject-oriented ontology in which every concept is an IO, whilst the ontology links represent how the concepts are logically tied. A collaborative VC is one in which some parts are created by several different users who have access rights to change the corresponding ontology. In fact, a VC can be considered a sort of IO, even though the VC can aggregate IOs based on different data resources. This is possible because the VC is truly virtual: a user may have no information about where the real content data are allocated or which real services provide access to them. Moreover, all the information that constitutes a VC is stored in the RMD in the same way as an IO (i.e., as an XML file in extended METS format). Thus a VC becomes an IO, a brick with which to build a VC of a higher level.

A DL in BiblioGrid is a suite of subject-oriented instrumental tools and organizational measures which supports the collaborative usage of distributed heterogeneous cyber-resources by VOs of experts. Obviously, a DL, like a traditional library, aggregates not only tools to process information (storing, keeping, searching, accounting, etc.) but also information resources. However, there are two important differences. The DL keeps virtual objects, so there is usually no real warehouse that belongs to the DL; nevertheless, DL tools provide accounting of the usage of any resources, information and computational alike. Another difference (and an advantage) is that in a DL the unit (atom) of storage is an IO instead of a whole book, as in traditional libraries. This makes detailed, intelligent search available throughout all accessible information resources of any sort. The Grid additionally makes it possible to use, in the search process, appropriate computational resources available to VO members.
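For concreteness, the sketch below assembles an illustrative meta-description for an IO with the Python standard library. The dmdSec/amdSec/fileSec sections are genuine METS section names, but the inner elements are simplified stand-ins for the extended METS structure the RMD actually stores.

```python
import xml.etree.ElementTree as ET

mets = ET.Element("mets")
dmd = ET.SubElement(mets, "dmdSec")              # descriptive metadata
ET.SubElement(dmd, "title").text = "Epitope collection, hepatitis C"
amd = ET.SubElement(mets, "amdSec")              # administrative metadata
ET.SubElement(amd, "owner").text = "VO DNA-vaccines"
files = ET.SubElement(mets, "fileSec")           # where the content lives
ET.SubElement(files, "file", location="xmldb://node7/epitopes")  # invented URI

print(ET.tostring(mets, encoding="unicode"))
```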
2. Biomedical Implementation of BiblioGrid Digital Libraries

The reason to implement a DL based on OGSA-DAI and the RMD originated in the strong need for effective tools to work collaboratively with huge heterogeneous data sets within the framework of the Corporative Network "New Generation of Vaccines and Medical Diagnostic Systems" (CN). There are several VOs dedicated to different fields of the subject (e.g., VO "Synthetic Vaccines" and VO "DNA-Vaccines"), which work mainly separately from each other, use different methods for processing information, and build up their own subject-oriented data collections. However, they partially use common computational, communication and storage resources of the CN, especially those which are part of the RGrid segment (www.rgrid.ru). Moreover, many information resources are used by both VOs, and even by a few other VOs of the CN. Last in order but first in value is the fact that all projects being carried out by VOs of the CN are conceived to be logically tied, in order to increase the
efficiency of the research. Some information resources are used jointly whilst being administered by their owners in a federative manner. Taking into account the high sensitivity of biomedical information to unauthorized access, we conclude that Grid technology and OGSA-DAI today provide the best base for supporting the secure cooperative usage of both the information and the computational resources of the CN.

The prototype of the biomedical BiblioGrid is based on cyber-resources of the CN and of the Telecommunication Centre UMOS (TC) as well [6]. The TC supplies BiblioGrid not only with telecommunication resources but also provides the basic middleware (i.e., administers the system Grid services), the Certification Center (CA) and the LDAP server for the biomedical VOs of the CN "New Generation of Vaccines and Medical Diagnostic Systems". Globus Toolkit 4.0.1 is used as the Grid middleware, whilst OGSA-DAI WSRF 1.0 provides the services for accessing distributed information resources. The security system of BiblioGrid is based on the Community Authorization Service (CAS).

Biomedical information resources in BiblioGrid are presently represented by more than 10 personal and collaborative databases which had been accumulated earlier in biological and medical institutions (Institute of Immunology, Institute of Biochemistry, Institute of Virology, etc.). These non-relational databases were translated into XML form in order to give the information resources a uniform structure. Also included were such information resources as the collection of Ph.D. theses in the Russian State Library and personal bibliographic collections gathered in the course of research and development of new vaccines and diagnostic systems. It is also pertinent to mention the relational database "Epitopes of Hepatitis C Virus", which was included in the set of resources because OGSA-DAI provides services for working with relational databases.

The RMD has been deployed on the basis of the open-source XML database Sedna [7]. It runs on both Windows and Linux platforms and is a native, full-featured database management system. The Sedna XML database provides:

• support for all traditional DBMS features (such as update and query languages, query optimization, fine-grain concurrency control, various indexing techniques, recovery and security);
• efficient support for unlimited volumes of document-centric and data-centric XML documents that may have a complex and irregular structure;
• full support for the W3C XQuery language, such that the system can be used efficiently to solve problems from different domains, including XML data querying, XML data transformation and even business-logic computation (in this case XQuery is regarded as a general-purpose functional programming language; a sketch follows this list).
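Since the RMD is searched with standard GDS activities, the queries themselves are plain XQuery. The following sketch, held as a Python constant so it can be handed to whatever activity executes queries, shows the kind of search one might run; the collection name and element structure are assumptions, not the project's actual schema, and the transport to Sedna (via the custom driver described in the next section) is elided.

```python
# Illustrative XQuery against an assumed RMD layout: find the names of
# all Information Objects whose descriptive title mentions 'vaccine'.
FIND_IOS_BY_KEYWORD = """
for $io in collection('rmd')//informationObject
where contains(string($io/dmdSec/title), 'vaccine')
return data($io/@name)
"""
```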
3. Information Objects and Virtual Collections in the Repository of Meta-Descriptions

All interactions between users and BiblioGrid information resources are executed via the basic Grid service designed to control data: the Grid Data Service (GDS). The GDS is in fact the access point for a user. The GDS provides the ability to operate with distributed heterogeneous virtual data sources. Heterogeneity means here that different models of data storage are used (relational, XML, flat file, etc.) as well as different database engines (MySQL, PostgreSQL, Xindice, eXist, etc.). There was no driver in the GDS for working with Sedna
XML databases, so we had to develop the appropriate software in the course of the BiblioGrid implementation. Virtuality of data sources means that the user knows neither where a data source is located nor what access rights are used; all information of this sort is set up by the administrator at the moment a Data Service Resource is deployed and added to the GDS. To operate with a data resource, the user needs to know only the set of available activities of the Data Service Resource (DSR) which describes this data resource. It is important to note here that a DSR should use only standard OGSA-DAI activities; otherwise the virtuality would be broken, because the user would have to know some meta-description of the data resource.

When using OGSA-DAI to construct a VC, we consider an IO to be a set of DSR activities which constitutes an interface for interacting with the data source. In this case, deploying an IO with OGSA-DAI means making a directory whose name is identical to the name of the IO (unique at the present node) and writing down three XML files which describe the configurations of the activity. In order to add an IO to a VC, the name of the IO should be added to the list of IOs in the file that describes the VC (_ogsadai_VC-name.dsr.xml) and which is stored in the RMD, as sketched below.

The process of constructing a VC preserves the virtuality of information sources, because the user has only the GDS as the access point and need know, a priori, neither the structure of the VC, nor the data source locations, nor details about the IOs. Indeed, the user often has no need of such information to operate with the VC: the basic information needed to operate with a VC is contained in the metadata of the VC. Besides, the process supports distributed data sources as well as their heterogeneity. The structure of VCs lets OGSA-DAI services process concurrently any untied activities and sets of activities of the Perform Document.

Among the most useful properties of the proposed technology are the flexibility and scalability of IOs and VCs. An IO can easily be transformed into a new one that provides new properties by adding new activities to the existing IO. Adding an IO to, or excluding it from, a VC is a very simple operation, though strictly speaking it requires changing the meta-description of the VC.
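That simple operation can be sketched as follows. The descriptor file name pattern comes from the text above; the internal element names are assumed for illustration.

```python
import xml.etree.ElementTree as ET

def add_io_to_vc(vc_name: str, io_name: str) -> None:
    """Append an IO's name to the VC descriptor kept in the RMD."""
    path = f"_ogsadai_{vc_name}.dsr.xml"
    tree = ET.parse(path)
    io_list = tree.getroot().find("informationObjects")  # assumed element
    ET.SubElement(io_list, "io", name=io_name)
    tree.write(path, encoding="utf-8", xml_declaration=True)

# Hypothetical usage:
# add_io_to_vc("SyntheticVaccines", "epitopes_hcv")
```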
4. Processing of Meta-Descriptions

One of the key ideas of BiblioGrid is to represent the metadata of IOs and VCs as an IO which is stored in the RMD in the same way as "usual" IOs. It then becomes possible to operate with metadata using the standard tools of the GDS.

Metadata in the RMD are stored in METS format as sets of XML files. The XML-native database Sedna has been used to keep the biomedical VCs and metadata. This choice was motivated by the following advantages of Sedna: the DBMS engine is provided as open-source software and supports XQuery 1.0 and XUpdate for changing data in XML documents [8]. Besides, XML databases are supported by the OGSA-DAI project as a data source. We equipped the basic delivery of the GDS with the appropriate driver in order to give the GDS the ability to interact with Sedna databases. Hence, it is possible to search and change metadata in the Sedna-based RMD by means of standard GDS tools (basic activities).

Metadata are stored in the RMD in METS format [5], which defines three sorts of metadata: descriptive, administrative and structural. All metadata are
stored in the same XML document but constitute different partitions of it. METS provides the ability to describe all sorts of metadata of digital objects. The format does not prescribe the content of the metadata but recommends a basic schema for a document. Note that metadata, like content data (e.g., texts, pictures, videos, etc.), can be stored either inside the METS-formatted XML file or be pointed to by outside links.

One more reason to choose the METS format is the following advantage. Descriptive and structural metadata of an IO are considered part of the content. Hence, only system metadata need be inserted by hand by the author or administrator of the VC, whilst descriptive and structural metadata can be added automatically by services, based on information kept in the configuration files of the IOs which constitute this VC (Fig. 1).
Figure 1. Creation of the meta-description of a Virtual Collection. (The figure shows metadata sections filled in by the author/creator and by the administrator of the collection, and the meta-description of an information object, in Dublin Core format, filled in from the file dataResourceConfig.xml.)
Thus, the creation of metadata for the biomedical VCs of the CN "New Generation of Vaccines and Medical Diagnostic Systems", during the initial filling of the RMD in BiblioGrid, has been automated. A service developed in Perl parses the configuration XML files of the IOs, selects the metadata sections and puts copies of them into the METS-formatted XML files of the corresponding VC in the RMD (a rough sketch of this step follows below). Normally, at least two persons add metadata to the meta-description of a VC: the author and the VC (or RMD) administrator. However, if the VC permits write access for
more than one user, then there may be several sections of metadata, including those generated automatically by the appropriate software services.
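A rough Python analogue of that Perl service illustrates the automation: pull the metadata section out of each IO's dataResourceConfig.xml and append a copy to the VC's METS-formatted meta-description. The element names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def merge_io_metadata(io_config_paths, vc_mets_path):
    """Copy each IO's metadata section into the VC's METS document."""
    vc_tree = ET.parse(vc_mets_path)
    dmd = vc_tree.getroot().find("dmdSec")            # assumed target section
    for path in io_config_paths:
        meta = ET.parse(path).getroot().find("metadata")  # assumed section
        if meta is not None:
            dmd.append(meta)       # automatic: no hand-editing needed
    vc_tree.write(vc_mets_path, encoding="utf-8", xml_declaration=True)
```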
5. Intelligent Data Search in Virtual Collections via Grid Services

The proposed technology virtualizes data processing when operating with VCs. The process of interaction is shown in Fig. 2. A user sends the RMD a query to find appropriate metadata. OGSA-DAI services find the corresponding set of IOs of the required VC and either send the set of metadata of these IOs back to the user, or the GDS interacts with the data resources using the metadata in the IOs and presents to the user the data required by the query, along with the corresponding metadata (Fig. 2).
Figure 2. Interaction with data resources via the GDS using the meta-description of the Virtual Collection. (The figure shows the user sending a request to the GDS of the Repository of Meta-Descriptions, receiving metadata, then requesting data from the GDS of the Virtual Collection of information objects, which interacts with the Sedna repository and with data resources such as an XML database, PostgreSQL and files, returning the final data set together with its metadata.)
Note that every IO keeps information about where its own content data are allocated and how to retrieve them using the GDS. Hence, any query is performed in the virtual information space defined by the metadata stored in the RMD. OGSA-DAI services perform the search through all available heterogeneous resources – different databases, Web sites, etc. – and the user can concentrate on the subject of the research without considering the technical details of the data search.

The GUI used for interaction between users and the GDS has been inherited from the previous, non-Grid version of the CN "New Generation of Vaccines and Medical Diagnostic Systems". This GUI provides a Windows interface to the information space and also includes tools to construct the subject-oriented ontologies which serve as the basis for semantically linking IOs into subject-oriented VCs.
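The interaction of Fig. 2 condenses to a two-step flow, sketched below with placeholder functions standing in for the OGSA-DAI requests: the query first resolves matching IOs against the RMD, then the GDS uses each IO's metadata to fetch the actual data.

```python
def query_rmd(subject: str) -> list[dict]:
    """Placeholder: ask the meta-description repository for matching IOs."""
    return [{"io": "epitopes_hcv", "resource": "postgres://node2/epitopes"}]

def fetch_from_resource(io_descr: dict, query: str) -> str:
    """Placeholder: the GDS resolves the IO's resource and runs the query."""
    return f"rows from {io_descr['resource']} for {query!r}"

def virtual_collection_search(subject: str, query: str):
    ios = query_rmd(subject)                       # step 1: metadata
    return [(io, fetch_from_resource(io, query))   # step 2: data + metadata
            for io in ios]

print(virtual_collection_search("hepatitis C epitopes", "epitope like 'NS3%'"))
```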
Conclusion

Biomedical research requires high-throughput computation, and the computational Grid appears to be already at hand. However, biomedical research needs advanced data management tools to an even greater extent, and this is today a weak point of Grids. This paper describes an approach that makes it possible to manage huge data sets collaboratively in Grids by using existing Grid middleware, which is one of the key requirements for accelerating Grid deployments within biomedical research institutions and commercial firms. The proposed approach combines the strengths of Grid technology regarding information safety in collaborative work with intelligent data search based on the virtualization of data resources and on Virtual Collections built from extended metadata. The use of metadata for Information Objects and Virtual Collections gives OGSA-DAI services the ability to perform effective, intelligent search over all the heterogeneous distributed information resources.
References

[1] The 451 Group: Data Mgmt Limitations Delay Grid Deployments. GRID Today, August 22, 2005.
[2] The OGSA-DAI Project, http://www.ogsadai.org.
[3] A. Joutchkov et al.: Grid-Based Onto-Technologies Provide an Effective Instrument for Biomedical Research. Studies in Health Technology and Informatics, 2005; 112:37-46.
[4] I. Foster, C. Kesselman: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, CA, 1999.
[5] Metadata Encoding & Transmission Standard (METS), http://www.loc.gov/standards/mets.
[6] A. Joutchkov et al.: Development of an interdisciplinary fragment of the Russian GRID segment: state of the art. VIII Int. Workshop on Advanced Computing and Analysis Techniques in Physics Research, ACAT'2002, Moscow, MSU, 2002:30.
[7] Sedna – Native XML DBMS, http://www.modis.ispras.ru/Development/sedna.htm.
[8] XUpdate – XML Update Language, http://www.smb-tec.com/xmldb/xupdate/.
GPS@ Bioinformatics Portal: from Network to EGEE Grid
Christophe Blanchet a,1, Vincent Lefort a, Christophe Combet a and Gilbert Deléage a
a Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1; IFR128 BioSciences Lyon-Gerland; 7, passage du Vercors, 69007 Lyon, France
Abstract: Bioinformatics analysis of the data produced by complete genome sequencing projects is one of the major challenges of the coming years. Integrating up-to-date databanks and relevant algorithms is a clear requirement of such analysis. Grid computing may be a viable solution for distributing the data, algorithms, and computing and storage resources that genomics requires. Providing bioinformaticians with a good interface to a grid infrastructure, such as the one provided by the EGEE European project, is a further challenge to take up. The GPS@ web portal, "Grid Protein Sequence Analysis", aims to provide such a user-friendly interface to these genomic resources on the EGEE grid.
Keywords: Bioinformatics, Grid computing, Tool integration, Web portal
Introduction

Bioinformatics analysis of the data produced by high-throughput biology, for instance genome projects [1], is one of the major challenges of the coming years. Among the requirements of this analysis are access to up-to-date databanks (of sequences, patterns, 3D structures, etc.) and to relevant algorithms (for sequence similarity, multiple alignment, pattern scanning, etc.) [2]. Since 1998 we have been developing the Web server NPS@ ([3], Network Protein Sequence Analysis), which provides the biologist with many of the most common resources for protein sequence analysis, integrated into a common workflow. These methods and data can be accessed through an HTTP connection with a web browser, or from bioinformatics programs like MPSA [4] or AntheProt [5].

Today, the computing resources available behind the NPS@ Web portal limit the capabilities of our server, as they do for other genomics/post-genomics web portals; indeed, some methods consume a great deal of computing time and memory. All these web portals face an increasing demand for CPU and disk resources, and must manage the bioinformatics resources themselves (algorithms, databanks). Most of the time, portal administrators restrict user queries through different levels of access rights to the available methods and databanks.

The Grid computing concept [6], as deployed in the European EGEE project [7], may be a viable way to overcome these resource limitations [8][9]. EGEE's goals are to build a European grid infrastructure, today providing users with more than 15,000 CPUs.

1 Corresponding author: Institut de Biologie et Chimie des Protéines (IBCP UMR 5086), 7 passage du Vercors, 69007 Lyon, France; Christophe.Blanchet@ibcp.fr
These resources are usable by grid users through specific components of the middleware: the user interface (UI), the job description language (JDL) and the job workload management commands. Nevertheless, the EGEE user interface and its usage remain raw and hardly accessible to the non-computer scientist.
1. EGEE: European Grid Infrastructure

1.1. The European Project EGEE

The Enabling Grids for E-sciencE (EGEE [7]) project is funded by the European Commission and aims to build on recent advances in grid technology to develop a service grid infrastructure. EGEE aims to integrate current national, regional and thematic computing and data Grids to create a European Grid-empowered infrastructure for the support of the European Research Area, exploiting the unique expertise generated by previous EU projects (DataGrid, CrossGrid, DataTAG, etc.) and national Grid initiatives (UK e-Science, INFN Grid, NorduGrid, Grid-Ireland, etc.). The EGEE consortium involves 70 leading institutions in 27 countries, federated in regional Grids, with a combined capacity of over 15,000 CPUs: the largest international Grid infrastructure ever assembled.

1.2. The EGEE Infrastructure

The EGEE project is building a grid computing platform in the usual sense of the term [6]: "a grid is a set of information resources (computers, databases, networks, instruments, etc.) that are integrated to provide users with tools and applications that treat those resources as components within a 'virtual' system". The EGEE middleware provides the underlying mechanisms necessary to create such systems, including authentication and authorization, resource discovery, network connections, and other kinds of components. The platform is built on the LCG-2 middleware (LHC Computing Grid, release 2), which was inherited from the EDG middleware developed by the European DataGrid project ([10], FP5 2001-2003), itself initially based on the Globus toolkit [11].

The EGEE middleware lets grid users launch a job on the grid through a User Interface (UI). The job is then processed by the workload management system in the Resource Broker (RB). This component, the RB, determines where and when the submitted job is to be computed: on a given computing element, according to the needed storage element, in the case of simple jobs, or using several of them in the case of large jobs. A computing element (CE) is a cluster of several computing servers, the worker nodes (WN), managed by a scheduler using batch mechanisms such as PBS/Torque. A storage element (SE) is a server providing storage space usable for distributing application data around the grid. The resource broker knows the current state of the grid by querying the information system, which centralizes all the parameters raised by the grid components (clusters, storage, network, etc.). When the available resources have been chosen, the job is transferred to these components and launched. Once it has executed, the resource broker is informed and returns the results to the user interface.
1.3. EGEE Usage: Workload and Data Management

The usage of the EGEE middleware is still raw and hardly accessible to the non-computer scientist. Firstly, the user has to connect to a user interface (UI) machine, either by getting an account on an existing UI or, the harder way, by installing one in their laboratory. The UI needs a dedicated Linux machine, and the installation of the LCG-2 middleware is manual and requires some skills in system administration. Secondly, once the UI is up and ready, a grid user has to deal with the middleware command-line interface (CLI), which provides different sets of programs to manage jobs and data. Moreover, submitting a job means writing a valid JDL (Job Description Language) file that completely describes the job to be run on the grid. The principal actions needed to run a job are job submission (edg-job-submit), getting the status (edg-job-status) and downloading the results (edg-job-get-output). For data management the main programs are data registration (lcg-cr), replication (lcg-rep) and data deletion (lcg-del). These commands are not integrated with one another and need to be executed manually by the user.
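This manual loop is exactly the kind of thing a portal back office scripts away. The sketch below drives the commands named above from Python with the standard library; the output parsing is simplified and would need adjusting to the middleware's real message formats.

```python
import subprocess, time

def run(cmd):
    """Run a middleware command and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def submit_and_wait(jdl_path, poll_seconds=60):
    out = run(["edg-job-submit", jdl_path])
    # Assume the job identifier is the https:// URL in the submit output.
    job_id = next(l for l in out.splitlines() if l.startswith("https://"))
    while "Done" not in run(["edg-job-status", job_id]):
        time.sleep(poll_seconds)            # job passes through Submitted,
    run(["edg-job-get-output", job_id])     # Ready, Scheduled, Running...
    return job_id

# Hypothetical usage:
# submit_and_wait("blast_query.jdl")
```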
2. Bioinformatics Portal on the Grid

As we have seen, the current job submission process on the EGEE platform is relatively complex, as well as non-automated, for the non-computer scientist. Indeed, biologists using the grid have to submit their jobs manually and check the resource broker periodically for the status of each job. A job goes through different steps during the workload management process – "Submitted", "Ready", "Scheduled", "Running", etc. – until it reaches the "Done" status. Finally, they have to retrieve the results with a raw file transfer from the remote storage area to the local file system of their user interface.

2.1. Gridification of Bioinformatics Data and Programs

One major problem with a grid computing infrastructure is the distribution of files and binaries, such as BLAST [12] or ClustalW [13], through the job submission process. Sending the binary of an algorithm to a node on the grid is quite simple because of its size, a few kilobytes, and can be done at each execution, although this is not the most efficient approach. But putting a databank on the grid, from tens of megabytes (such as Swiss-Prot [14]) to gigabytes (such as EMBL [15]), consumes a large part of the network bandwidth if done at each job submission, and greatly lengthens the execution time if done every time a BLAST job is submitted to the grid.

One simple solution is to split databanks into subsets sent in parallel to several nodes of the grid, in order to run the same query on each subset. Another solution is to maintain the commonly used databanks on several storage elements (SE) of the grid and to launch the algorithm on computing resources (worker nodes) close to these SEs. The submission process differs between the two solutions. The algorithm submission processes implemented in our GPS@ portal have been adapted to the EGEE grid context: the algorithms and short datasets are sent at submission time through the grid sandbox mechanism, while the algorithms analyzing large datasets are executed on grid nodes close to the related databanks, which have been replicated earlier or on demand through the replica management system.
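The first strategy, splitting a databank into subsets for parallel querying, can be sketched in a few lines for FASTA-formatted databanks. Result merging, which must re-rank the hits coming back from each subset, is omitted here.

```python
def split_fasta(path, n_subsets):
    """Split a FASTA databank into n_subsets files, balanced by entry count."""
    with open(path) as fh:
        entries, current = [], []
        for line in fh:
            if line.startswith(">") and current:
                entries.append("".join(current))
                current = []
            current.append(line)
        if current:
            entries.append("".join(current))
    # Round-robin assignment keeps the subsets roughly the same size.
    subsets = [entries[i::n_subsets] for i in range(n_subsets)]
    for i, subset in enumerate(subsets):
        with open(f"{path}.part{i}", "w") as out:
            out.writelines(subset)

# Hypothetical usage:
# split_fasta("swissprot.fasta", 8)
```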
2.2. GPS@ – Grid Protein Sequence Analysis

EGEE job submission can be tedious for scientists who are not familiar with advanced computing techniques. We therefore decided to provide biologists with a user-friendly interface to the EGEE computing and storage resources by adapting our NPS@ web portal [3]. The grid portal GPS@ ("Grid Protein Sequence Analysis") simplifies and automates the EGEE grid job submission and data management mechanisms with XML descriptions of the available bioinformatics resources: algorithms and databanks (see Figure 1).
Figure 1. GPS@ architecture and interface to the EGEE grid.
In GPS@ we have simplified the grid analysis query: the GPS@ Web portal runs its own EGEE low-level interface and provides biologists with the same interface they use daily in NPS@. They only have to paste their protein sequences or patterns into the corresponding field of the submission web page; simply pressing the "submit" button then launches the execution of these jobs on the EGEE platform. The whole EGEE job submission – the scheduling and status tracking of the submitted jobs – is encapsulated in the GPS@ back office. Finally, the results of the bioinformatics jobs are displayed in a new Web page (see Figure 2), ready for further analyses or for download in the appropriate data format.
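The form-to-JDL conversion behind that web page can be illustrated with a small generator. The attributes shown (Executable, InputSandbox, and so on) are standard JDL fields; the script and file names are invented, and GPS@'s actual job descriptions are naturally more elaborate.

```python
def blast_jdl(query_file, databank, hits=10):
    """Build a minimal JDL description for a BLAST job (illustrative)."""
    return f"""\
Executable    = "run_blast.sh";
Arguments     = "{databank} {hits}";
StdOutput     = "blast.out";
StdError      = "blast.err";
InputSandbox  = {{"run_blast.sh", "{query_file}"}};
OutputSandbox = {{"blast.out", "blast.err"}};
"""

with open("blast_query.jdl", "w") as fh:
    fh.write(blast_jdl("my_protein.fasta", "swissprot"))
```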
3. Example of Use: Submitting BLAST Analyses to the EGEE Grid

NPS@ [3] provides the biologist with a Web form for entering data (protein sequences) in order to run a BLAST analysis against a given protein sequence database. As in Figure 2, the user simply pastes a protein sequence into the corresponding field and then chooses the database to be scanned with the query sequence. All the available protein databanks can be selected through a multi-valued list on the form.
Selecting the "EGEE" check-box schedules the submission of the BLAST job on the EGEE grid when the "submit" button is clicked.
Figure 2. GPS@ web portal: submission form of a bioinformatics analysis with BLAST.
As the GPS@ portal integrates its own EGEE user interface (see Figure 1), an automated process then submits the BLAST job to the grid. First, the job description in the Web form is converted into a JDL file, which can then be submitted to the workload management system of EGEE. The GPS@ sub-process that submitted the job also checks the status of the job periodically by querying the resource broker with the appropriate commands. All steps are notified to the user through the submission Web page, indicating the time and duration of the current step. When the job is achieved, i.e. has reached the "Done" step, the GPS@ automaton downloads the result file from BLAST. This raw result file in BLAST format is then processed and converted into an HTML page showing, in a coloured and graphical way, the list of similar protein sequences, together with a graph and pairwise alignments (as in Figure 3). This formatting process is directly inherited from the original NPS@ portal, providing biologists with a well-known interface and way of displaying results.
Figure 3. GPS@ web portal: results of a BLAST scan for protein sequence similarity.
4. Conclusion

The GPS@ grid web portal (Grid Protein Sequence Analysis, http://gpsa-pbil.ibcp.fr) is an integrated bioinformatics portal, like the current NPS@ protein portal, that provides the biologist with a user-friendly interface to the GRID resources (computing and storage) made available by the EGEE project (2004-2005). This genomic grid user interface hides the mechanisms involved in executing bioinformatics analyses on the grid infrastructure. The bioinformatics algorithms and databanks have been distributed and registered on the EGEE grid, and GPS@ runs its own EGEE interface to the grid. In this way, the GPS@ portal simplifies bioinformatics grid submission and gives the biologist the benefit of the EGEE grid infrastructure for analyzing large biological datasets: for example, including several protein secondary structure predictions in a multiple alignment, or clustering a sequence set by analyzing, with BLAST or SSEARCH, each sequence against the others.
In the future, our main efforts will focus on taking the specific constraints and requirements of bioinformatics into account on the EGEE grid. That means, for example, including ontology and semantic parameters in the gridified data via the replica manager system. Another effort will concern the security of bioinformatics data and methods on the grid: encryption of data, network isolation and sandboxing of algorithm execution, fine-grained access to data, monitoring of private data transfer and replication, etc.
Acknowledgements

This work has been funded by the GriPPS project (ACI GRID PPL02-05), the EGEE project (EU FP6, ref. INFSO-508833) and the EMBRACE Network of Excellence (EU FP6, LHSG-CT-2004-512092).
References

[1] Bernal, A., Ear, U., Kyrpides, N.: Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 29 (2001) 126-127.
[2] Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L., Deléage, G.: Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Res. 31 (2003) 3393-3399.
[3] Combet, C., Blanchet, C., Geourjon, C., Deléage, G.: NPS@: Network Protein Sequence Analysis. TIBS 25 (2000) 147-150.
[4] Blanchet, C., Combet, C., Geourjon, C., Deléage, G.: MPSA: Integrated System for Multiple Protein Sequence Analysis with client/server capabilities. Bioinformatics 16 (2000) 286-287.
[5] Deléage, G., Combet, C., Blanchet, C., Geourjon, C.: ANTHEPROT: an integrated protein sequence analysis software with client/server capabilities. Comput. Biol. Med. 31 (2001) 259-267.
[6] Foster, I., Kesselman, C. (eds.): The Grid: Blueprint for a New Computing Infrastructure (1998).
[7] EGEE – Enabling Grids for E-sciencE; http://www.eu-egee.org
[8] Vicat-Blanc Primet, P., d'Anfray, P., Blanchet, C., Chanussot, F.: e-Toile: High Performance Grid Middleware. Proceedings of Cluster'2003 (2003).
[9] Jacq, N., Blanchet, C., Combet, C., Cornillot, E., Duret, L., Kurata, K., Nakamura, H., Silvestre, T., Breton, V.: Grid as a bioinformatics tool. Parallel Computing, special issue: High-performance parallel bio-computing, 30 (2004).
[10] EDG – European DataGrid project, http://www.eu-datagrid.org
[11] GLOBUS Project, http://www.globus.org/
[12] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215 (1990) 403-410.
[13] Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 (1994) 4673-4680.
[14] Bairoch, A., Apweiler, R.: The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27 (1999) 49-54.
[15] Stoesser, G., Tuli, M.A., Lopez, R., Sterk, P.: The EMBL nucleotide sequence database. Nucleic Acids Res. 27 (1999) 18-24.
Blast2GO Goes Grid: Developing a Grid-Enabled Prototype for Functional Genomics Analysis

G. Aparicio, S. Götz, A. Conesa, D. Segrelles, I. Blanquer, J.M. García, V. Hernandez, M. Robles, M. Talon
Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas, Universidad Politécnica de Valencia, Camino de Vera S/N, Valencia, Spain; e-mail: gaparicio@itaca.upv.es, iblanque@dsic.upv.es, vhernand@dsic.upv.es, dquilis@itaca.upv.es
Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias, Moncada, Valencia, Spain; e-mail: aconesa@ivia.es, mtalon@ivia.es
BET-ITACA, Universidad Politécnica de Valencia, Camino de Vera S/N, Valencia, Spain; e-mail: stefang@fis.upv.es, juanmig@itaca.upv.es, mrobles@fis.upv.es
Abstract. The vast amount and complexity of data generated in Genomic Research implies that new, dedicated and powerful computational tools need to be developed to meet their analysis requirements. Blast2GO (B2G) is a bioinformatics tool for Gene Ontology-based DNA or protein sequence annotation and function-based data mining. The application has been developed with the aim of offering an easy-to-use tool for functional genomics research. Typical B2G users are middle-size genomics labs carrying out sequencing, EST and microarray projects, handling datasets of up to several thousand sequences. In the current version of B2G, the power and analytical potential of both annotation and function data mining is somewhat restricted by the computational power behind each particular installation. In order to be able to offer the possibility of an enhanced computational capacity within this bioinformatics application, a Grid component is being developed. A prototype has been conceived for the particular problem of speeding up the Blast searches to obtain fast results for large datasets. Many efforts have been made in the literature concerning the speeding up of Blast searches, but few of them deal with the use of large heterogeneous production Grid infrastructures. These are the infrastructures that could reach the largest number of resources and the best load balancing for data access. The Grid Service under development will analyse requests based on the number of sequences, splitting them according to the available resources. Lower-level computation will be performed through mpiBLAST. The software architecture is based on the WSRF standard.
1. INTRODUCTION AND MOTIVATION

The arrival of the genomic technologies to biomedical research has resulted in a drastic change both in the magnitude of the available biomolecular data and in the way scientific questions can now be addressed. The completion of the sequencing of the
human genome, and of those of many other organisms, has resulted in an explosion of molecular biology databases which have been made accessible to the scientific community. Similarly, the generalization in the use of experimental -omics approaches (transcriptomics, proteomics, metabolomics, ...) has opened new perspectives for global studies on the behaviour of cellular components and for the understanding of the molecular interactions that support life. Furthermore, these exciting possibilities directly translate into new challenges for the data processing sciences. In some cases, large datasets would need to be processed by repeatedly applying analytical routines to each data component. In others, the task will be the use of intensive data-mining procedures for extracting relevant information. The bioinformatics discipline has developed from the necessity of providing adequate computational and mathematical solutions to the analysis of this unprecedented amount of biological data.

Blast2GO [1] (B2G) is a bioinformatics tool for Gene Ontology-based DNA or protein sequence annotation and function-based data mining; a detailed description of the tool is given in the next section. The application has successfully been used in many functional genomics projects for these two core functionalities. Typical B2G users are middle-size genomics labs carrying out sequencing, EST and microarray projects, handling datasets of up to several thousand sequences. In the current version of B2G, the power and analytical potential of both annotation and function data mining is somewhat limited by the computational power behind each particular installation.

Annotation has its major bottleneck in the first analysis step, which implies the search of large DNA or protein databases for sequence homologues with the Basic Local Alignment Search Tool (BLAST) algorithm [2]. This is the most extended, but not the only, method for finding functional information for uncharacterized sequences. Most B2G users employ the default application options, which give remote access to NCBI. In some more bioinformatics-experienced laboratories, local blast installations are set up where data sources are queried locally. In any case, blasting will take from seconds to a few minutes per sequence to complete, which means that when thousands of sequences are involved this initial process can last up to several days.

On the other hand, the statistical data-mining methods currently offered by the application are restricted to descriptive and univariate functions. This part of the tool could greatly benefit from other powerful statistical approaches of increasing interest in functional genomics analysis, such as those based on machine learning or those involving stochastic searches in the multivariate space. However, their incorporation into a highly interactive tool such as B2G appears not very appealing, as they are highly CPU-intensive and time-consuming methodologies.

The sequence-based structure of the datasets used by B2G and by many other functional genomics analysis tools suggests that process parallelisation could be a suitable solution to the problem of long computing times and heavy computational tasks. Substantially faster data processing would not only result in an improvement of the performance of the tool but could also open the door to experimentation within the analytical routines.

In this paper we present our approach for incorporating Grid technology into a functional genomics tool such as B2G. Prototype development will be done for the particular problem of speeding up the Blast searches to achieve fast results for large datasets. A future follow-up area of Grid technology application will be the integration of advanced, computing-intensive statistical algorithms. The connectivity to a Grid environment will ease the future migration to the Grid of heavy-weight computing processes.
STATE OF THE ART

This section first describes the B2G application that will transparently connect users to the Grid, based on the software architecture presented in this paper. Next, a review of the main issues related to other Grid projects addressing bioinformatics, and of their current limitations, is given. The software architecture proposed here aims at profiting from the experience gained in those projects and proposes new functionalities not covered by them.

Blast2GO

Blast2GO is a Java application conceived and designed with the aim of providing the Functional Genomics Research community with an easy-to-run tool for sequence annotation and gene-function-based data mining. The application relies on five major interactive analysis processes that together provide these two main functionalities. Usually, B2G users start their analysis by running a BLAST search. The Basic Local Alignment Search Tool is the universal algorithm for querying protein and DNA databases for sequence similarities. A group of selected sequences is blasted against either public or custom databases to obtain so-called homologues: similar sequences that derive from a common ancestor and putatively share a common function. This first step can be modulated by the user through the adjustment of various parameters of the algorithm, such as the minimum similarity between sequences, the number of returned hits or the database to be searched. B2G supports different ways of running Blast searches. NCBI services can be accessed through HTTP to obtain results against public datasets without any maintenance work. Alternatively, a local Blast installation can be accessed through B2G, running on custom sequence databases, with the restriction of lower performance and maintenance efforts, like database updates. Typical uses of Blast2GO involve datasets of several thousands of sequences. The application launches Blast searches sequentially, and results are processed as they are retrieved by the parsing module. The time necessary to obtain a Blast result is sequence- and database-dependent and can vary between seconds and a few minutes, which turns blasting into the rate-limiting step of the analysis process. The second step is the retrieval of already known biological functions and other annotations for the found homologues. Here, B2G makes use of the Gene Ontology (GO), a structured vocabulary of biological terms. GO can be considered the de facto standard for functional sequence and genome annotation, found in most sequence databases exploited by the scientific community. During the third, or annotation, step, the collected GO information is evaluated by a user-adjustable rule which finally assigns GO terms to the query sequences. Once GO annotations are generated for the query sequences, the second functionality of B2G becomes active. Single or group annotations can be displayed in graph form, reconstructing the GO-term relationships inherited from the ontological structure and colour-highlighting the most relevant terms to ease biological interpretation. Statistical analyses are additionally offered by the tool to identify, for example, differences in GO-term distribution between subsets of sequences. This is a key analysis in functional genomics experiments, where the relative importance of individual biological processes within the obtained results needs to be evaluated. At each of the above steps, different charts are available to evaluate the progress of the analysis, and data can be saved and exported in different formats.
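As an illustration of the annotation step just described, the sketch below scores each candidate GO term by the best weighted support among the Blast hits that carry it, and assigns the terms that reach a user-adjustable threshold. The class names, the weighting by similarity and evidence code, and the threshold rule are simplifying assumptions made for illustration; Blast2GO's actual annotation rule is more elaborate.

    import java.util.*;

    // Simplified, hypothetical version of a user-adjustable annotation rule:
    // GO terms collected from the homologues are scored and the best-supported
    // ones are assigned to the query sequence.
    class BlastHit {
        double similarity;       // similarity to the query (e.g. percent identity)
        double evidenceWeight;   // weight of the GO evidence code of this hit
        Set<String> goTerms;     // GO terms annotated to the homologue
        BlastHit(double sim, double w, Set<String> terms) {
            similarity = sim; evidenceWeight = w; goTerms = terms;
        }
    }

    public class AnnotationRule {
        /** Returns the GO terms whose best weighted support reaches the threshold. */
        public static Set<String> annotate(List<BlastHit> hits, double threshold) {
            Map<String, Double> support = new HashMap<>();
            for (BlastHit h : hits) {
                double s = h.similarity * h.evidenceWeight;   // weighted support
                for (String term : h.goTerms) {
                    support.merge(term, s, Math::max);        // keep the best hit
                }
            }
            Set<String> assigned = new HashSet<>();
            support.forEach((term, s) -> { if (s >= threshold) assigned.add(term); });
            return assigned;
        }
    }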
Since its publication in Bioinformatics in September 2005, B2G has been used in numerous functional genomics projects. The availability of a feedback form at the application's website, and of a users' group, has allowed us to track the use of the tool and to collect opinions and suggestions from the user community. Our general impression is that researchers are pleased with the user-friendly setup, but, since normally large datasets are used, waiting times between analysis steps can be long, especially during the Blast search. Improvements in this direction would strongly increase the interactivity and dynamics of the usage of the software. Figure 1 gives an overview of B2G and highlights points that could benefit from high-throughput computing techniques.
Figure 1. B2G overview. The figure shows schematically the architecture of B2G; the symbols used are described in the embedded legend. The figure is to be read from left to right and represents a typical run of the application. Darker boxes highlight planned Grid modules.
High-Performance and Grid Computing in Genetics

Many projects are currently working on advanced computing applied to genomics research and data mining. These projects often use Grid and parallel computing, although it is unusual to find both approaches combined. This section includes a brief review of some of the most relevant ones.

The National Centre for Biotechnology Information (NCBI) provides one of the most widely used repositories of Blast tools. NCBI offers via its website binaries for local Blast installations on different platforms, as well as the extensively used QBlast, a web interface to execute remote Blast searches against the NCBI cluster. In the case of B2G, both approaches lack the desired processing speed: remote QBlast usage is limited since it is a worldwide shared resource, and local installations are bound to the site's limitations.

Another solution, proposed by MPI-BLAST [3], is to parallelize Blast, provided that an institution has sufficient resources. MPI-BLAST is a freely available, open-source parallelization of the accepted NCBI Blast. MPI-BLAST segments the Blast database and distributes it across cluster nodes, enabling Blast queries to be processed simultaneously on many nodes. Partial results are fused after each run, recalculating statistical values and generating NCBI-Blast-like output. MPI-BLAST is based on MPI and runs under Linux, Windows and several flavours of UNIX. MPI-BLAST reduces the computing time for an individual search and forms part of the
solution proposed here, as its basic computing kernel. However, for the sake of scalability, higher-level approaches must be considered if the required number of computing nodes exceeds the bounds of a single institution.

When computing needs go beyond an institution's capabilities, the use of Grid computing has proved to be a successful approach. There are global solutions, such as myGrid [4], a UK e-Science project funded by the EPSRC and involving five UK universities, the European Bioinformatics Institute (EBI) and many industrial collaborators. The myGrid project emphasizes the Information Grid and is building high-level services for data and application resource integration, such as resource discovery, workflow enhancement and distributed query processing.

Additionally, there are several Grid-enabled versions of Blast, such as the one developed by a collaboration of the NCSA and the University of Illinois [5], the GEB developed at INRIA [6], and other similar developments. The architecture of those systems is quite similar: they split both the input (throughout a grid) and the databases, as in the MPI-BLAST approach. However, those approaches are either standalone applications or command-line tools, mainly focused on dealing with intra-organisation resources rather than with a heterogeneous, large Grid network.

Focused on exploiting resources at large scale in a Grid infrastructure, the WISDOM project [7] aims to demonstrate the relevance and the impact of the Grid approach in addressing drug discovery for neglected diseases. This first biomedical data challenge on the EGEE infrastructure is a scalability step towards a full in silico drug discovery platform. The computing approach of this project has managed to achieve a productivity of tens of CPU-years in just one month. WISDOM, however, lacks an easy-to-use interface and is not a general-purpose tool.

Concerning usability, there is a clear need to ease the interfacing between the Grid and the users. In this context, the Grid Protein Sequence @nalysis [8] (GPS@) is an integrated Grid portal devoted to molecular bioinformatics. GPS@ uses a Grid computing infrastructure provided by the EGEE European project to find a viable solution for distributing data, algorithms, and computing and storage resources for genomic research. The current version is under development and deployed on a High Energy Physics Grid middleware, the LHC Computing Grid [9] (LCG). GPS@ is a migration of the Network Protein Sequence Analysis (NPSA) services onto the EGEE grid; NPSA is a production web portal hosting protein databases and algorithms for sequence analysis. Although GPS@ can be used to search for protein sequence similarities with Blast via the Grid, it is not possible to modify algorithm parameters, nor to easily submit large blocks of data. Furthermore, web interfacing does not fit the needs of a development environment.

Therefore, our approach is to deal with a large, heterogeneous network of resources (as in the WISDOM project), splitting the work in a similar way to GRID-BLAST, and using MPI-BLAST for the parallelization of single searches. The interface is provided by the Web Services Resource Framework [10] (WSRF), based on standard Web Services protocols, to ease the integration with B2G.

ARCHITECTURE

The software architecture proposed here aims at bridging biomedical data-mining applications and Grid technologies to solve problems that require large amounts of resources, both computing and storage. This section describes the different layers in
which it is structured and the components integrated into them. This architecture is based on other Grid-interfacing work [11]. Special attention will be paid to the components needed for the integration of MPI-BLAST, which will be the processing engine executing large sets of sequences from the B2G interface, received through the implemented Grid Service.

The architecture consists of four layers. The two lower layers directly interact with the Grid, and the two higher layers provide an abstract and uniform interface to the B2G application. The layers proposed (from higher to lower) are: a) Application Layer, b) Components Middleware Layer, c) Gate-to-Grid Layer, and d) Grid Layer. Figure 2 shows a schema of the different layers and their interactions.
[Figure: layered schema showing the Application Layer (the Blast2GO application and other user applications), the Components Middleware Layer (the C_GRID_MPIBlast component and other components with their proxies), the Gate-to-Grid Layer (the WSRF_MPIBlast service and other WSRF services hosted in a Web Services Resource Framework (WSRF) container, plus an LCG User Interface), and the Grid Layer, connected through HTTPS with certificates.]
Figure 2. General view of the architecture and the MPI-BLAST components.
The next sections describe the different layers in detail, indicating the defined requirements, such as protocols, structures, data and interface definitions.

Application Layer

This layer comprises the execution modules that give user-interface applications access to the resources, and the user-interface applications themselves. The architecture proposed in this work is compatible with different user-interface applications. The components to be developed in this layer have the objective of abstracting the access to the resources of the Grid, both software and hardware. Generally, applications are not supposed to be deeply changed when connected to the Grid. In the case of B2G, the application performs one Blast request per sequence, which would be inadequate in a Grid environment, considering the latencies and the Grid workload. When the Grid environment is used for blasting, the application layer in B2G will compile requests and build high-level jobs of adequate granularity to achieve the desired efficiency.

The component analyses the number of sequences to be processed and the number of available compatible resources. Considering those facts, it collects individual requests up to a reasonable packet size and distributes the packets among the different MPI-BLAST resources. Results are collected progressively and are directly returned to the application, to keep the user informed of the analysis progress.
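The packet-building policy is not specified further in the text; the sketch below shows one plausible reading, in which the packet size is derived from the number of sequences and the number of compatible resources, subject to a minimum granularity. The class name, the sizing heuristic and the minimum packet size are assumptions for illustration only.

    import java.util.*;

    // Hypothetical sketch: group individual Blast requests into high-level jobs
    // of reasonable granularity before submitting them to the Grid.
    public class JobPacker {
        public static List<List<String>> buildPackets(List<String> sequences,
                                                      int availableResources,
                                                      int minPacketSize) {
            // Use at most one packet per resource, but never packets smaller
            // than the minimum worth a grid submission.
            int packets = Math.max(1, Math.min(availableResources,
                    sequences.size() / minPacketSize));
            int size = (int) Math.ceil(sequences.size() / (double) packets);
            List<List<String>> jobs = new ArrayList<>();
            for (int i = 0; i < sequences.size(); i += size) {
                jobs.add(sequences.subList(i, Math.min(i + size, sequences.size())));
            }
            return jobs;   // one grid job per packet
        }
    }

For example, 5,000 sequences, 20 compatible MPI-BLAST resources and a minimum packet size of 100 would yield 20 jobs of 250 sequences each.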
Components Middleware Layer

The components of this layer have two interfaces: one for interacting with the Application Layer, and another for interacting with the resources defined in the Gate-to-Grid Layer. The application interface is implemented by object-oriented components. The interface to the resources uses proxies that enable the components to interact with the resources on behalf of the user; this interface is implemented through WSRF Grid Services. The data exchanged with the Grid Services are coded using the eXtensible Markup Language [12] (XML). The communication protocol is the Simple Object Access Protocol [13] (SOAP) over HTTPS, to preserve the security of the connections.

For the interaction with MPI-BLAST, a component called C_GRID_MPIBLAST is defined, which will be in charge of requesting the processing, retrieving the results and consulting the status in real time. This component provides the application with a class for generating the objects that are used directly by the B2G application. Furthermore, this component interacts with the Grid Service WSRF_MPIBLAST for the execution of the parallel tasks, using its corresponding proxy.

The main methods defined in the component C_GRID_MPIBLAST for interacting with the B2G application are the following:

iStartBLAST: Sends an execution request with a set of input sequences, using different databases. Input: hash table of input sequences; arguments (database names, Blast hits, etc.). Return: the status of the execution.

iStopBLAST: Stops an execution started by iStartBLAST and cancels all associated jobs. Return: successful stop or error.

iGetStatus: Returns the number of sequences processed and the available results. Return: number of results finished and retrieved.

vGetFinishedResults: Returns a hash table with the results that are finished but not yet retrieved. Return: hash table with the results.

Gate-to-Grid Layer

This layer is also divided into two interfaces. The first interface interacts with the components of the Components Middleware Layer, using proxies and the interface offered by the Grid Service; this interface is defined using the Web Service Definition Language [14] (WSDL). The second interface interacts directly with the resources (clusters of computers, databases, workload managers, etc.).

Concerning the MPI-BLAST component, the solution selected here is the usage of the EGEE infrastructure, which currently comprises a very large number of connected computers. The application considers the EGEE deployment as a resource
and will be implemented as a WSRF Grid Service offering an interface composed of the following methods:

iInitSession: Session initialisation in the EGEE Grid environment. Input: user identifier; password of the user. Return: the session identifier.

iLaunchBLAST: Submits a job, split into different tasks, in the EGEE environment. Input: identifier of the session; input sequences in an XML document; data parameters (database names, Blast hits, etc.) in XML. Return: the status of the submission.

iGetStatus: Returns the number of sequences processed and the available results. Return: number of results finished and obtained.

xmlGetFinishedResults: Returns an XML document with the results finished and not yet retrieved for a given session. Input: identifier of the session. Return: XML document with the results.
The functionality of this resource is to get the input sequences and split them into different jobs. Depending on the computational resources available in the EGEE deployment, each job will have a number of input sequences and will use the MPI-BLAST algorithm to process them. This information is provided by the information systems of EGEE, the number of processing nodes (Worker Nodes) and the number of processors being the main items of information.
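A Java rendering of this Grid Service interface is sketched below. The concrete parameter and return types are assumptions made for illustration; in the service itself the sequence data and parameters travel as XML documents over SOAP/HTTPS, as described above.

    // Hypothetical Java view of the WSRF_MPIBLAST Grid Service interface.
    public interface WsrfMpiBlastService {

        // Session initialisation in the EGEE Grid environment;
        // returns the session identifier.
        String iInitSession(String userId, String password);

        // Submits a job, split into several EGEE tasks;
        // returns the status of the submission.
        String iLaunchBLAST(String sessionId, String sequencesXml,
                            String parametersXml);

        // Number of sequences processed and results already available.
        int iGetStatus(String sessionId);

        // XML document with the finished, not yet retrieved, results.
        String xmlGetFinishedResults(String sessionId);
    }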
Figure 3. General view of the WSRF Grid Services component in the Gate-to-Grid Layer.
Grid Layer

This layer corresponds to the lowest level of the architecture. It deals, in general, with the use of the Grid infrastructure. In the case of MPI-BLAST, it is necessary to use a Grid middleware that is efficient in dealing with high-productivity processing, since the input sequences in the MPI-BLAST process are divided into different jobs that will be executed concurrently on different databases.

The Grid middleware used in this work is LCG, although a migration to gLite [15] will be performed when available (this will result in a joined distribution of LCG and gLite). Both middlewares offer the 'single computer' vision of the Grid through the storage, catalogue and workload management services, which tackle the problem of selecting the most suitable resource. The LCG Grid middleware comprises the following elements:

• IS-BDII (Information Service - Berkeley Database Information Index): this element provides the information about the Grid resources and their status.
• CA (Certification Authority): the CA signs the certificates of both resources and users.
• CE (Computing Element): defined as a queue of Grid jobs. A Computing Element is a farm of homogeneous computing nodes called Worker Nodes.
• WN (Worker Node): a computer in charge of executing jobs.
• SE (Storage Element): an SE is a storage resource in which a task can store data to be used by the computers of the Grid.
• RC/RLS (Replica Catalogue / Replica Location Service): the elements that manage the location of the Grid data.
• RB (Resource Broker): the RB performs the load balancing of the jobs in the Grid, deciding in which CEs the jobs will be launched.
• UI (User Interface): this component is the entry point of the users to the Grid and provides a set of commands and APIs that can be used by programs to perform different actions on the Grid.
• JDL (Job Description Language): the way in which jobs are described. A JDL file is a text file specifying the executable, the program parameters, the files involved in the processing and other additional requirements.

Security

Although sequence data might not be as privacy-sensitive as medical records, it is not infrequent that nucleic acid or protein sequences have patent potential or are related to projects where confidentiality has to be preserved. For this case, the solution presented in this work uses secure channels with encryption.

Regarding access to the system, the different layers of the defined architecture take different approaches to the implementation of security. Basically, it can be divided into two parts: one part is related to the WSRF environment, and the other part deals with the EGEE Grid environment. In both cases, secure protocols are used for the communication.

In relation to the scope of security of the WSRF services, the architecture defines a client layer (middleware components) for interacting with the system. For the WSRF services, the SOAP protocol is used on top of HTTPS, which is based on the Secure Sockets Layer [16] (SSL).
The HTTPS protocol guarantees the privacy of the data, and the use of digital certificates guarantees the authentication of the user. The Grid middleware used provides a native security infrastructure, namely the Grid Security Infrastructure (GSI) [17]; GSI is also based on SSL. Before accessing any resource of the Grid, a proxy must be created from the client certificate. The certificate is duly signed by the Certificate Authority (CA). Each resource of the Grid establishes a mapping of the Distinguished Name (DN) obtained from the proxy. As a result, each deployed resource in the Grid is certified by a valid CA.

In the described architecture, the Gate-to-Grid is the common point between the WSRF and EGEE Grid environments. A mapping system for the WSRF users, through the Web user certificate, has been implemented in the Gate-to-Grid layer, associating the Web users with the EGEE Grid users. For each user, a Grid proxy is created from their Grid user certificate.
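In LCG/Globus-based installations, this DN mapping is conventionally kept in a grid-mapfile, with each line associating a certificate DN with a local or pool account. The entries below are hypothetical examples, not taken from the deployment described here:

    # grid-mapfile: certificate DN -> local account (hypothetical examples)
    "/C=ES/O=example-ca/OU=lab/CN=Jane Doe"         jdoe
    "/C=ES/O=example-ca/OU=portal/CN=WSRF Gateway"  .biomed

A leading dot denotes a pool of generic accounts, a common way to map many Grid users onto a limited set of local accounts.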
Figure 4. Security schema of the proposed architecture.
EXPECTED RESULTS AND BENEFITS

The most direct result expected from this project will be the creation of the architecture required for launching Blast processes from the Blast2GO application onto a Grid system. As such, speeding up the Blast process for thousands of sequences has the clear benefit of increased performance and time gain. Additionally, the availability of a fast-delivering Blast service can have further implications for the way this analysis procedure can be used. Typically, annotation projects run, at an early stage, one Blast search against one or a few selected databases, and Blast updates are not frequent. However, public databases are constantly and rapidly increasing in number and size through the contributions of individual gene characterizations and massive sequencing projects. The ease of Blast updates that a Grid environment could offer would optimize the exploitation of new sequence data, not only for B2G but also for similar genomic applications or experiments that use Blast results.

Another aspect that would be enhanced is multi-database querying, as Blast searches against different databases could easily be distributed to different CPUs and run in parallel. All these improvements in Blast performance mean that experimentation with Blast parameters would become feasible; for example, Blast2GO studies on the effectiveness of GO-term annotation in relation to Blast parameters could easily be carried out.

Moreover, Blast is not the only time-consuming process of the Blast2GO tool: other functions, such as mapping, GO-term selection, annotation enhancement and GO-slim projections, are also sequence-wise processes that can become highly CPU-consuming and that could greatly benefit from a Grid architecture. Similarly, other
intensive annotation strategies, such as Pfam [19] and InterPro [20], could also be incorporated into the tool.

Last but not least, a very interesting potential benefit of implementing Grid technology in a functional genomics tool such as Blast2GO relates to its functionality for annotation-based data mining of experimental genomics data. Current approaches are mostly based on univariate statistical methods. Other very interesting methodologies in this area are multivariate methods coupled to stochastic searches, which are conceptually susceptible to division. Grid technology would then offer the possibility of realizing this distribution of the search process and therefore of reducing computing time.

Finally, there are also initiatives such as the Bioinformatics Grid Application for Life Science (Bioinfogrid) or Supporting and Structuring Healthgrid Activities and Research in Europe [18] (SHARE) which are fostering the use of Grids in the life-science community and which could play an important role in promoting the usage of Grids in biocomputation, opening the doors to more collaborations.

REFERENCES
[1] Conesa A., Götz S., García-Gómez J.M., Terol J., Talón M. and Robles M., Blast2GO: A Universal Tool for Annotation, Visualization and Analysis in Functional Genomics Research, Bioinformatics, 2005.
[2] Altschul S.F., Gish W., Miller W., Myers E.W. and Lipman D.J., Basic Local Alignment Search Tool, Journal of Molecular Biology, 1990.
[3] "mpiBLAST: Open-Source Parallel Blast", http://mpiblast.lanl.gov
[4] "MyGrid Project", http://www.mygrid.org.uk
[5] "NCSA Grid-Aware GridBlast System", http://bioinf.ncsa.uiuc.edu/gridblast/index.html
[6] "Grid Enabled Blast (GEB) with Distributed Objects and Components", http://www-sop.inria.fr/oasis/Stages/BioProActiveCaromel.html
[7] "Wide In Silico Docking On Malaria", http://wisdom.healthgrid.org
[8] "Grid Protein Sequence @nalysis: Bioinformatics Web Portal Dedicated to Protein Sequence Analysis on the GRID", http://gpsa.ibcp.fr
[9] "World Wide Web Computing Grid: Distributed Production Environment of Physics Data Processing", http://lcg.web.cern.ch/LCG
[10] "The WS-Resource Framework", http://www.globus.org/wsrf
[11] Ignacio Blanquer, Vicente Hernández, Ferran Mas, Damià Segrelles, "A Framework Based on Web Services and Grid Technologies for Medical Image Registration", International Symposium ISBMDA, Aveiro, Portugal, November 2005, Proceedings.
[12] Allen Wyke R., Watt A., "XML Schema Essentials", Wiley Computer Publishing.
[13] "SOAP Version 1.2", http://www.w3.org/TR/soap
[14] "Web Services Description Language (WSDL), W3C Note, March 2001", http://www.w3.org/TR/NOTE-wsdl
[15] "Lightweight Middleware for Grid Computing", http://glite.web.cern.ch/glite
[16] Rolf Oppliger, "Security Technologies for the World Wide Web", Second edition, Computer Security Series, Artech House.
[17] "Grid Security Infrastructure", http://www.globus.org/security/overview.html
[18] "Supporting and Structuring HealthGrid Activities & Research in Europe" (SHARE), Technical Annex, IST.
[19] Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L. and Sonnhammer E.L., The Pfam Protein Families Database, Nucleic Acids Res.
[20] The InterPro Consortium: R. Apweiler, T.K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, L. Cerutti, F. Corpet, M.D.R. Croning, R. Durbin, L. Falquet, W. Fleischmann, J. Gouzy, H. Hermjakob, N. Hulo, I. Jonassen, D. Kahn, A. Kanapin, Y. Karavidopoulou, R. Lopez, B. Marx, N.J. Mulder, T.M. Oinn, M. Pagni, F. Servant, C.J.A. Sigrist and E.M. Zdobnov, The InterPro Database, an Integrated Documentation Resource for Protein Families, Domains and Functional Sites, Nucleic Acids Research.
Bioprofiling over Grid for eHealthcare
L. Sun a, P. Hu a, C. Goh a, B. Hamadicharef a, E. Ifeachor a,1, I. Barbounakis b, M. Zervakis b, N. Nurminen c, A. Varri c, R. Fontanelli d, S. Di Bona d, D. Guerri d, S. La Manna d, K. Cerbioni e, E. Palanca e and A. Starita e
a School of Computing, Communications and Electronics, University of Plymouth, UK
b Telecommunication System Institute, Technical University of Crete, Greece
c Institute of Signal Processing, Tampere University of Technology, Finland
d Synapsis S.r.l. in Computer Science, Italy
e Computer Science Department, University of Pisa, Italy
Abstract. A trend in modern medicine is towards individualization of healthcare and, potentially, grid computing can play an important role in this by allowing sharing of resources and expertise to improve the quality of care. In this paper, we present a new test bed, the BIOPATTERN Grid, which aims to fulfil this role in the long term. The main objectives in this paper are 1) to report the development of the BIOPATTERN Grid for biopattern analysis and bioprofiling in support of individualization of healthcare. The BIOPATTERN Grid is designed to facilitate secure and seamless sharing of geographically distributed bioprofile databases and to support the analysis of bioprofiles to combat major diseases such as brain diseases and cancer within a major EU project, BIOPATTERN (www.biopattern.org); 2) to illustrate how the BIOPATTERN Grid could be used for biopattern analysis and bioprofiling for early detection of dementia and for brain injury assessment on an individual basis. We highlight important issues that would arise from the mobility of citizens in the EU, such as those associated with access to medical data, ethics and security; and 3) to describe two grid services - a crawling service and remote data acquisition - which aim to integrate the BIOPATTERN Grid with existing grid projects and which are necessary to underpin the use of the test bed for biopattern analysis and bioprofiling. Keywords. HealthGrid, Healthcare, Grid computing, Crawling service, Remote data acquisition, Dementia, Brain Injury, Bioprofiling, Biopattern analysis
1. Introduction

There is a growing interest in the application of grid computing to healthcare to support data-, computation- and/or knowledge-intensive tasks in areas such as diagnosis, prognosis, disease prediction and drug discovery. Often, this involves the acquisition, analysis and visualisation of biomedical data (medical informatics + bioinformatics). Examples of healthcare applications include distributed mammography data retrieval and processing (e.g. the MammoGrid [1] and eDiaMoND [2] projects), and multicentre neuro-imaging (e.g. BIRN [3]). There is a trend in modern medicine towards individualization of healthcare and, potentially, grid computing can also play a role in

1 Corresponding Author: Professor Emmanuel Ifeachor, School of Computing, Communications and Electronics, University of Plymouth, Plymouth PL4 8AA, UK; E-mail: E.Ifeachor@plymouth.ac.uk.
this by allowing sharing of resources and expertise to improve the quality of care. In this paper, we report efforts to exploit grid computing to support individualization of healthcare to combat major diseases such as brain diseases within a major EU-funded Network of Excellence (NoE) project, BIOPATTERN (www.biopattern.org). The Grand Vision of the project is to develop a pan-European, coherent and intelligent analysis of a citizen's bioprofile; to make the analysis of this bioprofile remotely accessible to patients and clinicians; and to exploit bioprofiles to combat major diseases such as cancer and brain diseases. A biopattern is the basic information (pattern) that provides clues about underlying clinical evidence for diagnosis and treatment of diseases. Typically, it is derived from specific data types, e.g. genomic and proteomic information and biosignals, such as the electroencephalogram (EEG) and Magnetic Resonance Imaging (MRI). A bioprofile is a personal 'fingerprint' that fuses together a person's current and past medical history, biopatterns and prognosis. It combines data, analysis, and predictions of possible susceptibility to diseases. It will drive individualization of care. The project aims to make information from distributed databases available in a secure way over the Internet, and to provide on-line algorithms, libraries and processing facilities, e.g. for intelligent remote diagnosis and consultation. Potentially, a grid-enabled network can facilitate seamless sharing of, and pervasive access to, such distributed databases, and can support online bioprofile analysis and diagnosis. The main objectives in this paper are 1) to report the development of a new Grid test bed, the BIOPATTERN Grid, for biopattern analysis and bioprofiling in support of individualization of healthcare. The BIOPATTERN Grid is designed to facilitate secure and seamless sharing of geographically distributed bioprofile databases and to support analysis of biopatterns and bioprofiles to combat major diseases such as brain diseases and cancer; 2) to illustrate how the BIOPATTERN Grid could be used for bioprofiling for early detection of dementia and for brain injury assessment on an individual basis. We highlight important issues that would arise from the mobility of citizens in the EU, such as those associated with access to medical data, ethics and security; 3) to describe some of the important services that are required to underpin the use of the test bed for biopattern analysis and bioprofiling, including crawling and data acquisition. The remainder of the paper is organised as follows. In Section 2, the BIOPATTERN Grid architecture and prototype are described. In Section 3, two applications of the BIOPATTERN Grid in brain diseases (for dementia and brain injuries) are presented. In Section 4, two specific grid services - crawling and remote data acquisition services - are discussed. Section 5 concludes the paper.
2. BIOPATTERN Grid Architecture and Prototype
2.1. BIOPATTERN Grid Architecture

The architecture of the BIOPATTERN Grid is divided into four layers, as shown in Figure 1. The Grid Portal serves as an interface between an end user (e.g. a clinician or a researcher) and the BIOPATTERN Grid. At the client side, an end user accesses the Grid Portal via a web browser. After user authentication (login/password), the end user can then make use of the services provided by the BIOPATTERN Grid. The Grid Portal sits on a web server with relevant components (e.g. databases and classes) and establishes connections between the end user and the lower-layer grid services.
[Figure: BIOPATTERN Grid architecture - a Grid Portal on top; a grid services layer offering (remote) data acquisition, data analysis & visualization, and data/information query & crawling; Grid middleware; and distributed resources (sensor networks, databases, algorithms, computation).]
Figure 1. BIOPATTERN Grid Architecture
The grid services layer provides services for data acquisition, data analysis & visualization, and data/information query and crawling. For data acquisition, a patient's clinical data, electrophysiological data (e.g. EEG), imaging data (e.g. MRI) and bioinformatics data (including biochemical and genomic data) can be either uploaded via a Grid Portal or transferred from remote data acquisition networks. The data is stored in distributed databases. For data analysis, different biodata analysis algorithms are stored in distributed algorithm pools. The analysis algorithms may be used to generate biomarkers to quantify disease severity and to support medical decision-making. The results of the analysis are displayed in a user-friendly manner via the Portal. The computational requirements for biodata analysis using complicated algorithms are met by High Performance Computing (HPC) or High Throughput Computing (HTC) resources. The data/information query services enable the user, for example, to query existing patient information or to search medical information with the help of a crawling service. The Grid middleware provides grid functionalities for security (authentication/authorization), resource management (e.g. resource allocation and job manager), information service (monitoring and discovery system for resource availability), data management (GridFTP and replica management) and data services support (e.g. Open Grid Service Architecture Data Access and Integration, OGSA-DAI). The Globus Toolkit 4 (GT4) [4] is chosen to implement the Grid middleware functions. Condor is used for job queuing and scheduling and to provide high-throughput computing; a minimal example of a Condor job description is sketched below. The bottom layer, the grid resource layer, contains computational resources, data resources (e.g. relational databases), knowledge resources (e.g. software codes for computational intelligence algorithms) and networks (e.g. sensor networks for data acquisition).

2.2. BIOPATTERN Grid Prototype

The prototype BIOPATTERN Grid aims to provide a platform for clinicians and researchers within the BIOPATTERN Consortium to share information in distributed bioprofile databases and computational resources, to facilitate the analysis, diagnosis and care for brain diseases and cancer. Currently, the prototype connects five sites: the University of Plymouth (UOP), UK; the Telecommunication System Institute (TSI), Technical University of Crete, Greece; the University of Pisa (UNIPI), Italy; Synapsis
S.r.l. (Synapsis), Italy; and Tampere University of Technology (TUT), Finland (see Figure 2). Each site may hold bioprofile databases, Grid nodes, a Condor pool, a high-performance cluster, an algorithms pool, a Grid portal, or an interface to remote data acquisition networks. At present, the bioprofile databases contain basic patient clinical information, EEG data (awake EEG at resting state) for dementia, and EEG data (MVEP) for brain injuries. The data are distributed into bioprofile databases at TUT, TSI, and/or UOP. The pool of algorithms, which is located at the UOP site, includes analysis algorithms for brain diseases, such as the Fractal Dimension (FD) and Independent Component Analysis (ICA) algorithms. In addition, UOP provides Grid nodes with Globus, a Condor pool with 50 nodes and a web server to host the Grid Portal. Between UOP, TSI and TUT, we have developed two applications of the BIOPATTERN Grid for brain diseases (dementia and brain injury) (see Section 3). UNIPI and Synapsis are connected to the BIOPATTERN Grid via a Grid Node based on GT4. At UNIPI, the crawling services will be adapted to the BIOPATTERN Grid. Synapsis will provide an interface to the (remote) wireless acquisition network for automated remote data acquisition (see Section 4).
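As a concrete illustration of the Condor usage mentioned above, a job that runs an analysis algorithm over one EEG dataset could be described with a submit file like the following. The executable and file names are hypothetical:

    # Hypothetical Condor submit description for one analysis job
    universe   = vanilla
    executable = fd_analysis
    arguments  = mike_eeg_session1.dat
    output     = fd_session1.out
    error      = fd_session1.err
    log        = fd_session1.log
    queue

Submitting many such jobs (one per recording) lets the Condor pool schedule them across the 50 nodes as they become free.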
[Figure: the five prototype sites (UOP, TSI, TUT, UNIPI and Synapsis) connected over the Internet via HTTPS; the sites hold Grid nodes (Globus), bioprofile databases, Condor pools, the algorithms pool, a user interface, a wireless acquisition network at TSI, and the web server hosting the Grid Portal at UOP.]

Figure 2. BIOPATTERN Grid Prototype
3. The use of the BIOPATTERN Grid to assess brain diseases
3.1. Bioprofiling over Grid for Early Detection of Dementia

Dementia is a neurodegenerative cognitive disorder that affects mainly elderly people [5]. At present, several acetylcholinesterase inhibitors can be administered for dementia of the Alzheimer's type, but for maximum benefit early diagnosis is important. Currently, several objective methods are available that may support early diagnosis of dementia. Amongst others, the EEG, which measures the electrical activity of the brain, offers the potential of an acceptable and affordable method for the routine screening of dementia in the early stages. Using current clinical criteria, the delay between the actual onset and the clinical diagnosis of dementia is typically 3 to 5 years. A limitation of current objective methods is that diagnosis is largely based on group comparisons, i.e. attempting to separate individuals into groups (Normal, Alzheimer's disease (AD), Parkinson's, etc.). An alternative to this is individualized care through subject-specific biodata analysis. Such an approach would allow us, for example, to compute biomarkers which over time would represent the subject's 'bioprofile' for dementia, and to look for trends in the 'bioprofile' that arise over time to detect the possible onset of dementia [6]. Figure 3 illustrates the life of a fictitious individual called Mike, who was born in France and lived in several countries before retiring to the U.K. At the age of 65, Mike is diagnosed with probable AD. To provide an accurate diagnosis, his GP in the UK requires his past and present medical information (bioprofiles), which could be located in databases in several different countries (e.g. the UK and Italy). Additionally, the volume of information stored in the databases would be very large, as these are Mike's lifetime medical records, such as EEG, MRI and clinical information. Furthermore, analysis of the data would usually entail the use of complex algorithms which could take several hours to complete and could be held at various centres. Using the grid to provide seamless access to geographically distributed data and to high computational resources for complex analysis and data storage, more accurate and efficient diagnosis can be achieved.
[Figure: Mike's residence timeline - France (ages 0-20), Germany (20-40), Italy (40-60), U.K. (from 60).]
Figure 3. Mike’s Life Itinerary
To illustrate the concept of bioprofiling over Grid for early detection of dementia, a hypothetical patient pool consisting of 400 subjects, each with three EEG recordings, was created. These data are a hypothetical representation of recordings taken at three time instances, akin to the longitudinal studies carried out in reality. Each dataset consists of 21 channels of recording and is 1.3 Mbytes; the recording duration is 4 minutes and the sampling rate is 128 Hz. The datasets are distributed across the TSI, TUT and UOP sites. The FD analysis algorithm is used to compute the FD of each dataset. Through the portal, a GP can select a patient, e.g. Mike, and the algorithm is used to perform the analysis. Upon submission, Mike's information, including his previous medical records held at TSI and TUT, is retrieved and analyzed. Results are then returned to the user in near-real-time and can be visualized using, for example, canonograms.
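The paper does not specify which fractal dimension estimator the FD algorithm uses; a common choice for EEG is Higuchi's method, sketched below for a single channel. The class name and the choice of kMax are illustrative assumptions:

    // Higuchi's fractal dimension of a 1-D signal (e.g. one EEG channel).
    // The FD is the slope of log(mean curve length L(k)) versus log(1/k).
    public final class HiguchiFD {

        public static double estimate(double[] x, int kMax) {
            int n = x.length;                     // kMax should be << n
            double[] logInvK = new double[kMax];
            double[] logL = new double[kMax];
            for (int k = 1; k <= kMax; k++) {
                double lk = 0.0;
                for (int m = 0; m < k; m++) {     // k shifted sub-series
                    int steps = (n - 1 - m) / k;  // points reachable from offset m
                    if (steps == 0) continue;
                    double length = 0.0;
                    for (int i = 1; i <= steps; i++) {
                        length += Math.abs(x[m + i * k] - x[m + (i - 1) * k]);
                    }
                    // Higuchi's normalisation of the curve length
                    lk += length * (n - 1) / ((double) steps * k * k);
                }
                lk /= k;                          // average over the k offsets
                logInvK[k - 1] = Math.log(1.0 / k);
                logL[k - 1] = Math.log(lk);
            }
            return slope(logInvK, logL);
        }

        /** Least-squares slope of y against x. */
        private static double slope(double[] x, double[] y) {
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            int n = x.length;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
            }
            return (n * sxy - sx * sy) / (n * sxx - sx * sx);
        }
    }

For a 4-minute, 128 Hz channel (30,720 samples), HiguchiFD.estimate(channel, 8) returns a value between 1 (a smooth curve) and 2 (a highly irregular curve); a downward trend over successive recordings would be consistent with the reduced brain activity discussed below.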
Figure 4. Canonograms Showing the Distribution of FD Values from EEG Analysis for Patient ‘Mike’
Changes in the EEGs indicating Mike's condition are shown in the canonograms of Figure 4. The canonograms (from left to right) show the FD value (or index) of Mike's EEG taken at time instances 1 (data at TSI), 2 (data at TUT) and 3 (data at UOP), respectively. The FD value for the left canonogram indicates Mike in a normal condition with high brain activity, whereas the FD value for the right canonogram indicates probable Alzheimer's disease with low brain activity; the middle one shows the stage in between. The changes (or trends) in the FD values provide some indication of the disease progression. This can help clinicians to detect dementia at an early stage, and to monitor its progression and response to treatment.

3.2. The use of the BIOPATTERN Grid to analyze MVEP for brain injuries

An important goal in this study is to generate electrophysiological markers which can be used to assess brain injuries on an individual basis by analysing evoked potentials (EPs) buried in raw EEG recordings. The idea is to use ICA methods to reveal single-trial evoked potential activity of clinical interest and to discard irrelevant components (e.g. those due to background EEG and artefacts). The generation of evoked potentials involves two phases: an encoding phase, during which each subject is asked to memorize 10 simple pictures, each presented for 2 seconds, followed by a retrieval phase, during which 20 pictures (10 from the previous phase and 10 new ones) were presented and subjects were asked to indicate whether they had seen each image before or not. This bi-phase process is repeated three times. Finally, 20 pictures from the two phases are presented to the subjects. There were a total of 30 trials per subject in the encoding phase. Figure 5 illustrates the overall structure of the ICA-LORETA (LOw Resolution Electromagnetic TomogrAphy) method [9][10] that is used to analyse the EEG and EPs. The extended version of the Infomax ICA algorithm was used [7], and ICA was applied on concatenated single-trial recordings for each patient [8]. Then, the components that contributed strongly to the EP, judged from the average of all trials, are selected. Extracting components containing event-related activity simplifies the problem of source localization and allows more accurate estimation of the brain regions involved in the task. We have used the spherical infinite homogeneous conductor model [9]. To validate our conclusions, the methodology will need to be tested on a large number of cases.
Figure 5. Schematic Representation of EEG Analysis
Such a scenario makes analysis on a conventional computer platform unrealistic. Thus, Grid computing seems to be the alternative solution for the provision of the required computing resources through a Virtual Organization. Data for such analyses are likely to be collected and stored at different centres. The plan is to store these in grid-enabled databases to facilitate remote access and to speed up analysis. We have integrated an ICA-LORETA algorithm into the Grid algorithm pool (located at UOP). Via the Grid Portal, the algorithm can be selected to analyze the MVEP for brain injuries. This is implemented as compiled Matlab scripts using the Matlab run-time library on the Grid nodes. Figure 6 shows an example of the results of such an analysis via the portal, with topography maps (only for the first two ICA components) of one normal subject (top) and one patient with brain injury (bottom). The head model shown is assumed to be an 8 cm sphere and is represented with 8 slices (the lowest slice is on the left and the highest on the right).
Figure 6. Topography Maps of Normal Subjects and Patients with Brain Injury
Figure 7. Architectural Overview of the AmI-GRID Platform
4. Services for the BIOPATTERN Grid
4.1. Remote Data Acquisition Services for BIOPATTERN Grid

An important service for the BIOPATTERN Grid is remote, automated data acquisition. The work described here is aimed at integrating the BIOPATTERN Grid and the AmI-GRID platform. AmI-GRID, a complementary activity within the NoE, aims to provide a framework for automated data acquisition, management and exchange based on the Ambient Intelligence paradigm. The challenge is to effectively share and process medical data by exploiting the resources and the capabilities of both the AmI system and the GRID-based infrastructure. The AmI system is composed of an integrated platform able to collect data acquired by Remote Data Acquisition (RDA) systems and to distribute data to registered devices (e.g. storage devices, analysis tools, etc.). The system allows not only automated acquisition and monitoring, but also storage of data in medical RDA environments and integration of the related clinical information into the EHR (Electronic Health Record). The advantages of this heterogeneous platform are manifold: a) to acquire real-time data from patients that can be automatically monitored and processed by software tools specifically developed for this purpose; b) to integrate data acquired from heterogeneous sources into a dedicated EHR; c) to perform additional on-line and off-line processing on the information available (the GRID infrastructure guarantees ubiquitous access, security, transparency, robustness, authentication, tracing, etc.). As shown in Figure 7, the AmI system consists of environments permeated with non-invasive wireless sensor networks (WSN) implementing an intelligent RDA system, and a set of Monitoring Devices connected through a particular communication infrastructure with a bus-like topology: the AmI Bus. After a device has been connected to the bus, it is able to publish an addressing scheme that enables all other devices to communicate with it. The devices can be subdivided
[Figure: the AmI framework and the GRID infrastructure connected through a Bridge Node, whose engine is responsible for choosing the GRID services, exporting the AmI data, security and privacy, and results management; data acquired using the WSN are analysed by the GRID nodes, and GRID services (Globus, Unicore, ...) retrieve data for specific elaborations via a GRID access provider and GUI.]

Figure 8. Architectural Overview of the Bridge Node
into Manager Devices and Client Devices. The Client Devices are used to enable human operators to access the functionalities offered by the AmI system (e.g. monitoring devices responsible for notifying alarms and for displaying the data produced by the RDA). The Manager Devices are in charge of managing AmI resources and processes (e.g. the Wireless Sensor Network Manager, responsible for managing the addition, update and removal of wireless RDA elements). This view allows evolution from a database-centric perspective to a totally distributed architecture which provides abstraction, automatic composition, scalability and evolution (for more details see [11]). The connection between the AmI platform and the BIOPATTERN GRID guarantees the sharing of the acquired data among a wide community of users, as well as the possibility of automatically processing the new data by using a large spectrum of technologies. This way, it is possible for the GRID nodes to have a detailed view of both the AmI platform status and the available data, and possibly also to interact with the RDA system by using the adequate services provided by the platform itself. The interaction between the AmI platform and the GRID infrastructure is mediated by a Bridge Node responsible for the interaction with the GRID network, i.e. for requesting specific services on the acquired data (e.g. processing algorithms or simply storage services) and for providing the functionalities offered by the framework to the GRID community. This node will also be responsible for issues such as the security and privacy of data acquired through the RDA, thus allowing, e.g., the anonymisation of personal information. The architecture of the Bridge Node is shown in Figure 8. The workflow of this system can be seen from two different points of view: the AmI platform acquires data using the RDA system and exploits the services of the GRID infrastructure to perform analysis on the data; or a node of the GRID infrastructure needs to find clinical data stored in the database of the AmI framework for specific elaborations (KDD, epidemiological studies, etc.). In the first case the Bridge Node is responsible for the interaction with the GRID middleware, and for the location and exploitation of the required services. In the second case, the AmI platform represents a special repository of data from which a large amount of information can be obtained (integration with the EHR); the GRID middleware then needs to interact with the AmI framework in order to verify the availability of data for the desired elaborations. As discussed in Section 3, individualization of care through subject-specific analysis would significantly advance the early detection of brain disorders. In this context, the AmI framework will significantly enhance the possibility of personalising the diagnosis to the specific case by implementing a special environment in which patients
can be monitored continuously over long periods, with the measured parameters tailored to individual subjects according to their “bioprofile”.

4.2. Crawling Services for BIOPATTERN Grid

A Distributed Focused Crawling service surfs the Web and gathers documents on a specific topic. It can be used like a search engine to obtain more information faster, by narrowing its crawl to specific subjects. For a particular topic, the focused crawler results contain many more relevant, specific documents than the collection returned by a generic search engine. An important advantage of a Web Crawling System deployed on a GRID stems from the fact that such a service would be offered to individuals entitled to access the highly distributed computational power of a GRID, eliminating the need for a central authority/repository such as a unique search engine. On the other hand, it is foreseeable that individuals would employ a GRID crawler differently from general-purpose search engines, using it to maintain a bookmark of resources on a persistent and topic-specific interest, rather than to answer sporadic queries. In the future, we expect that HPC Grid applications which exploit more complex parallelism patterns will be required to sustain an agreed QoS. To obtain this goal, applications will need to change dynamically the set of resources used for their execution; this requires a new generation of application launchers with the ability to interact with the application and the underlying Grid resources. The Distributed Focused Crawling Service may be considered part of this scenario. It has been implemented using ASSIST, an HPC programming environment that provides language constructs to express adaptable and reconfigurable components [12]. The ASSIST programming environment provides a set of integrated tools to address the complexity of Grid-aware application development. Its aim is to offer Grid programmers a component-oriented programming model, in order to support and enforce reuse of already developed code in other applications, as well as to enable full interoperability with existing software, both parallel and sequential, available either in source or in object form. The adopted system architecture model is distributed and object-oriented, using a multi-tier implementation approach as the programming model. The main core of the system is the second tier (Figure 9); it takes as input the query string provided by the user and sets off a GRID-distributed crawling search on the Web. The result of this search is a set of links relevant to the user's query, which are stored in the Cache module (representing the third tier of the architecture). The computationally demanding part (e.g. graph algorithms) is implemented in C++ using the ASSIST framework and tools to distribute the elaboration in a high-performance environment: when new resources are required in the Web exploration phase, the distributed modules can each asynchronously query a (potentially large) number of web services that either perform a fresh acquisition or retrieve a stored copy of the document. The Distributed Focused Crawling Service has been deployed and tested on a Fedora Core 1 cluster consisting of 34 nodes connected via Gigabit Ethernet and tuned for high performance. For the Web exploration phase, a variable number of web services, based on Axis 1.3, have been placed on the network. Axis is a reliable and stable open-source implementation of the SOAP protocol and is the base on which to implement Java Web Services.
The Distributed Focused Crawling Service has been developed under the Grid.it Project 2003-2005 [13], and its deployment on the BIOPATTERN GRID will provide an HPC application to assist the medical community by increasing the efficiency, accuracy and speed of the information retrieval
[Figure: three-tier architecture of the crawling system - a first tier with HTML input/result pages served by a Servlet; a second tier with the ASSIST modules (InitializerSeqModule, CrawlerParModule with Classifier/Analyser, and FetcherParModule), seeded via Google and exploring the Web; and a third tier with the Cache storing the focused crawling results.]

Figure 9. Architecture of the Crawling System
process [14]. Due to architectural issues (i.e. the parallel modules share memory usage), it is not possible to distribute the computation directly on the BIOPATTERN GRID nodes (UOP, TSI, Pisa, TUT and Synapsis), because the integration of the ASSIST framework with Globus Toolkit 4 is still under development by the Grid.it team. The first step of the deployment will involve the transformation of the application into a Grid Service available through the Grid portal. In the first instance, the Grid Service will be localized in the Pisa node (Figure 2) to exploit the connection with the clusters that execute the modules and provide the results. Subsequently, other instances could be placed on other BIOPATTERN Grid nodes with compatible computational resources.
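The essence of focused crawling can be conveyed in a few lines: fetch a page, score it against the topic, and follow its links only if the page is relevant. The single-threaded sketch below is illustrative only; the keyword scoring, the limits and the seed URL are assumptions, and the actual service distributes this loop over ASSIST modules and web services:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.*;
    import java.util.regex.*;

    // Minimal focused-crawler sketch: expand only pages relevant to the topic.
    public class FocusedCrawler {
        private static final Pattern LINK =
                Pattern.compile("href=\"(http[^\"#]+)\"", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            List<String> topic = List.of("dementia", "eeg", "biomarker");
            Deque<String> frontier = new ArrayDeque<>(List.of("https://example.org/"));
            Set<String> seen = new HashSet<>(frontier);
            HttpClient http = HttpClient.newHttpClient();

            for (int fetched = 0; fetched < 50 && !frontier.isEmpty(); fetched++) {
                String url = frontier.poll();
                HttpResponse<String> resp = http.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                long score = topic.stream()
                        .filter(resp.body().toLowerCase()::contains).count();
                if (score >= 2) {                    // relevant: report and expand
                    System.out.println(score + "  " + url);
                    Matcher m = LINK.matcher(resp.body());
                    while (m.find() && frontier.size() < 500) {
                        if (seen.add(m.group(1))) frontier.add(m.group(1));
                    }
                }
            }
        }
    }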
5. Conclusions

The BIOPATTERN Grid aims to provide a Grid-enabled network within the BIOPATTERN Consortium to facilitate the secure and seamless sharing of bioprofile databases and to support the acquisition and analysis of bioprofiles to combat major diseases on an individual basis. This is an ongoing project and the results presented here are preliminary. In the future, the Grid prototype will be extended to include more resources (e.g. more grid nodes, clinical data, algorithms and computing resources) and more applications and services. Due to the nature of healthcare, the BIOPATTERN Grid will need to address several issues before it can move from research prototype to actual clinical tool. These include regulatory, ethical, privacy and security, and quality-of-service issues. However, this should not prevent us from looking into the future possibilities of e-healthcare with Grid computing.
Acknowledgement

The authors would like to thank Dr. C. Bigan from EUB, Romania for providing the original EEG data sets and Dr. G. Henderson for providing the algorithms for computing the FD index. We acknowledge the financial support of the European Commission (The BIOPATTERN Project, Contract No. 508803) for part of this work.
References [1]. S. R. Amendolia, F. Estrella, C. D. Frate, J. Galvez, W. Hassan, T. Hauer, D. Manset, R. McClatchey, M. Odeh, D. Rogulin, T. Solomonides and R. Warren, “Development of a Grid-based Medical Imaging Application”, Proceedings of Healthgrid 2005, from Grid to Healthgrid, 2005, pp. 59-69. [2]. S. Lloyd, M. Jirotka, A. C. Simpson, R. P. Highnam, D. J. Gavaghan, D. Watson and J. M. Brady, “Digital mammography: a world without film?”, Methods of Information in Medicine, Vol. 44, No. 2, pp. 168-169, 2005. [3]. J. S. Grethe, C. Baru, A. Gupta, M. James, B. Ludaescher, M. E. Martone, P. M. Papadopoulos, S. T. Peltier, A. Rajasekar, S. Santini, “Biomedical Informatics Research Network: Building a National Collaboratory to Hasten the Derivation of New Understanding and Treatment of Disease”, Proceedings of Healthgrid 2005, from Grid to Healthgrid, 2005, pp. 100-109. [4]. I. Foster, “Globus Toolkit Version 4: Software for Service-Oriented Systems”, Proceedings of IFIP International Conference on Network and Parallel Computing, 2005, pp. 2-13. [5]. D.S. Knopman, S.T. DeKosky, J.L. Cummings, H. Chui, J. Corey-Bloom, N. Relkin, G.W. Small, B. Miller and J.C. Stevens, “Practice parameter: diagnosis of dementia (an evidence-based review): report of the quality standards subcommittee of the American Academy of Neurology”, Neurology, Vol. 56, No. 9, pp. 1143-1153, 2001. [6]. G. T. Henderson, E. C. Ifeachor, H. S. K. Wimalartna, E. Allen and N. R. Hudson, “Prospects for routine detection of dementia using the fractal dimension of the human electroencephalogram”, MEDSIP00, pp. 284-289, 2000. [7]. T-W Lee, M. Girolami, T.J. Sejnowski, “Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources”, Neural Computation, 1999; 11(2): 606-633. [8]. T-P Jung, S. Makeig, M. Westerfield, J. Townsend, E. Courchesne, T. J. Sejnowski, “Removal of eye activity artifacts from visual event-related potentials in normal and clinical subjects”, Clinical Neurophysiology, 111 (2000) 1745-1758. [9]. J. C. Mosher, R. M. Leahy and P. S. Lewis, “EEG and MEG: Forward Solutions for Inverse Methods”, IEEE Transactions on Biomedical Engineering, Vol. 46, No. 3, March 1999. [10]. R. D. Pascual-Marqui, “Review of methods for solving the EEG inverse problem”, International Journal of Bioelectromagnetism, 1999, 1: 75-86. [11]. M. Lettere, D. Guerri, R. Fontanelli, “Prototypal Ambient Intelligence Framework for Assessment of Food Quality and Safety”, 9th Int. Congress of the Italian Association for Artificial Intelligence (AI*IA 2005) - Advances in Artificial Intelligence, pp. 442-453, Milan (Italy), Sep. 21-23, 2005. [12]. M. Aldinucci, M. Danelutto, A. Paternesi, R. Ravazzolo and M. Vanneschi, “Building Interoperable Grid-aware ASSIST Applications via Web Services”, Parallel Computing Conference, Sep. 2005. [13]. Grid.it: “Enabling Platforms for High-Performance Computational Grids Oriented to Scalable Virtual Organizations”, http://grid.it/. [14]. K. Cerbioni, E. Palanca, A. Starita, F. Costa, P. Frasconi, “A Grid Focused Community Crawling Architecture for Medical Information Retrieval Services”, 2nd Int. Conf. on Computational Intelligence in Medicine and Healthcare, CIMED’2005.
SARS Grid—An AG-Based Disease Management and Collaborative Platform
Shu-Hui Hung (NCHC), Tsung-Chieh Hung (Wei-Gong Memorial Hospital), Jer-Nan Juang (NCHC)
National Center of High-Performance Computing, 7, R&D VI, Hsinchu, Taiwan 300
Abstract
This paper describes the development of the NCHC’s Severe Acute Respiratory Syndrome (SARS) Grid project—an Access Grid (AG)-based disease management and collaborative platform that allowed SARS patients’ medical data to be dynamically shared and discussed between hospitals and doctors using AG’s video teleconferencing (VTC) capabilities. During the height of the SARS epidemic in Asia, SARS Grid and the SARShope website significantly curbed the spread of SARS by helping doctors manage the in-hospital and in-home care of quarantined SARS patients through medical data exchange and the monitoring of patients’ symptoms. Now that the SARS epidemic has ended, the primary function of the SARS Grid project is that of a web-based informatics tool to increase public awareness of SARS and other epidemic diseases. Additionally, the SARS Grid project can be viewed and further studied as a model of epidemic disease prevention and containment.
Keywords: Video teleconference (VTC), Severe Acute Respiratory Syndrome (SARS), Grid, epidemic disease, Grid technology, SARS Grid, Access Grid (AG), e-health
1. Introduction
In early 2003, the world’s population began to realize that it was facing a very real and terrifying threat to its health and well-being. The global SARS outbreak developed rapidly and dramatically, first across Asia and then over to the North American continent. As more and more cases of this bizarre new life-threatening disease were discovered, a slight panic began to set in worldwide. The sheer speed at which the SARS outbreak was spreading also greatly alarmed the world’s top medical experts. There had been no disease of this magnitude or severity in recent medical history, and there simply was no historical information on which to base the search for a cure. By April 2003, the severity of the SARS outbreak in Taiwan had begun to overwhelm the local health infrastructure. The panic over the SARS outbreak in Taiwan became so intense that, by the end of April, the federal government made the
difficult decision to quarantine over 100 staff and patients at a local Taipei hospital. The entire hospital was quarantined due to an unusually high incidence of its occupants coming down with SARS-related symptoms. To make matters worse, the Taipei City Government’s Department of Health decided to try to bring back all the visitors who had passed through the hospital during the previous two weeks and test them for the SARS virus. Unfortunately, the people who had visited the hospital were extremely reluctant to return for testing for fear of also being quarantined. It was later documented that some of the people who had visited the hospital had, in fact, contracted the SARS virus while there but had not returned for testing, treatment, and quarantine. Due to this unfortunate situation, a second wave of the SARS infection hit Taiwan’s general population. This second wave was believed to have been further spread via the public transportation system, physical contact between infected victims and their family members, and medical staff who became infected. Many of the victims of the second wave of the SARS epidemic also had to be hospitalized, putting a tremendous strain on the already overburdened area hospitals.
2. Methodology
SARShope objectives:
• To increase SARS awareness, including the risks it posed to public health
• To provide quarantined SARS victims with quality real-time medical assistance
• To provide service to and efficiently monitor SARS patients who were quarantined either in their home or hospital
• To promote a system by which hospitals could transfer SARS patients’ medical data among themselves
• To provide a virtual conference room for medical experts to discuss SARS cases
• To promote more effective diagnosis and treatment of SARS victims
• To provide real-time, accurate statistical reporting on the SARS epidemic
AG over the Internet: The SARS Grid infrastructure was composed primarily of AG teleconferencing nodes that were set up inside and outside of quarantined areas (e.g. hospitals and clinics) within Taiwan. These AG nodes allowed doctors to meet in “virtual” discussion rooms (Fig. 2) and discuss, via VTC, different SARS case studies. SARS Grid allowed SARS patients’ medical data, including high-resolution X-rays (Fig. 1), diagnoses, and treatment methods, to be dynamically shared between hospitals and doctors over the AG.
Figure 1: AG online discussion room and x-ray sharing
Figure 2: The AG concept
3. Conceptual Framework
The SARShope website was designed to help contain the spread of the SARS virus by providing public education, patient diagnosis and treatment, monitoring, and management. For the home-bound SARS patient, the SARShope website also helped provide residential isolation control and included an automatic body temperature monitoring function. If a home-bound SARS patient input data indicating that his health condition was worsening, his local health center would receive an alert requiring patient follow-up. The local health center could then arrange, if necessary, for an ambulance to pick up the patient and transfer him to a hospital for treatment. The SARShope website also incorporated a function that alerted the patient’s doctor while, at the same time, sending the patient’s detailed medical history. If the doctor needed additional assistance diagnosing the case, he could easily call a meeting with doctors from other hospitals using the AG’s VTC feature.
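To make the monitoring logic concrete, the following is a minimal Java sketch of the kind of rule-based alerting just described. It is illustrative only: the class names (HomeReport, HealthCenterGateway) and the 38 °C fever threshold are assumptions of this sketch, not details of the SARShope implementation.

// Hypothetical sketch of SARShope-style home monitoring alerts.
// All names and the threshold are illustrative assumptions.
public class HomeMonitor {
    static final double FEVER_THRESHOLD_C = 38.0; // assumed screening threshold

    public void process(HomeReport report, HealthCenterGateway gateway) {
        // A report counts as "worsening" if the temperature crosses the
        // threshold or the self-reported symptom score increases.
        boolean worsening = report.bodyTemperatureC() >= FEVER_THRESHOLD_C
                || report.symptomScore() > report.previousSymptomScore();
        if (worsening) {
            // Alert the local health center for patient follow-up and
            // forward the patient's detailed medical history to the doctor.
            gateway.alertHealthCenter(report.patientId(), report);
            gateway.notifyDoctor(report.patientId(), report.medicalHistory());
        }
    }
}

record HomeReport(String patientId, double bodyTemperatureC,
                  int symptomScore, int previousSymptomScore,
                  String medicalHistory) {}

interface HealthCenterGateway {
    void alertHealthCenter(String patientId, HomeReport report);
    void notifyDoctor(String patientId, String medicalHistory);
}

In a real deployment the threshold and symptom scoring would follow whatever clinical screening criteria were in force at the time.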
The local health centers were able to use the SARShope website to help manage quarantined rooms and to update patients’ medical information. Based on the information submitted by the local health centers, the areas with the highest incidence of the SARS virus could be easily identified and, therefore, avoided (Fig. 3). During the SARS outbreak, the NCHC set up seven AG nodes at local health centers, medical centers, and SARS-dedicated hospitals (Fig. 4).
Figure 3: SARS Grid flow chart
Figure 4: AG nodes in Taiwan
4. System Description
The SARShope website included the patient’s personal information, his ten-day contact history (i.e. the individuals the patient had come in contact with over the previous ten days), and medical information including medical images (Fig. 5). The SARShope website also contained many management functions to help minimize the spread of the SARS virus. These functions tracked the number of suspected and confirmed SARS cases, the number of quarantined rooms in hospitals and clinics, and the number of home-bound SARS victims (Fig. 6).
Figure 5: Patient’s contact history
Figure 6: Patient pathology
5. Conclusion
The SARS Grid platform and SARShope website offered medical professionals an effective and convenient way to diagnose, treat, and monitor quarantined hospitalized and home-bound SARS victims. The SARShope website also helped to curb the spread of the SARS virus by increasing public awareness of the SARS epidemic.
Part IV Knowledge Discovery on HealthGrids
A Secure Semantic Interoperability Infrastructure for Inter-Enterprise Sharing of Electronic Healthcare Records
Mike Boniface, E. Rowland Watkins, Ahmed Saleh, IT Innovation Centre, University of Southampton, 2 Venture Road, Chilworth Science Park, Southampton SO16 7NP, UK
Asuman Dogac, Software Research and Development Center, Middle East Technical University, Turkey
Marco Eichelberg, Kuratorium OFFIS e. V., Oldenburg, Germany
Corresponding author: Mike Boniface, tel: +44 23 8076 0834, fax: +44 23 8076 0833, email: mjb@it-innovation.soton.ac.uk
Abstract. Healthcare professionals need access to accurate and complete healthcare records for effective assessment, diagnosis and treatment of patients. The non-interoperability of healthcare information systems means that inter-enterprise access to a patient’s history over many distributed encounters is difficult to achieve. The ARTEMIS project has developed a secure semantic web service infrastructure for the interoperability of healthcare information systems. Healthcare professionals share services and medical information using a web service annotation and mediation environment based on functional and clinical semantics derived from healthcare standards. Healthcare professionals discover medical information about individuals using a patient identification protocol based on pseudonymous information. The management of care pathways and access to medical information is based on a well-defined business process allowing healthcare providers to negotiate collaboration and data access agreements within the context of strict legislative frameworks. Keywords. Healthcare information systems, electronic healthcare records, security, semantic interoperability, web services, P2P
1. Introduction
The objective of the ARTEMIS project is to develop a semantic web service based interoperability infrastructure for healthcare information systems [3]. A key challenge for the interoperability of information systems in the healthcare domain is the localisation of and access to Electronic Healthcare Records (EHR) across healthcare organization boundaries. The complexity of the problem is increased by the absence of a globally accepted standard for EHRs, the absence of unique patient identifiers in most countries, the heterogeneity of systems used in the healthcare domain, the longevity of data and devices, the high availability requirements, and the strict data protection requirements that apply to clinical documents. ARTEMIS aims to raise the quality of healthcare by providing an interoperability infrastructure which will extend healthcare enterprises by making their services available to others; will extend the life of existing systems by exposing previously proprietary functions as web services; and will open new business opportunities by making it easy to connect with other parties in the healthcare domain. The ARTEMIS project takes a different approach to the interoperability of healthcare information systems: we focus on processing in terms of web services rather than on the recording and documentation of electronic healthcare records. Our approach provides a standard way to access data rather than standardizing the data itself. In this paper, we present the core components of the ARTEMIS infrastructure that together support the secure sharing of EHRs. These include a semantic web service annotation and mediation environment, a patient information discovery protocol, and a security and data management infrastructure that allows healthcare organizations to negotiate data access agreements and participate in care pathways.
2. Semantic interoperability of Electronic Healthcare Records
Healthcare organizations maintain Electronic Healthcare Records (EHR) of patients’ treatments containing data on encounters, laboratory test results, diagnosis reports and prescriptions. The EHR is generally stored within a healthcare information system, which is typically tailored to satisfy the needs and medical function of individual institutions. EHR standards exist to support the interoperability of healthcare information systems, including HL7 CDA (Clinical Document Architecture) [13], GOM (GEHR Object Model) [11] and CEN’s ENV 13606 [5]. The quantity and complexity of these standard information models mean that interoperability between EHRs is rarely achieved, as compliance can be interpreted in many ways. This results in heterogeneous distributed EHRs, with systems supporting different message structures. The ARTEMIS infrastructure addresses interoperability of Electronic Healthcare Records using semantic web services, each of which operates on part of a patient’s EHR. Each web service is semantically annotated both in terms of its functionality and of the clinical data that it processes. This allows each healthcare organization to describe the semantics of a service’s operation and the message structure required by its healthcare information system. The annotations are based on extensions to OWL-S [21], using ontologies derived from existing healthcare standards. Example
functional and clinical concept ontologies are shown in Figure 1 and Figure 2. Although the message structures may differ between organizations, automatic mediation can be achieved through structural and semantic mappings to clinical concept ontologies. Healthcare information systems vendors implement web services using existing toolkits such as Axis [2] and .NET [17], in accordance with the WS-I Basic Profile [27] and WS-I Basic Security Profile [28]. Developers then use ARTEMIS’s integration and annotation environment to develop semantic service descriptions. These semantic service descriptions are used by the ARTEMIS mediation service to translate between the different message structures of the service consumer and service provider during web service invocation. First, a developer creates an OWL message ontology from the web service input and output message types. This process normalizes the XML schema definitions found in a WSDL document to an OWL representation. The message ontology is then mapped to a standards-based clinical concept ontology using the OWLmt ontology mapping tool [20]. The mapping tool allows the expression of formal similarities between object properties and of transformation functions for data type properties. These functions support complex structural transformations, allowing data type properties to be aggregated, de-aggregated or converted using appropriate algorithms from source to target individuals. During service invocation, an instance of the target ontology concept is created for each source concept when the two concepts are related via a similarity mapping. In the same way, an AttributeTransformation defines the equivalence between source and target data type properties. Once the relationships between two ontologies are defined through ontology mappings, the individuals of the source ontology can be transformed into target ontology individuals by evaluating the semantic mappings.
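The following is a minimal sketch of this mediation step, assuming a simplified representation of ontology individuals as property maps; it is not the OWLmt API, and all class and method names (MessageMediator, addAttributeTransformation, mediate) are illustrative assumptions.

import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified sketch of semantic message mediation (not the OWLmt API):
// individuals are represented as property-name -> value maps, and each
// registered AttributeTransformation maps a source property to a target
// property plus a value transformation function.
class MessageMediator {
    private final Map<String, SimpleEntry<String, Function<Object, Object>>> transforms = new HashMap<>();

    void addAttributeTransformation(String sourceProperty, String targetProperty,
                                    Function<Object, Object> transformation) {
        transforms.put(sourceProperty, new SimpleEntry<>(targetProperty, transformation));
    }

    // Create a target-ontology individual from a source individual by
    // evaluating the registered mappings, as described in the text above.
    Map<String, Object> mediate(Map<String, Object> sourceIndividual) {
        Map<String, Object> targetIndividual = new HashMap<>();
        sourceIndividual.forEach((property, value) -> {
            SimpleEntry<String, Function<Object, Object>> mapping = transforms.get(property);
            if (mapping != null) {
                targetIndividual.put(mapping.getKey(), mapping.getValue().apply(value));
            }
        });
        return targetIndividual;
    }
}

A mediator of this shape could, for instance, register a transformation that concatenates separate given- and family-name fields of one message structure into the single name property expected by another, mirroring the aggregation of data type properties described above.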
Figure 1: Example functionality ontology based on HL7 trigger events
Figure 2: Example clinical concept ontology based on CEN
3. Discovering patient information
A key challenge for the cross-enterprise exchange of EHRs is the absence of a unique identifier that would allow records pertaining to a particular person to be identified. In ARTEMIS we have developed a protocol that identifies the locations of patient records for a given patient and, where access is granted, allows these records to be retrieved in compliance with the legal and technical requirements. The protocol combines cryptographic techniques with semantic annotation and mediation. It guarantees that the request itself cannot be interpreted by any party involved, i.e. no plaintext information can be decoded or derived from the request or the responses to it, except for a probability that requestor and provider hold identifying information pertaining to the same person. A result returned by the protocol identifies healthcare providers as holders of the desired records for a patient and provides so-called candidate IDs, each of which identifies the patient only within the hospital that provided it. Most clinical records are still kept and maintained at the place of their creation instead of being held in central repositories that provide access to different parties in the healthcare domain. The medical records of a patient who needs long-term treatment may be located at one or more family doctors’ practices, several specialists, labs and a number of hospitals. In particular, the patient may not even be aware of all locations where records relevant to a particular medical problem are kept. Locating medical records is complicated by the fact that there is no unique patient identifier that could be broadcast as a query in order to locate information pertaining to one patient. While countries such as Turkey, Norway, or Sweden maintain a national person identifier that is commonly used as the index key for medical records, no such unique identifier is available in most other countries, for historic reasons or due to data protection regulations. This means that a query applicable to cross-border
healthcare delivery can only be based on the patient demographics that are commonly available, such as:
• patient’s given and family name,
• patient’s name at birth,
• date and place of birth,
• sex,
• nationality,
• postal address.
Additionally, it should be noted that the set of demographics available may depend on the location (e.g. a national patient identifier would certainly be included in any query within a country in which it is valid) and on the patient’s health condition, that is, whether or not the patient is able to provide the doctor with additional information not contained in the passport or driving license, which may be the only source of information available for an emergency patient. It should also be noted that typing errors, phonetic misunderstandings and ambiguity, differing information, and missing information in medical record archives are not uncommon and may need to be accounted for, using for example phonetic encoding techniques. An additional challenge for cross-border application is the different character sets used in different European countries. Since medical records are generally considered to be sensitive personal information, it would be neither appropriate nor legal for a healthcare enterprise to allow third parties to browse through the demographics of the local record archive, nor would it be appropriate to send a plaintext request of the form: “hospital X is looking for prior psychiatry records for patient Hans Friedrich Müller, born 1960-12-24 in Hamburg”. Such a query itself communicates information to the recipient that needs to be protected under the applicable data protection rules, namely the facts that Mr. Müller currently receives treatment at hospital X and may have had prior psychiatric treatment. For the purpose of patient identification in the ARTEMIS network we use control numbers along with semantic annotation and probabilistic record linkage, addressing the possible “fuzziness” of demographic data while at the same time preventing a premature and unlimited communication of personal data. Control numbers are a concept used in the epidemiological cancer registries in Germany to allow record linkage of anonymized records that describe cancer cases and are collected independently from multiple sources, as described in [24]. Given a set of control numbers describing a query and a larger number of sets of control numbers describing all patients in a record repository, “matches” in the repository can be identified using record linkage, characterized by [24] as “the methodology of bringing together corresponding records from two or more files or finding duplicates within files”. As mentioned above, we can expect different healthcare providers to use different, though certainly overlapping, sets of control numbers accounting for country- or region-specific aspects such as phonetic encoding or national unique patient identifiers. Since control numbers can only be compared for binary equality and not evaluated in any other way, it is of prime importance for all parties participating in the protocol to understand exactly what each control number means and which control number is supported by which party. The use of ontology-based semantic annotation allows us to introduce the amount of flexibility into the protocol that is needed to make it work in
an international setting where different sets of control numbers might be supported by different actors. Each request or response dataset consisting of a list of control numbers is encoded as a DemographicsOntology using OWL. Since different healthcare providers will typically use different sets of control numbers, more than one DemographicsOntology might exist. If the requestor and a record repository in the protocol use different demographics ontologies, then a direct or indirect mapping between these ontologies is used during the record linkage process.
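As a rough illustration of the control-number idea, each demographic attribute can be normalized and keyed-hashed so that repositories compare values only for binary equality, never as plaintext. This is a simplified sketch under stated assumptions, not the ARTEMIS protocol: the HMAC construction, the shared-key handling and the normalization rule are all assumptions of the example.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Illustrative control-number computation (NOT the ARTEMIS protocol):
// a keyed hash over a normalized demographic attribute, so that two
// parties holding the same key produce binary-equal values for equal
// inputs without revealing the plaintext demographics.
class ControlNumbers {
    static String controlNumber(byte[] sharedKey, String attribute, String value) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA256"));
        // Normalize: trim and upper-case; a real system might also apply
        // phonetic encoding before hashing, as discussed in the text.
        String normalized = attribute + "=" + value.trim().toUpperCase();
        return HexFormat.of().formatHex(mac.doFinal(normalized.getBytes(StandardCharsets.UTF_8)));
    }
}

A repository would precompute such control numbers for all its patients; a query then matches a record when sufficiently many control numbers agree, which is where the probabilistic record linkage described above comes in.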
4. Supporting virtual healthcare organizations
Virtual organizations can be defined as flexible, secure and coordinated resource sharing amongst dynamic collections of individuals, organizations and resources, in order to achieve a common purpose [15]. In healthcare, virtual organizations allow clinical staff and healthcare providers to collaborate with the objective of delivering patient care through the sharing of EHRs. The lifecycle, structure and dynamics of a virtual organization should be defined based on the business needs of its participants. However, existing infrastructure technologies tend to support virtual organizations with constrained characteristics that are not well matched to the healthcare domain. Patient referrals are the key entry point to healthcare systems that allow intra- and inter-enterprise collaboration in the delivery of patient care. Referrals allow care pathways to be created between primary care providers, specialists, labs and other healthcare organizations. For example, Figure 3 shows a typical care pathway in which a primary care provider refers an Accident and Emergency patient to a specialist hospital for specific medical procedures to be performed, attaching relevant clinical information on the patient’s case [13]. Each referral represents a specific pathway through a healthcare organization and consists of a series of administrative and medical tasks. These referrals are implemented within a strict regulatory framework that is enforced to ensure the protection of personal data and outlines the conditions and rules under which processing is allowed. There are many such regulations at European level [6], [10] and additional legislation implemented within member states [8]. According to EU Directive 95/46/EC, if a healthcare provider maintains personal data on its patients, the healthcare provider is identified as a data controller and is responsible for protecting that data against unauthorized use. If a healthcare provider wants to access personal data within another organization, it is identified as a data processor. For the communication between data controller and data processor to occur, consent must be obtained from the patient and a contract between the two parties must exist that defines the scope of access to patient data, including conditions such as what data is to be accessed, what use will be made of the data and how the data will be accessed. Access to EHRs is currently controlled by an out-of-band business process that permits the negotiation of data access agreements. To access EHRs, an external organization has to request access by contacting a data guardian within the trust. The data guardian is an individual responsible for controlling access to patient data across the boundaries of an organization. For requests to access specific patient
records, a data guardian should ensure that patient consent has been given before data is shared. The consent allows a patient to express privacy preferences regarding their data, defining which organizations are authorized to access the data and for what purpose (e.g. the referral context).
Figure 3: Care pathways
The existing business processes show that collaborations between healthcare providers can be dynamic and need to be represented as bi-lateral data access agreements between data controller and data processor. ARTEMIS provides a collaboration infrastructure to support healthcare business processes. The infrastructure builds on a Process Based Access Control (PBAC) component developed for GRIA, a secure web service grid infrastructure for B2B service provision [12]. PBAC provides a means of process-oriented access control to enforce business processes associated with a stateful resource model. Process-based authorization is grounded in the service’s Web Service interface and controls access to web service operations based on: the user making the request, the state of the process referred to by the request, and the operation requested. The ARTEMIS collaboration infrastructure allows healthcare providers to manage the relationships required to participate in care pathways. A collaboration service is provided that supports the negotiation and approval of data access agreements. Once a data access agreement is established, healthcare providers supply a reference to the agreement when accessing EHR-related web services. Access to operations that process data access agreements and associated EHRs is controlled by PBAC policies. The core elements of the data access agreement schema are shown in Table 1; a rough sketch of such an authorization check follows the table.
232
M. Boniface et al. / A Secure Semantic Interoperability Infrastructure
Agreement field          | Type                 | Comment
Data Access Agreement id | String               | Data access agreement id
Subject                  | Candidate Id         | The subject of the data access request (Patient)
Data Controller          | X509 Certificate     | Healthcare organization responsible for maintaining the healthcare information
Data Protection Officer  | X509 Certificate     | The individual that approved the data access request
Data Processor           | X509 Certificate     | Healthcare organization wanting to access the information
Requester                | X509 Certificate     | The individual making the request
Purpose                  | String/Clinical Term | The purpose(s) for which the data will be processed
Other Information        | String               | Any other information necessary, e.g. likely consequences of the processing, and whether they envisage the data being disclosed to a third party
Start Date               | Date                 | The date from which the agreement is valid
Termination Date         | Date                 | The termination date of the agreement
Table 1: Data Access Agreement
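The sketch below illustrates how the three PBAC inputs (caller, process state, requested operation) might combine with the agreement fields of Table 1 in an authorization decision. It is a hypothetical illustration under assumed names, not the GRIA PBAC implementation.

import java.time.LocalDate;
import java.util.Set;

// Hypothetical sketch of a PBAC-style check; not the GRIA implementation.
class PbacPolicy {
    enum AgreementState { PROPOSED, APPROVED, TERMINATED }

    record DataAccessAgreement(String id, AgreementState state,
                               String dataProcessor, Set<String> permittedOperations,
                               LocalDate startDate, LocalDate terminationDate) {}

    // Access requires an approved, in-date agreement that names the caller
    // (e.g. by certificate DN) and covers the requested web service operation.
    boolean authorize(String callerDn, String operation, DataAccessAgreement a, LocalDate today) {
        return a.state() == AgreementState.APPROVED
                && a.dataProcessor().equals(callerDn)
                && a.permittedOperations().contains(operation)
                && !today.isBefore(a.startDate())
                && !today.isAfter(a.terminationDate());
    }
}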
5. Pilot application
An ARTEMIS pilot application is being deployed by healthcare providers located in two European countries to demonstrate the interoperability of healthcare information systems across organizational and country boundaries. The pilot application includes the healthcare providers South East Belfast Healthcare Trust (SEBT) in Belfast, Northern Ireland and Hacettepe Hospital in Ankara, Turkey. Each healthcare provider operates within a distinct legislative domain and has different healthcare information systems to support patient care [7], [23]. The pilot application scenario is based on the treatment of a young person with a variety of behavioral problems. The scenario consists of three healthcare organizations within Belfast, Northern Ireland collaborating in the provision of medical care: a General Practice (GP), a Young Persons Centre (YPC) and SEBT’s community system. The GP provides primary care to patients and is the entry point to the healthcare system. The YPC is a secondary care provider offering a Regional Adolescent Psychiatric Inpatient Service for young people aged 13-18 with mental illness and psychological problems. The centre utilizes a multi-disciplinary team approach to treatment and offers a wide variety of therapeutic interventions with a focus on mid- to long-term treatment. The SEBT community system is a trust-wide IT
service that provides access to and management of Electronic Healthcare Records that can be accessed by different healthcare providers within the trust. The pilot application demonstrates the integration of three healthcare information systems: PARIS [23], CorTTex [7] and Care2X [4]. Medical web services have been developed for functionality such as patient referral, patient information discovery and retrieval of information for display. The web services have been semantically described using the ARTEMIS integration environment, allowing semantic mediation between the different message structures and encoding schemes supported by each of the healthcare providers. For example, the GP uses a proprietary message structure with an HL7 V3.0 [13] Patient Referral ontology using Read Codes [22] to describe diagnosis information. The YPC uses a proprietary message structure with an HL7 V3.0 Patient Referral ontology using ICD10 to describe diagnosis information. All organizations within the scenario have their expertise classified in accordance with the HIPAA Product Taxonomy [14].
6. Conclusion
In this paper, we have presented a secure semantic web service infrastructure for the interoperability of healthcare information systems. The infrastructure provides key components that support the collaboration of healthcare providers maintaining electronic healthcare records in heterogeneous information systems under different domains of control. Connectivity is provided through standard web services and data are integrated through semantic mediation. Healthcare professionals locate patient information using a cryptographic patient identification protocol and obtain authorization to access electronic healthcare records through a data access agreement protocol, all in compliance with data privacy legislation. The infrastructure software development has now been completed and is being evaluated through a series of prototypes by healthcare providers in Northern Ireland and Turkey incorporating clinically representative data. Initial evaluation results show that interoperability based on semantic web service mediation can provide an effective mechanism for integrating disparate healthcare information systems, allowing developers to describe access to legacy functionality at a higher level of abstraction than previous syntactic representations. However, significant improvements in the performance and management of semantic-based technologies will be required before ARTEMIS technology could realistically be adopted by the healthcare market.
Acknowledgements The ARTEMIS project has received research funding from the EC's Sixth Framework Programme (project IST-2103 STP under the eHealth Action Line of the Information Society Technologies Programme).
References
[1] Aden T, Eichelberg M, Thoben W, “A fault-tolerant cryptographic protocol for patient record requests”, Proceedings of EuroPACS-MIR 2004
[2] Apache AXIS, http://ws.apache.org/axis/
[3] Artemis Project, http://www.srdc.metu.edu.tr/webpage/projects/
[4] Care2X, http://www.care2x.com/
[5] CEN TC/251 (European Standardization of Health Informatics) ENV 13606, Electronic Health Record Communication, http://www.centc251.org/
[6] Council Of Europe – Committee of Ministers, Recommendation No. R(97)5 of The Committee Of Ministers to Member States on the Protection Of Medical Data, Council of Europe Publishing, Strasbourg, 12 February 1997
[7] CorTTex, http://www.corttex.nl/
[8] Data Protection Act 1998, UK parliament, http://www.hmso.gov.uk/acts/acts1998/19980029.htm
[9] Dogac, A., Laleci, G., Kirbas, S., Kabak, Y., Sinir, S., Yildiz, A., “Artemis: Deploying Semantically Enriched Web Services in the Healthcare Domain”, Information Systems Journal (Elsevier) special issue on Semantic Web and Web Services, accepted for publication
[10] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, OJ L, 23 Nov. 1995, http://europa.eu.int/comm/internal_market/privacy/law_en.htm
[11] Good Electronic Health Record, http://www.gehr.org
[12] GRIA, http://www.gria.org
[13] Health Level 7 (HL7), http://www.hl7.org
[14] HIPAA Product Taxonomy, http://www.hipaa.org/
[15] I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, http://www.globus.org/research/papers/anatomy.pdf
[16] ISO 14598-1 (1998) Information technology – Software product evaluation – Part 1: General guide. ISO, Geneva.
[17] Microsoft .NET Framework 1.1, http://msdn.microsoft.com/netframework/
[18] OASIS, http://www.oasis-open.org
[19] OWL Web Ontology Language, http://www.w3.org/TR/owl-features/
[20] OWLmt Toolkit, http://sourceforge.net/projects/owlmt/
[21] OWL-S 1.0, http://www.daml.org/services/owl-s/1.0/
[22] Read Codes, http://www.nhsia.nhs.uk/terms/pages/readcodes_intro.asp
[23] PARIS, http://www.in4tek.com/
[24] Thoben, W., Appelrath, H.-J., Sauer, S., “Record linkage of anonymous data by control numbers”, in From Data to Knowledge: Theoretical and Practical Aspects of Classification, Data Analysis and Knowledge Organisation, Oldenburg (1994), pp. 412-419, Springer.
[25] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems (Technical Note). Washington DC, USA: U.S. Bureau of the Census.
[26] WS-Interoperability (WS-I), http://ws-i.org/
[27] WS-I Basic Profile 1.0, http://www.ws-i.org/Profiles/BasicProfile-1.0-2004-04-16.html
[28] WS-I Basic Security Profile 1.0, http://www.ws-i.org/Profiles/BasicSecurityProfile-1.0-2004-05-12.html
Constructing a Semantically Enriched Biomedical Service Space: A Paradigm with Bioinformatics Resources
Vassilis KOUTKIAS 1, Andigoni MALOUSI, Ioanna CHOUVARDA, Nicos MAGLAVERAS
Lab. of Medical Informatics, Aristotle University of Thessaloniki, Greece
1 Corresponding Author: Vassilis Koutkias, PhD, Lab of Medical Informatics, Aristotle University of Thessaloniki, P.O. Box 323, 54124 Thessaloniki, Greece; E-mail: bikout@med.auth.gr.
Abstract. Biomedical applications are becoming increasingly reliant on resource integration and information exchange within global solution frameworks that offer seamless connectivity and data sharing in distributed environments. Resource autonomy and data heterogeneity are the most important impediments towards this potential. Aiming to overcome these limitations, we propose an implementation of the service-oriented model towards the construction of an open, semantically enriched biomedical service space that enables advanced service registration, selection and access capabilities, as well as service interoperability. The proposed system is realised by defining service annotation ontologies and applying software agent technology as the means for service registration, matchmaking and interfacing in a Grid environment. The applicability of the envisioned biomedical service space is illustrated on a set of bioinformatics resources, addressing computational identification of protein-coding genes.
Keywords. Service-oriented model, open environment, Grid services, ontologies, software agents, semantic integration, bioinformatics resources
1. Introduction
A basic characteristic of existing information resources and systems is their availability and accessibility through networks. However, the discovery of information resources or systems that match user-defined criteria is not always trivial. Furthermore, information resources are autonomous and therefore have to be accessed step by step in order to build up the set of information leading to the desired conclusions/results, a fact especially true for biomedical resources. There is a requirement for integrated information environments, incorporating resources able to adequately interact and supply users with semantically described information [1]. Assigning human- and machine-readable meaning to the contents of the resources, covering both the semantics of the data and the applications processing these data, would facilitate the automation of processes and enable the creation of a “knowledgeable” space offering high-quality services, such as the understanding of resource content, improved automation of processes and operations for user support, improved accuracy in classification, search and filtering of
information, embodiment of reasoning schemes, etc. An even more challenging research direction would be the construction of an open and semantically enriched environment for biomedical resources [2].
In this context, the service-oriented computing model, which is based on the principles of service availability through well-defined interfaces and service accessibility by systems, is particularly interesting, specifically regarding the automation and interaction of information resources. In general, a service is an entity supplying functionality to other systems (client applications) via appropriate message exchanges [3]. Thus, services constitute a computational and architectural model suitable for distributed and open systems, due to their modular design, based on formal interface descriptions available via a standard Interface Definition Language (IDL). This model defines and makes transparent the means of interaction among service providers and clients via mechanisms for connectivity, communication, service description and discovery, while hiding other details regarding, for example, the physical location and installation of services (service virtualisation). The definition of such services requires the adoption of mutually accepted protocols for connectivity and information exchange among systems. This model applies to Web and Grid services, both supported by XML-based standards such as WSDL (Web Service Description Language), for the description of the connection types and functionality offered by services, and SOAP (Simple Object Access Protocol), for the exchange of XML messages between clients and services.
Grid computing environments are open, inhomogeneous and highly dynamic, and follow the principles of service-oriented computing, which focuses on the basic infrastructure for coordinated resource sharing among Virtual Organisations to achieve high performance and availability [3]. At this time, the OGSI (Open Grid Services Infrastructure) specifications are being re-factored in the context of WSRF 2 (Web Services Resource Framework). WSRF exploits recent developments in Web services architecture to express concepts and interfaces that were initially proposed in the OGSI specifications. Likewise, Grid services conform to the technical description of WSRF. Nevertheless, the aforementioned standards do not make full use of the service model’s potential and opportunities [4]. The service-oriented architecture offers a framework which can be extended, incorporating more efficient service representations and dynamic workflow management techniques (i.e., service choreographies/orchestrations [5]), thereby coping with services’ heterogeneity and autonomy [1], [2]. It is argued that the presently available Grid computing infrastructure is quite complex and, consequently, several enhancements are required towards semantic interoperability among services, efficient service discovery and utilisation [6], etc.
In this context, we propose an agent-based framework that allows for the dynamic composition of an open and integrated service space via mechanisms for semantic description and registration [7]. Advanced access to services is enabled through ontology-based matchmaking of service requests and appropriate intermediary agents capable of interacting with the services. In the following, we discuss the impact of the service-oriented model in biomedicine and describe the core technologies and components of the proposed framework.
An application paradigm is presented for an ensemble of biological resources addressing computational identification of protein-coding genes.
2 http://www.globus.org/wsrf/
2. The Service-oriented Model in Biomedicine
Despite the obvious advantages of the service-oriented model, Web services have not been widely applied in the field of health service provision or health applications in general, possibly due to the medical industry’s scepticism about adopting additional standards. References on Web service approaches for the health sector are quite sparse. In bioinformatics, on the contrary, the service-oriented model is a highly appreciated technology currently at the forefront of R&D efforts [8], mainly due to the huge volume of data available via distributed sources, the need for interoperability among biological data and computational resources, and the fact that, in contrast to medical resources, most biological information sources offer free access. Among the currently available systems for biological Web services implementation are Soaplab 3 and BioMOBY [9]. Furthermore, the need to design semantic services and orchestration mechanisms for automated workflow formulation and execution has been highlighted [5]. Regarding Grid services, there is significant interest in biological as well as medical applications [10], originating from the powerful computational infrastructure offered. Nowadays, the interest in Grid services is shifting from access to specialised equipment or parallel computing systems to the composition of an integrated and semantically rich cooperative environment enabling complicated scientific experiments or procedures to take place [11] (e.g., large-scale epidemiological studies, genetic analyses, etc.). Looking more closely at the tools and applications in the bioinformatics field, it is obvious that rapid developments are taking place; however, much effort is still required to formulate an integrated, semantically enabled working environment. Most data repositories and analysis tools currently available are accessible through public Web interfaces and typically without any registration requirements. As the amount of annotated biological data grows, biological data analysis becomes more sophisticated and solutions to a wide range of computational problems are addressed by multiple Web resources, constituting a rapidly evolving and competitive environment. These progressively applied enhancements are accompanied by technological efforts that address heterogeneity problems evident both in the vocabulary used and in the lack of a standardised way to define resource requirements. In principle, most bioinformatics resources do not support modularity, and interoperability capabilities have usually not been considered in their specifications. This makes integration processes vulnerable to updates or modifications of any resource incorporated in the integration scheme [12]. Moreover, resource usability and integration are further complicated when addressing computationally intensive biological problems. In specific bioinformatics application areas, the reliability of the information extracted by the algorithmic techniques that underlie data analysis tools is controversial and, therefore, additional evaluation and cross-validation techniques are necessary to improve computer-aided resource performance. Considering the above-mentioned resource features, it becomes evident that migration from conventional Web-based applications to semantically enriched service-oriented architectures will be particularly beneficial in biomedicine, and such a scenario is presented in the following sections.
3 http://www.ebi.ac.uk/soaplab/
3. Core Technologies and System Components
In our approach, the construction of an integrated biomedical service space involves the definition of an appropriate semantic schema that enables the description of services upon registration/addition to the system. Services are registered to the virtual space following the descriptions of the semantic schema, which includes functional and technical service features. The functional features involve the underlying operational model associated with each service and the potential data model applied, while the technical features involve I/O service operations and particular service characteristics. Incorporating functional as well as technical service features in this schema enables advanced selection of service operations upon request, by applying matchmaking procedures between the criteria defined and the registered services. For the definition of the semantic service schema we follow an ontological approach.
Figure 1. Proposed approach for the construction of a semantically enriched biomedical service space
The description of services at the semantic level corresponds to the construction of a “Knowledge-Base of Services” or a “Semantic Directory of Services”. This semantically enriched environment is managed by applying software agent technology. Software agents constitute a favourable approach for heterogeneous system integration and the incorporation of advanced functionalities in open environments like Grids [6], [13], [14]. In particular, in order to enhance the interfacing capabilities of the constructed biomedical service space, we apply software agents as the intermediaries between the services and the external world. Upon registration of a service, a Service-Proxy agent is constructed automatically, capable of interacting with the service by translating SOAP messages, dominant in the service-oriented environment, to ACL (Agent Communication Language) messages and vice versa. ACL messages encapsulating service operation invocations are generated by a dynamically constructed Service-Client agent. Agent messages exchanged between the Service-Proxy and the Service-Client agents are encoded in an application ontology, also constructed upon service registration, which incorporates the technical description of the service in terms of its operations’ parameters (type and number) and names, as well as the returned type of data. This shared ontology ensures that terms have clear and consistent semantics.
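A Service-Proxy agent of this kind can be sketched in JADE (the toolkit named in Section 4.5) roughly as follows. The SoapEndpoint stub and the pipe-delimited content convention are assumptions of this sketch, standing in for the WSDL-derived stub and the shared application ontology of the real system.

import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

// Illustrative Service-Proxy sketch: translates ACL requests from a
// Service-Client agent into SOAP invocations and returns the result
// as an ACL reply. The SoapEndpoint stub is a placeholder assumption.
public class ServiceProxyAgent extends Agent {
    interface SoapEndpoint { String invoke(String operation, String payload); }

    private SoapEndpoint endpoint; // bound when the proxy is generated

    @Override
    protected void setup() {
        endpoint = (SoapEndpoint) getArguments()[0];
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage request = myAgent.receive();
                if (request == null) { block(); return; }
                // Content is assumed to follow the shared application
                // ontology, here flattened to "operation|payload".
                String[] parts = request.getContent().split("\\|", 2);
                String result = endpoint.invoke(parts[0], parts.length > 1 ? parts[1] : "");
                ACLMessage reply = request.createReply();
                reply.setPerformative(ACLMessage.INFORM);
                reply.setContent(result);
                myAgent.send(reply);
            }
        });
    }
}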
The conceptualisation for the construction of a uniform, semantically enriched biomedical service space is illustrated in Figure 1. Considering a usage scenario, services are registered by Service Providers (SPs) and the entire service space is managed by an Administrator, who utilises tools for interface agents’ construction. In a real scenario, the domain ontology would be constructed by Knowledge Experts or the Administrator of the service space. In the following, an application paradigm related to the construction of a bioinformatics service space is presented.
4. An Application Paradigm with Bioinformatics Resources
Bioinformatics provides an ideal research area in which to employ the service-oriented model, since a wide range of heterogeneous and distributed resources are currently available offering diverse functionalities that need to be integrated within global solution frameworks. An example application category, particularly suitable for evaluating the service-oriented model, is the computational identification of protein-coding genes within query DNA sequences. A survey by Mathé et al. [15] reported 49 gene prediction tools, most of which are freely accessible through user-friendly Web interfaces and highly heterogeneous in the way they structure and encode I/O data, offering diverse levels of parameterisation that further complicate machine-readability and evidence combination. Based on the type of evidence exploited, computational gene prediction is performed by similarity-based and ab-initio methods. Similarity-based gene finders identify protein-coding regions by comparing the query sequence against databases of annotated proteins, cDNA/ESTs (Expressed Sequence Tags), etc. Ab-initio gene prediction tools exclusively make use of the intrinsic information of the query DNA sequence, i.e., specific compositional and structural features, by implementing various probabilistic techniques that identify highly probable coding features based on species-specific gene models. The algorithmic approach that underlies gene prediction tools is an important factor in obtaining accurate predictions of the complete gene assembly. Various assessments of prediction accuracy have concluded that ab-initio gene finders may be highly effective for a specific species data model, sequence length and base composition, while failing to delineate the exact gene boundaries in other cases [16]. Thus, it becomes evident that using multiple gene prediction resources can be especially valuable in improving the quality of the resulting outcome and, therefore, migration to a service-oriented framework supporting efficient registration, matchmaking and coordination mechanisms is a challenging task. In the following, we elucidate the basic components of the semantic schema and describe the implemented procedures addressing service registration, interfacing and matchmaking on a set of gene prediction services.
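To give a concrete flavour of such evidence combination, the sketch below keeps only the exons on which at least k of the consulted prediction services agree. This is an illustrative voting scheme under assumed types (an Exon as a start/end pair), not the combination method of any particular tool.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative consensus over multiple gene prediction services:
// keep exons reported identically by at least k of the tools.
class ConsensusExons {
    record Exon(int start, int end) {}

    static List<Exon> consensus(List<List<Exon>> perToolPredictions, int k) {
        Map<Exon, Integer> votes = new HashMap<>();
        for (List<Exon> prediction : perToolPredictions)
            for (Exon e : prediction)
                votes.merge(e, 1, Integer::sum); // count agreeing tools
        return votes.entrySet().stream()
                .filter(v -> v.getValue() >= k)
                .map(Map.Entry::getKey)
                .sorted((a, b) -> Integer.compare(a.start(), b.start()))
                .toList();
    }
}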
Figure 2. Basic classes of the semantic schema describing gene prediction services and the instances and attributes of AbInitioService class corresponding to ab-initio gene prediction services
4.1. Semantic Description of Services
Following the specifications of the generic biomedical service space presented above, we initially conceptualised the semantic schema of the candidate resources in terms of their functional and technical characteristics. Specifically:
a) Functional analysis involves an ontology-based description of the type of prediction performed and the data used, within the:
• Functional model: determines the type of analysis performed, i.e., ab-initio prediction or similarity-based identification through searching protein, cDNA/EST databases, etc.
• Data model: refers to characteristics such as the supported species-specific model, accuracy cut-off values, etc.
Figure 3. Service registration GUI for gene prediction services that is dynamically generated according to the semantic schema defined
b) Technical characteristics are also investigated and described in the ontology-based schema, regarding:
• I/O parameters: capture data concerning the input data format, e.g., FASTA, GCG, etc., and the type of information extracted, i.e., result format and content.
• Service invocation: contains information on how to access and retrieve data as defined in the corresponding WSDL descriptions, i.e., the gene prediction service operation and parameters.
Figure 2 illustrates a screenshot of the Protégé knowledge modelling tool [17], containing the basic classes of the semantic schema developed to describe gene prediction services (Class Browser panel), as well as the instances (Instance Browser panel) and attributes (Class Editor panel) of the class corresponding to the ab-initio services registered in the constructed service space. For the set of tools defined in the AbInitioService class, the corresponding attributes are depicted, addressing both functional and technical requirements.
4.2. Service Registration
Let us assume that a SP registers a gene prediction service in the “Semantic Directory of Gene Prediction Services”. For this purpose, an appropriate user interface is created dynamically, containing the features encapsulated in the domain ontology for gene
prediction services description, as illustrated in Figure 3. Thus, the SP has to provide functional information, e.g., species models, predicted peptide description, sequence format, etc., as well as technical service features, e.g., the WSDL URL, prediction operation and parameters, etc. It has to be noted that the registration procedure may involve not only the generation of instances in the domain ontology, but also potential schema extensions, e.g., by adding new (not listed) formats for the input sequence (Figure 3).
4.3. Service Interfacing
After a service registration, the corresponding Service-Proxy and Service-Client agents (as well as the ontology they share) are generated dynamically, via actions taken by the service space Administrator. This enables service invocations from other agent-based systems, supporting reusability and extension of the constructed service environment and preserving openness as a design principle. In Figure 4(a), the ontology created for the GENSCAN gene prediction service is illustrated. Figure 4(b) depicts the correspondence between the WSDL description and the ontology, regarding the getGenepredict service operation and the GenscanGetGenepredictAgentAction subclass of the ontology, corresponding to the action that the Service-Proxy agent has to perform in case the getGenepredict operation is invoked. Note, for example, that the operation parameters in0, in1, in2 and in3 correspond to attributes, with matching types, in the class GenscanGetGenepredictAgentAction of the ontology.
4.4. Service Matchmaking
The semantically enriched gene prediction service space enables efficient matchmaking between user requests and the services available. In this work, however, we mainly focus on the construction of the service space. We propose that the matchmaking mechanism in such a case rely on an agent brokering protocol incorporated in the reasoning of a Broker agent [7]. The Broker agent matches the analysis criteria defined (or implied) by the user against the instances stored in the service space according to the semantic schema defined; in other words, it operates as an ontology-based matchmaker, as sketched below. In case of successful service matches, the Broker agent invokes the corresponding service(s) via Service-Client agent requests to the corresponding Service-Proxy agent(s), retrieves the analysis results obtained and provides them to the end user.
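The matchmaking just described can be pictured as a filter over the registered service descriptions. The sketch below is a deliberately simplified, exact-match illustration: the annotation keys (e.g. organism, method) and the flat key-value representation of the semantic schema are assumptions of the example, and real ontology-based matchmaking would also exploit subsumption between concepts.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the Broker agent's ontology-based matchmaking.
class BrokerMatchmaker {
    record ServiceDescription(String name, Map<String, String> annotations) {}

    // Return the registered services whose annotations satisfy every
    // user-supplied criterion (e.g. organism=human, method=ab-initio).
    List<ServiceDescription> match(List<ServiceDescription> registry,
                                   Map<String, String> criteria) {
        return registry.stream()
                .filter(s -> criteria.entrySet().stream()
                        .allMatch(c -> c.getValue().equals(s.annotations().get(c.getKey()))))
                .collect(Collectors.toList());
    }
}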
Figure 4. (a) Application ontology for interfacing with the GENSCAN service and (b) part of GENSCAN’s WSDL file related to the getGenepredict operation (note the correspondence between the getGenepredict operation and the GenscanGetGenepredictAgentAction ontology subclass)
The matchmaking procedure and the relevant brokering protocol are described in more detail in [18], where generic multiagent architectures are presented to address diverse integration scenarios of bioinformatics resources.
4.5. Implementation Details
Under the WSRF framework, Grid and Web services converge. Since the WSRF specifications have not yet been standardised, we consider Web services as an alternative to Grid services in our design and implementation. After a simple analysis of the requirements and constraints introduced by each resource, we identified a set of common features based on which we transformed wrapper applications of conventional Web-based tools into Web services for 12 similarity-based and ab-initio gene finders [19]. Software agents were constructed using JADE (Java Agent Development Framework), a widely known open-source toolkit for the development and execution of agent-based systems [20]. The semantic schema for describing services, as well as the ontology for agent interfacing with the services, was constructed using Protégé. The
V. Koutkias et al. / Constructing a Semantically Enriched Biomedical Service Space
245
interface agents (Socket-Proxy and Socket-Client), as well as the application ontology encoding messages of their communication, were constructed based on the WSDL2AGENT methodology that provides the conceptualisation and an implementation model for accomplishing Web service invocations from agent systems [21].
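For illustration, a minimal JADE agent in the spirit of the Service-Proxy agents described above might look as follows. The message handling uses the standard JADE API; invokeWebService is a placeholder for the WSDL2AGENT-generated invocation code and is purely illustrative:

```java
import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

// Minimal agent in the spirit of a Service-Proxy agent: it waits for a
// REQUEST, calls the wrapped Web service, and returns the result as INFORM.
public class ServiceProxyAgent extends Agent {
    @Override
    protected void setup() {
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage msg = myAgent.receive();
                if (msg != null && msg.getPerformative() == ACLMessage.REQUEST) {
                    String result = invokeWebService(msg.getContent());
                    ACLMessage reply = msg.createReply();
                    reply.setPerformative(ACLMessage.INFORM);
                    reply.setContent(result);
                    myAgent.send(reply);
                } else {
                    block();   // sleep until the next message arrives
                }
            }
        });
    }

    // Placeholder for the generated invocation of the wrapped service
    // operation (e.g. getGenepredict); not the actual generated code.
    private String invokeWebService(String request) {
        return "prediction result for: " + request;
    }
}
```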
5. Discussion

The service-oriented model is expected to play a significant role in next-generation distributed computing architectures [7]. Lately, there has been constantly growing interest in this model, especially for implementing intra-organisational applications, constituting a new approach to developing distributed systems based on the client/server model and appropriate communication standards. Contrary to other distributed system technologies, service-oriented models are independent of the execution platform and programming language, make use of HTTP-based message exchange, and enable loosely-coupled connections, supporting server-client independence. In the work presented, the service-oriented model has been adopted and a uniform, semantically-enriched biomedical service space has been conceptualised. The approach followed is based on the semantic description of services and the adoption of agent technology to control interactions in an open environment. An application paradigm related to the construction of a bioinformatics service space has been elaborated, since the service-oriented model is highly appreciated and widely adopted in this field. In its full extent, the proposed framework may be conceptualised as a Virtual Organisation consisting of Service Providers, Service Consumers and Service Matchmakers. The current effort constitutes a further step towards extending the service-oriented model's potential in the direction of composing a space where services are dynamically registered, annotated and accessed.

As a future work direction, we plan to develop a large-scale biomedical service space by federating application-specific, semantically enriched service spaces, like the one demonstrated for the computational identification of protein-coding genes, and to assess its robustness and flexibility as its openness scales. In such an environment it is interesting to elaborate on the formulation and enactment of service choreographies across different application areas, in order to combine functionalities and perform complex analysis workflows. We are also interested in adopting standard representations of service ontologies in our implementation, such as OWL-S and WSMO (Web Service Modelling Ontology) [7]. Service technologies are by now well established, and the potential of building more generic service environments of horizontally and vertically integrated resources capable of providing advanced functionalities is becoming more realistic. We believe that Grid-enabled infrastructures may provide convenient and powerful mechanisms towards these efforts, and that the deployment of service-centric applications into Virtual Organisations that enable secure and convenient resource mapping and management may be highly effective in extending the application range of the proposed approach.
References
[1] D. de Roure, N.R. Jennings, N.R. Shadbolt, The Semantic Grid: A future e-science infrastructure, Grid Computing - Making the Global Infrastructure a Reality, Wiley & Sons (2003), 437-470.
[2] C.A. Goble, D. de Roure, N.R. Shadbolt, A.A.A. Fernandes, Enhancing services and applications with knowledge and semantics, The Grid: Blueprint for a new computing infrastructure (2nd edition), ser. Grid Computing, Elsevier (2004), 431-458.
[3] I. Foster, C. Kesselman, S. Tuecke, The anatomy of the Grid: Enabling scalable Virtual Organizations, International Journal of High Performance Computing Applications 15(3) (2001), 200-222.
[4] H. Wang, J.Z. Huang, Y. Qu, J. Xie, Web services: Problems and future directions, Journal of Web Semantics 1 (2004), 309-320.
[5] R. de Knikker et al., A Web services choreography scenario for interoperating bioinformatics applications, BMC Bioinformatics 5(25) (2004).
[6] M.O. Shafiq, H.F. Ahmad, H. Suguri, A. Ali, Autonomous Semantic Grid: Principles of Autonomous Decentralized Systems for Grid Computing, IEICE Transactions on Information and Systems E88-D(12) (2005), 2640-2650.
[7] M.P. Singh, M.N. Huhns, Service-oriented computing: Semantics, processes, agents, Wiley & Sons, 2005.
[8] H.T. Gao, J.H. Hayes, H. Cai, Integrating biological research through Web services, IEEE Computer 38(3) (2005), 26-31.
[9] M.D. Wilkinson, M. Links, BioMOBY: An open-source biological Web Services proposal, Briefings in Bioinformatics 3(4) (2002), 331-341.
[10] V. Breton et al., The Healthgrid White Paper, Proc. of HealthGrid 2005, Studies in Health Technology and Informatics, IOS Press 112 (2005), 249-318.
[11] I. Foster, C. Kesselman, Concepts and architecture, The Grid: Blueprint for a new computing infrastructure (2nd edition), ser. Grid Computing, Elsevier (2004), 37-63.
[12] L. Stein, Creating a bioinformatics nation, Nature 417 (2002), 19-20.
[13] I. Foster, N.R. Jennings, C. Kesselman, Brain meets brawn: Why Grid and agents need each other, Proc. of the 3rd Int. Joint Conference on Autonomous Agents and Multi-Agent Systems, New York, USA (2004), 8-15.
[14] L. Moreau et al., On the use of agents in a bioinformatics grid, Proc. of the 3rd CCGRID (2003), Tokyo, Japan.
[15] C. Mathé, M.F. Sagot, T. Schiex, P. Rouzé, Current methods of gene prediction, their strengths, and weaknesses, Nucleic Acids Research 30 (2002), 4103-4117.
[16] S. Rogic, A.K. Mackworth, F. Ouellette, Evaluation of gene-finding programs on mammalian sequences, Genome Research 11 (2001), 817-832.
[17] N.F. Noy et al., Creating Semantic Web contents with Protégé-2000, IEEE Intelligent Systems 16(2) (2001), 60-71.
[18] V. Koutkias, A. Malousi, N. Maglaveras, Engineering agent-mediated integration of bioinformatics analysis tools, Proc. of 1st Int. Workshop on Multi-Agent Systems for Medicine, Computational Biology, and Bioinformatics, 4th Int. Joint Conference on Autonomous Agents and Multi-Agent Systems, Utrecht, The Netherlands (2005), 122-136.
[19] V. Koutkias, A. Malousi, N. Maglaveras, Performing ontology-driven gene prediction queries in a multi-agent environment, Proc. of ISBMDA, Lecture Notes in Computer Science, Springer-Verlag 3337 (2004), 378-387.
[20] F. Bellifemine, F. Bergenti, G. Caire, A. Poggi, JADE - A Java Agent Development Framework, Multi-Agent Programming: Languages, Platforms and Applications, ser. Multiagent Systems, Artificial Societies, and Simulated Organizations, Springer (2005), 125-147.
[21] L.Z. Varga, A. Hajnal, Engineering Web Service invocation from agent systems, Proc. of CEEMAS, Lecture Notes in Artificial Intelligence, Springer-Verlag 2691 (2003), 626-635.
Building a European Biomedical Grid on Cancer: The ACGT Integrated Project

M. Tsiknakis (a), D. Kafetzopoulos (b), G. Potamias (c), A. Analyti (d), K. Marias (c), A. Manganas (c)

(a) Center for eHealth Technologies, Institute of Computer Science, FORTH, Crete, Greece
(b) Post Genomic Technologies Laboratory, Institute of Molecular Biology and Biotechnology, FORTH, Crete, Greece
(c) Biomedical Informatics Laboratory, Institute of Computer Science, FORTH, Crete, Greece
(d) Information Systems Laboratory, Institute of Computer Science, FORTH, Crete, Greece

Abstract. This paper presents the needs and requirements that led to the formation of the ACGT (Advancing Clinico Genomic Trials) integrated project, together with the vision and methodological approaches of the project. The ultimate objective of the ACGT project is the development of a European biomedical grid for cancer research, based on the principles of open access and open source, enhanced by a set of interoperable tools and services which will facilitate the seamless and secure access to and analysis of multi-level clinico-genomic data, enriched with high-performing knowledge discovery operations and services. By doing so, it is expected that the influence of genetic variation in oncogenesis will be revealed, the molecular classification of cancer and the development of individualised therapies will be promoted, and the in-silico modelling of tumour growth and therapy response will become realistic and reliable. The project's main design decisions and the results at its current stage of development are presented.
Keywords. Biomedical grids, Semantic data mediation and integration, Data mining and knowledge discovery on the Grid, Cancer research
1. Introduction This is a critical time in the history of cancer research as recent advances in methods and technologies have resulted in an explosion of information and knowledge about cancer and its treatment. As a result, our ability to characterize and understand the various forms of cancer is growing exponentially, and cancer therapy is changing dramatically. Today, the application of novel technologies from proteomics and functional genomics to the study of cancer is slowly shifting to the analysis of clinically relevant samples such as fresh biopsy specimens and fluids, as the ultimate aim of translational research is to bring basic discoveries closer to the bedside. The implementation of discovery driven translational research, however, will not only require co-ordination of basic research activities, facilities and infrastructures, but
also the creation of an integrated and multidisciplinary environment with the participation of dedicated teams of clinicians, oncologists, pathologists, epidemiologists and molecular biologists, as well as a variety of disciplines from the domain of information technology. Today, information arising from post-genomics research and combined genetic and clinical trials on the one hand, and advances in high-performance computing and informatics on the other, are rapidly providing the medical and scientific community with new insights, answers and capabilities. The breadth and depth of information already available to the research community at large present an enormous opportunity for improving our ability to reduce mortality from cancer, improve therapies and meet the demanding needs of individualised care. A critical set of challenges, however, currently inhibits our capacity to capitalize on these opportunities [1]. Much of the genomic data of clinical relevance generated so far is in a format that is inappropriate for diagnostic testing. Very large epidemiological population samples, followed prospectively (over a period of years) and characterized for their biomarker and genetic variation, will be necessary to demonstrate the clinical usefulness of these tools. Up to now, the lack of a common infrastructure has prevented clinical research institutions from mining and analyzing disparate data sources. This inability to share technologies and data developed by different cancer research institutions can therefore severely hamper the research process. Similarly, the lack of a unifying architecture is proving to be a major roadblock to a researcher's ability to mine different databases. Most critically, however, even within a single laboratory, researchers have difficulty integrating data from different technologies because of a lack of common standards and other technological, medico-legal and ethical issues. As a result, very few cross-site studies and clinical trials are performed, and in most cases it is not possible to seamlessly integrate multi-level data (from the molecular to the organ, individual and population levels). In conclusion, clinicians and molecular biologists often find it hard to exploit each other's expertise due to the absence of a cooperative environment which enables the sharing of data, resources or tools for comparing results and experiments, and of a uniform platform supporting the seamless integration and analysis of disease-related data at all levels.

1.1. GRID Computing

Grid computing enables the virtualization of distributed computing over a network of heterogeneous resources, giving users and applications seamless, on-demand access to vast IT capabilities [2]. Grid computing provides a novel approach to harnessing distributed resources, including applications, computing platforms, databases and file systems. Applying grid computing can deliver significant benefits by improving information access and responsiveness and by adding flexibility, all crucial components of solving the data warehouse dilemma. Rather than bringing data to a data warehouse where it sits waiting to be used, a federated solution can maintain the data at its points of origin. Federated solutions help to address the size and complexity of data warehouses by applying a logical model to the existing physical infrastructure instead of imposing a new data warehouse environment.
Information grid technology, which gives users and applications security-rich access to virtually any information source, anywhere, over any type of network, supports the sharing of data for processing and large-scale collaboration. It also helps bring the federated model to distributed and complex data sources.
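To make the federated model concrete, the sketch below (ours; the DataSource interface and class names are hypothetical and not part of any particular grid toolkit) shows a query fanning out to the sources where the data resides and merging the partial answers, instead of first copying everything into a warehouse:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Schematic contrast with the warehouse model: the federated layer sends
// the query to each source where the data lives and merges the answers.
interface DataSource {
    List<String> query(String q);
}

class FederatedQueryService {
    private final List<DataSource> sources;
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    FederatedQueryService(List<DataSource> sources) { this.sources = sources; }

    List<String> query(String q) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (DataSource s : sources) {
            Callable<List<String>> task = () -> s.query(q); // data stays at its origin
            futures.add(pool.submit(task));
        }
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());                         // combine partial answers
        }
        return merged;
    }
}
```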
Grid computing also introduces a new concept to IT infrastructures, because it supports distributed computing over a network of heterogeneous resources and is enabled by open standards. As a result, new and innovative approaches are evolving for harnessing the vast and unused computational power of the world's computers and directing it at research designed to help unlock the genetic codes that underlie diseases like cancer, AIDS and Alzheimer's.
2. A European Biomedical Grid Infrastructure for Clinical Trials on Cancer: The ACGT Vision

Within such a context, the implementation of the EU-funded Integrated Project named “Advancing Clinico-Genomic Trials on Cancer: Open Grid Services for Improving Medical Knowledge Discovery”, with the acronym ACGT, is beginning. The ultimate objective of the ACGT project is the provision of a unified technological infrastructure which will facilitate the seamless and secure access and analysis of multi-level clinico-genomic data, enriched with high-performing knowledge discovery operations and services (see Fig. 1). In so doing, ACGT aims to contribute to (a) the advancement of cancer research by revealing the influence of genetic variation in oncogenesis, (b) the promotion of the molecular classification of cancer and the development of individualised therapies, and (c) the development of realistic and reliable in-silico tumour growth and therapy response models (for the avoidance of expensive and often dangerous examinations and trials on patients) [3].

The real and specific problem that underlies the ACGT concept is “co-ordinated resource sharing and problem solving in dynamic, multi-institutional, Pan-European virtual organisations”. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. The set of individuals and/or organisations defined by such sharing rules is what we call the ACGT virtual organisation (VO). The project intends to eventually become a Pan-European community of voluntary participants from national and trans-national biomedical research fields, given the benefits of open access to a rich source of interoperable tools, shared data and standards developed by the BioMedical Informatics (BMI) research community. The basic principles on which ACGT is basing its R&D and service delivery vision are:
• Clinical Research Organisations (CROs) will continue to retain their independence, whilst their collaboration with each other will be determined by their interests;
• The technological infrastructure deployed by the various CROs will be different, hence heterogeneous.
In achieving the above objectives, we envisage a need for:
• highly flexible and dynamic sharing relationships. The dynamic nature of sharing relationships means that we require mechanisms for discovering and characterising the nature of the relationships that exist at a particular point in time. For example, a new participant joining a VO must be able to determine what resources it is able to access, the “quality” of these resources, and the policies that govern access;
• sophisticated and precise levels of control over how shared resources are used, including fine-grained and multi-stakeholder access control, delegation, and application of local and global policies;
• sharing of varied resources, ranging from programs and data to computers;
• diverse usage models, ranging from single-user to multi-user and from performance-sensitive to cost-sensitive.

Figure 1: The envisioned ACGT GRID-enabled infrastructure and integrated environment – integration to be achieved at all levels, from the molecular to system and to the population.
Consequently, ACGT will create and test an infrastructure for cancer research by using a virtual web of trusted and interconnected organizations and individuals to leverage the combined strengths of cancer centres and investigators, and to enable the sharing of biomedical cancer-related data and research tools in a way that meets the common needs of interdisciplinary research. Furthermore, ACGT intends to build upon the results of several biomedical Grid projects and initiatives, such as caBIG [4], BIRN [5], MEDIGRID [6], MyGRID [7] and DiscoverySpace [8]. The project focuses on the semantically rich problems of dynamic resource discovery, workflow specification, and distributed query processing, as well as provenance management, change notification, and personalization. The infrastructure work in ACGT contains the following main components:
• BIOMEDICAL TECHNOLOGY GRID LAYER: This layer comprises the basic “Grid engine” for the scheduling and brokering of resources. It enables the creation of “Virtual Organisations (VO)” by integrating users from different and heterogeneous organisations. Access rights, security (encryption) and trust building are issues to be addressed and solved on this layer, based on system architectural and security analysis.
• DISTRIBUTED DATA ACCESS AND APPLICATIONS: In order to provide seamless and interoperable data access services to the distributed data sources, a set of compatible key software modules/services will be developed based on Web Services. These services will provide ontology-based, ubiquitous interoperability between the integrated ACGT environment and other types of heterogeneous information systems, i.e. clinical, LIMS, microarray, SNP/genotyping, etc.
• DATA MINING AND KNOWLEDGE DISCOVERY TOOLS: The “Data Mining and Knowledge Discovery Services” layer includes open data mining and data analysis services. ACGT will devote significant effort to the design, development and deployment of open, interoperable data mining and analysis software tools and services. The ultimate goal is to offer a GRID-enabled Knowledge Discovery Suite [9] for supporting discovery operations on combined clinico-genomic biomedical data.
• ONTOLOGIES AND SEMANTIC MEDIATION TOOLS: Formalised knowledge representations (ontologies) will play a key role in any future biomedical Grid on cancer research. This creates the requirement for adopting, extending or even constructing an ontology for the particular disease under investigation. Building on the various ontologies and controlled vocabularies that have grown over the years to provide a shared language for the communication of biomedical information (e.g., the Gene Ontology (GO), the MGED Ontology, the NCI Thesaurus and Metathesaurus, the UMLS Metathesaurus, etc.), ACGT is devoting significant R&D effort to the task of constructing a shared ontology for the disease under investigation.
• TECHNOLOGIES AND TOOLS FOR IN-SILICO ONCOLOGY: ACGT will demonstrate its added value for the in-silico modelling of tumour growth and therapy response. The aim is to develop open tools and services for the four-dimensional, patient-specific modelling and simulation of the biological activity of malignant tumours and normal tissues, in order to optimize the spatiotemporal planning of various therapeutic schemes. Ultimately, the aim of this activity is to contribute to the effective treatment of cancer and to the understanding of the disease at the molecular, cellular, and higher levels of complexity.
• THE INTEGRATED ACGT ENVIRONMENT: Integration of applications and services will require substantial meta-information on algorithms and input/output formats if tools are to interoperate. Assembly of tools for virtual screening into complex workflows will only be possible if data formats are compatible and the semantic relationships between objects shared or transferred in workflows are clear.
3. R&D Challenges and the ACGT Approach

A major part of the project is devoted to research and development on infrastructure components that will eventually be integrated into a workable demonstration platform, upon which the selected clinical trials (including those to be selected during the lifecycle of the project) can be demonstrated and evaluated against user requirements defined at the outset of the project.
ACGT's vision is to become a pan-European voluntary grid network connecting individuals and institutions to enable the sharing of data and tools, creating a European Wide Web of cancer clinical research; the ultimate goal is to speed the delivery of innovative approaches for the prevention and treatment of cancer. In realizing this vision the ACGT project must move beyond the current state of the art in a number of domains. Some of the challenges facing ACGT, and the approach taken in tackling them, are briefly described in the following subsections.

3.1. The ACGT Clinical Trials

Today it is recognised that the key to individualizing treatment for cancer lies in translational research, i.e. in finding ways to quickly “translate” the discoveries about human genetics made by laboratory scientists in recent years into tools that physicians can use to help make decisions about the way they treat patients. In Europe there are several ongoing clinical studies related to cancer. However, among the different hospitals involved there is heterogeneity in the way patient data is documented, and electronic patient records are not available in all hospitals. In several ongoing clinical trials, “case report forms” (CRFs) are frequently used to record protocol-specified data about patients. The use of distributed approaches introduces significant advantages that should be analysed in the context of clinical trials. The new scenarios of genomic medicine introduce significant new challenges that cannot be addressed with current methodologies. One of the issues where the ACGT project presents an innovative approach is the design of new, combined clinico-genomic translational clinical trials, enabled by the set of innovative tools and services available to all members of the “virtual organisation” created through the use of the GRID.

There are three main clinico-genomic trials (C-GT) in the ACGT project. The realization of these trials will be based on a number of scenarios which will act as benchmark references for the development and assessment of the ACGT technology. On the systems level, these scenarios will guide the specification, development and evaluation of the GRID-enabled ACGT integrated environment and platform. On the clinical and genomic levels, these scenarios will offer clear-cut references for assessing the reliability of the outcomes of the ACGT-based clinico-genomic trials.
1. The first C-GT focuses on breast cancer (BC) and addresses the predictive value of gene-expression profiling (based on microarray and genotyping technology) in classifying (according to induced ‘good’ and ‘bad’ prognostic molecular signatures) and treating BC patients.
2. The second C-GT focuses on paediatric nephroblastoma, or Wilms tumour (PN), and addresses the treatment of PN patients according to well-defined risk groups, in order to achieve the highest possible cure rates, to decrease the frequency and intensity of acute and late toxicity, and to minimize the cost of therapy. The main objective of this trial is to explore and offer a molecular extension to PN treatment, harmonized with traditional clinico-histological approaches.
3. The third C-GT focuses on the development and evaluation of in-silico tumour growth and tumour/normal tissue response simulation models – in-silico tumour growth and simulation modelling (IS-TGSM).
The aim of this trial is to develop an ‘oncosimulator’ and evaluate the reliability of in-silico modelling as a tool for assessing alternative cancer treatment strategies.
3.2. The Biomedical Grid and Heterogeneous Data Integration

We have selected the Globus Toolkit as the grid middleware for building our open grid layer. The Globus Toolkit [10] is an open source software toolkit developed by the Globus Alliance and many others. It provides grid services that meet the requirements of the Open Grid Service Architecture and are implemented on top of the Web Service Resource Framework. It includes software for security, information infrastructure, resource management, data management, communication, fault detection, and portability. The most important components of the Globus Toolkit involved in our envisaged grid system are WS-GRAM (Web Services – Grid Resource Allocation & Management) for job execution, MDS4 (Monitoring & Discovery System), and GSI (Grid Security Infrastructure). Another technology to be included in ACGT is OGSA-DAI, as a grid data layer for exposing data services. The OGSA-DAI data service is responsible for accessing and retrieving clinical and genomic information from the corresponding information systems [11].

A critical feature of ACGT, however, is the creation of semantic interoperability between data resources. As part of the grid architecture, the ACGT Master Ontology plays a central part in creating this semantic interconnection. Classical approaches to database integration [12] include techniques such as wrappers or virtual conceptual schemas. Ontologies are a relevant method for database integration and, in fact, many current projects and proposals are evolving towards ontology-based methods. Using these ontology-based approaches, developers can map, for instance, objects belonging to a specific database to concepts of a shared ontology or biomedical vocabulary. Our approach to heterogeneous data integration is based on a mediator-wrapper architecture enabled by the use of ontologies and metadata (see Fig. 2).
Figure 2: Heterogeneous data integration in ACGT. (The figure shows knowledge and discovery tools posing queries in terms of the ACGT Master Ontology (MO) to a mediator service that performs query translation, query optimization, query decomposition, subquery scheduling, answer translation and answer composition; wrapper services, each described by a source description (local ontology, LO) mapped to a local schema (LS), access the underlying data sources (DS), while mapping and alignment tools relate the MO to the local and external ontologies.)
In particular, the mediator will integrate heterogeneous data sources (which can be ACGT databases, external databases, web sources or web data services) by providing a virtual view of all these data. Users (including ACGT tools or services) posing queries to the mediated system do not have to know about data source locations, schemas, or access methods, since the system presents one shared mediator ontology (the ACGT master ontology) to the user, and users form their queries using its terms. In order for the mediator to integrate the various heterogeneous data sources, their object models, terminologies, embedded domain ontologies, hidden semantic information, query capabilities, and security information will be analysed. Based on this analysis, a source description will be defined, consisting of a local ontology along with a set of metadata specifying query capabilities and security information. Thus, a source description is an abstraction of a particular data source, possibly conveying semantic information not present in the data source schema. Wrappers are software components providing source-dependent data services to the mediator. Each wrapper receives queries from the mediator in terms of the local ontology, transforms them into the format of the underlying data source, submits the query to the data source, and translates the results back to the local ontology schema. Thus, a wrapper hides the technical details of the data source from the mediator. When the mediator receives a query from the user in terms of the ACGT master ontology, it decomposes the query into subqueries in terms of the local ontologies by taking into account the source descriptions, and sends them to the corresponding wrappers. Then, upon receiving the answers from the wrappers, it translates them into terms of the ACGT master ontology, combines the results, and sends the final answer to the user. Thus, the mediator has to perform the following subtasks: query translation, query optimization, query decomposition, subquery scheduling, answer translation, and answer composition. In summary, there are two conceptual translation steps:
1. from the ACGT master ontology to the local ontologies, and vice-versa;
2. from the local ontologies to the source schemas, and vice-versa.
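As a rough illustration of these two translation steps, the following sketch (ours; all class names and mapping tables are illustrative stand-ins, not the ACGT implementation) shows a mediator decomposing a master-ontology query into local-ontology subqueries, with wrappers translating further down to the source schemas:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Step 2: a wrapper translates local-ontology queries to the source schema
// and hides the technical details of its data source.
class Wrapper {
    private final Map<String, String> localToSchema;

    Wrapper(Map<String, String> localToSchema) { this.localToSchema = localToSchema; }

    String answer(String localQuery) {
        String schemaQuery = localToSchema.getOrDefault(localQuery, localQuery);
        return executeOnSource(schemaQuery);   // results translated back on return
    }

    private String executeOnSource(String q) { return "rows for " + q; }
}

// Step 1: the mediator translates master-ontology queries into
// local-ontology subqueries and composes the answers.
class Mediator {
    private final Map<String, String> masterToLocal;
    private final List<Wrapper> wrappers;

    Mediator(Map<String, String> masterToLocal, List<Wrapper> wrappers) {
        this.masterToLocal = masterToLocal;
        this.wrappers = wrappers;
    }

    List<String> query(String masterQuery) {
        // decompose: translate the master-ontology query for each source
        String localQuery = masterToLocal.getOrDefault(masterQuery, masterQuery);
        List<String> answers = new ArrayList<>();
        for (Wrapper w : wrappers) {
            answers.add(w.answer(localQuery));
        }
        return answers;   // composed answer, in master-ontology terms
    }
}
```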
These two translation steps are performed by the mediator and wrapper components, respectively. Of course, they require the establishment of the relevant (semantic) mappings. In particular, for translation step 1, concepts and relationships in the ACGT master ontology should be related to those in the local ontologies through a mapping tool. In addition, if a local ontology is linked with an external ontology, then the ACGT master ontology should also be mapped and aligned with that external ontology, through an alignment tool. For translation step 2, the local ontologies should be mapped to the schema of the corresponding data source. In the case that a source offers limited query capabilities, methods should be written implementing the functionality described in the source description.

3.3. An Ontology for Clinical Trials on Cancer

The huge amount of heterogeneous data that genomic and epidemiological researchers share has generated important challenges for information access, integration and analysis that biomedical informaticians must address. In biomedicine, there are currently no ontologies that integrate both genomic and clinical data as they are actually needed in tasks such as information retrieval or data mining.
For these expectations to be achieved, more domain ontologies should be developed in biomedicine. While there are great expectations for the future, with widespread opinions proposing the development of large biomedical ontologies, there is also a need for pragmatism. Currently, the ontologies (or vocabularies) used in biomedical informatics (such as the UMLS, which now includes the Gene Ontology) are plagued with technical problems that need to be solved by software engineers. At the same time, there is a vast number of real problems that can benefit from using these ontologies or vocabularies: for instance, (i) designing new models for biobanks, i.e. databases that include both clinical and genomic data, (ii) the unification of databases including information such as Single Nucleotide Polymorphisms (SNPs) and pathways, (iii) using clinical data in drug discovery, (iv) improving data mining and searching, and many others. It is doubtful that ontological research will have a significant impact per se in achieving outstanding scientific advances in biomedical informatics. To have realistic chances of success, achievements in ontological research will need to be linked to BMI methods and procedures, and actual BMI research issues will need to be considered and addressed [13]. The key semantic integration architectural objectives in ACGT include:
• the development of semantic middleware technology, enabling large-scale (semantic, structural, and syntactic) interoperation among biomedical resources and services on an as-needed basis;
• the development of a shared mediator ontology, the ACGT Master Ontology, through semantic modeling of biomedical concepts using existing ontologies and ontologies developed for the needs of the project;
• the mapping of local conceptual models (clinical, genomic) to the shared ontology, while checking the consistency and integrity of the mapped information;
• the development of a semantic-based data service registry to allow advertisement and discovery of data services on the grid. Such a registry will allow ACGT clients to discover data services that have a particular capability or manage a particular data source;
• the semantic annotation and advertisement of biomedical resources, to allow metadata-based discovery and query of biomedical resources by users, tools, and services;
• the description of wet lab experiments, in-silico experiments, and clinical trials, augmented with metadata so as to provide adequate provenance information for future re-use, comparison, and integration of results.
In particular, the development of the ACGT master ontology involves the analysis of (i) the ontological needs of the ACGT clinical scenarios and (ii) the ontological foundations and coverage of the existing terminologies and ontologies in the biomedical domain, such as the National Cancer Institute (NCI) Thesaurus [14] and other Open Biomedical Ontologies (OBO) [15]. Based on the analysis of (i) and (ii), it will become possible to craft an ontology that is able to function as a semantic mediator between all the systems to be integrated. The ontology should satisfy the conceptual demands of IFOMIS' Basic Formal Ontology (BFO) [16] and its domain-dependent Medical Ontology (MedO). All entities in the ontology must be given a formal definition. The representation will be in the form of classes defined on the basis of the particulars that instantiate them. This master ontology will contain classes
which are instantiated by particulars of various levels of granularity, ranging from molecules, subcellular components, cells and organs to organisms and populations. The relationships required to link the entities in a meaningful way will also be defined using a first-order logical language; for example, a transitivity axiom for a part_of relation would take the form ∀x,y,z (part_of(x,y) ∧ part_of(y,z) → part_of(x,z)).

3.4. eScience Workflows

In providing an open, integrated environment for clinical trial management using workflows, ACGT will need to accommodate and integrate a vast range of resources in terms of data and applications. These resources may be within an organisation, for example in-house systems at a given clinical research organisation or local tools developed within an academic research group, or may be external services delivered by a public body or accessed across an extranet. The European Bioinformatics Institute (http://www.ebi.ac.uk/services/index.html) alone hosts over 50 tools and 40 databases. The ACGT project has identified key user needs with respect to clinical trial workflows. These are:
• Workflow lifecycle: Use of a workflow as part of a scientific endeavour requires support for the workflow lifecycle.
• Semantic description of workflows: The workflows (and resources) for a particular clinical trial will not necessarily be known a priori. Specification at a semantic level of the resources and activities required will allow dynamic discovery of suitable resources (in the context of a European open federation of resource providers and resource consumers) and workflows.
• Workflow provenance: Use of workflows as part of scientific activity often requires provenance data [17] to be kept about activities performed during workflow execution (e.g. details of specific service providers, versions of data and tools involved, etc.).
The ACGT master ontology, along with additional service/workflow metadata and ontologies, will also be used for annotating services and ready-made workflows (involved in wet lab experiments). The use of ontologies and metadata in wet lab experiments is shown graphically in Fig. 3. Service and workflow annotations will provide information regarding the service interface, functionality, provider, quality of service, etc. Annotated services and workflows are registered in the service/workflow registry, organized in classes. Based on these annotations, and assisted by the service and workflow discovery module, the user should be able to semi-automatically compose new scientific workflows. In a workflow, data and parameters are given as input to the top-level services by the user. Their output (possibly combined) is then given as input to the next-level services, and so on, until the final result is derived by the bottom-level service. The workflow composition component should ensure that the output-input interfaces of dependent services match (a sketch of this check is given below). Once a workflow is composed, the user can execute it and store the result in the Wet Lab DB, annotated with (i) metadata and ontology terms describing the result, and (ii) provenance information (service invocation sequence, origin of data, dates, etc.), based on the provenance metadata and ontology.
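As a concrete but non-authoritative sketch of the output-input matching and provenance recording just described (all types are hypothetical, not part of the ACGT platform):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// A workflow step with semantic input/output types.
class ServiceStep {
    String name;         // e.g. "normalize", "cluster"
    String inputType;    // semantic type of the expected input
    String outputType;   // semantic type of the produced output

    ServiceStep(String name, String inputType, String outputType) {
        this.name = name; this.inputType = inputType; this.outputType = outputType;
    }

    String run(String input) { return name + "(" + input + ")"; }
}

class Workflow {
    private final List<ServiceStep> steps = new ArrayList<>();
    private final List<String> provenance = new ArrayList<>();

    // Composition check: a step may be appended only if its input type
    // matches the output type of the previous step.
    void append(ServiceStep step) {
        if (!steps.isEmpty()) {
            String prevOut = steps.get(steps.size() - 1).outputType;
            if (!prevOut.equals(step.inputType)) {
                throw new IllegalArgumentException(
                    "interface mismatch: " + prevOut + " -> " + step.inputType);
            }
        }
        steps.add(step);
    }

    String execute(String input) {
        String data = input;
        for (ServiceStep s : steps) {
            data = s.run(data);
            // record provenance: invocation sequence and dates
            provenance.add(Instant.now() + " invoked " + s.name);
        }
        return data;   // would be stored in the Wet Lab DB with 'provenance'
    }
}
```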
Figure 3: The use of ontologies and metadata in wet lab experiments. (The figure shows the ACGT Master Ontology and service/workflow metadata and ontologies driving service/workflow annotation, discovery and composition; composed workflows invoke registered services, and workflow results, annotated with provenance metadata, are stored in the ACGT wet lab DB.)
4. Conclusions

ACGT brings together internationally recognised leaders in their respective fields, with the aim of delivering to the cancer research community an integrated Clinico-Genomic ICT environment enabled by a powerful GRID infrastructure. In achieving this objective ACGT has formulated a coherent, integrated workplan for the design, development, integration and validation of all technologically challenging areas of work, namely: (a) GRID: delivery of a European Biomedical GRID infrastructure offering seamless mediation services for sharing data and data-processing methods and tools, and advanced security; (b) Integration: semantic, ontology-based integration of clinical and genomic/proteomic data, taking into account standard clinical and genomic ontologies and metadata; (c) Knowledge Discovery: delivery of data-mining GRID services in order to support and improve complex knowledge discovery processes. The technological platform will be validated in a concrete setting of advanced clinical trials on cancer. Pilot trials have been selected based on the presence of clear research objectives, raising the need to integrate data at all levels of the human being. ACGT promotes the principles of open source and open access, thus enabling the gradual creation of a European Biomedical Grid on Cancer.

Acknowledgements

The authors wish to express their gratitude to the whole of the ACGT consortium for their contributions of the various ideas on which the ACGT project was developed.
The ACGT project is funded by the European Commission (Contract No. FP6/2004/IST-026996).

References
[1] Editorial, Making data dreams come true, Nature 428 (2004), 239.
[2] I. Foster, The Grid: A new infrastructure for 21st century science, Physics Today 55(2) (2002), 42-47.
[3] C. Sander, Genomic medicine and the future of health care, Science 287(5460) (2000), 1977-1978.
[4] D. Fenstermacher, C. Street, T. McSherry, V. Nayak, C. Overby, M. Feldman, The Cancer Biomedical Informatics Grid (caBIG), Proc. of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China (2005).
[5] http://www.nbirn.net/
[6] M. Bertero, P. Bonetto, L. Carracciuolo, L. D'Amore, A. Formiconi, M.R. Guarracino, G. Laccetti, A. Murli, G. Oliva, MedIGrid: A medical imaging application for computational grids, International Parallel and Distributed Processing Symposium (IPDPS 2003) (2003), 252.
[7] http://www.mygrid.org.uk
[8] http://www.bcgsc.ca/discoveryspace/
[9] G. Kickinger, P. Brezany, A Min Tjoa, J. Hofer, Grid knowledge discovery processes and an architecture for their composition, IASTED Conference 2004, Innsbruck, Austria (2004).
[10] The Globus Alliance, http://www.globus.org
[11] The OGSA-DAI project, http://www.ogsadai.org.uk
[12] W. Sujansky, Heterogeneous database integration in biomedicine, Journal of Biomedical Informatics 34(4) (2001), 285-298.
[13] J. Köhler, S. Philippi, M. Lange, SEMEDA: ontology based semantic integration of biological databases, Bioinformatics 19(18) (2003), 2420-2427.
[14] National Cancer Institute (NCI) Enterprise Vocabulary Services (EVS), http://www.nci.nih.gov/cancerinfo/terminologyresources
[15] Open Biomedical Ontologies (OBO), http://www.bioontology.org/resources-obo.html
[16] P. Grenon, B. Smith, L. Goldberg, Biodynamic Ontology: Applying BFO in the biomedical domain, in: D.M. Pisanelli (ed.), Ontologies in Medicine, IOS Press, Amsterdam (2004), 20-38.
[17] M. Greenwood, C. Goble, R. Stevens, J. Zhao, M. Addis, D. Marvin, L. Moreau, T. Oinn, Provenance of e-Science experiments - experience from bioinformatics, Proc. UK OST e-Science 2nd All Hands Meeting (AHM'03), Nottingham, UK (2003), 223-226.
Health-e-Child: An Integrated Biomedical Platform for Grid-Based Paediatric Applications

Joerg FREUND (a) (Project Coordinator), Dorin COMANICIU (a), Yannis IOANNIS (e), Peiya LIU (a), Richard McCLATCHEY (d), Edwin MORLEY-FLETCHER (b), Xavier PENNEC (f), Giacomo PONGIGLIONE (c) and Xiang (Sean) ZHOU (a)

(a) Siemens AG, Erlangen, Germany
(b) Lynkeus SRL, Rome, Italy
(c) IRCCS Giannina Gaslini, Genoa, Italy
(d) University of the West of England (UWE), Bristol, UK
(e) University of Athens, Greece
(f) INRIA, Sophia Antipolis, France

On behalf of the Health-e-Child Consortium, which also includes: University College London, Great Ormond St. Hospital, UK; Assistance Publique Hopitaux de Paris, France; CERN, Geneva, Switzerland; Maat GKnowledge, Toledo, Spain; Universita' degli Studi di Genova (DISI), Italy; European Genetics Foundation, Bologna, Italy; Aktsiaselts ASPER BIOTECH, Tartu, Estonia; Gerolamo Gaslini Foundation, Genoa, Italy

Abstract. There is a compelling demand for the integration and exploitation of heterogeneous biomedical information for improved clinical practice, medical research, and personalised healthcare across the EU. The Health-e-Child project aims at developing an integrated healthcare platform for European Paediatrics, providing seamless integration of traditional and emerging sources of biomedical information. The long-term goal of the project is to provide uninhibited access to universal biomedical knowledge repositories for personalised and preventive healthcare, large-scale information-based biomedical research and training, and informed policy making. The project focus will be on individualized disease prevention, screening, early diagnosis, therapy and follow-up of paediatric heart diseases, inflammatory diseases, and brain tumours. The project will build a Grid-enabled European network of leading clinical centres that will share and annotate biomedical data, validate systems clinically, and diffuse clinical excellence across Europe by setting up new technologies, clinical workflows, and standards. This paper outlines the design approach being adopted in Health-e-Child to enable the delivery of an integrated biomedical information platform.

Keywords. Biomedical informatics, Grid application, heterogeneous data integration, system architecture and design
1. Background
From DNA sequencing to laboratory testing and epidemiological analysis, clinicians and researchers both produce and search for information as part of their daily routine and decision making. Taking advantage of technology has dramatically improved the quality of the results of these activities, facilitating better health-care provision and more advanced biomedical research. Nevertheless, the current state of affairs is still severely restricted with respect to the kind of information that is available to clinicians:
• In each case, clinicians focus their activity around a particular genre of information, e.g., genetic information or laboratory test data, thereby obtaining a rather narrow and fragmented view of the individual patient that they are examining or the disease that they are investigating.
• For the most part, they are confined to using information that they themselves or, in the best case, laboratories in their immediate environment generate. For research, they do access general public data banks (e.g., GenBank), but only a limited number of such resources are actually available.
• Especially in paediatrics, longitudinal data (i.e. data taken periodically over time) is usually unavailable, and clinicians are forced to operate based on information generated from the current state of their patients.
• Given this fragmented nature of the primary information available, opportunities for large-scale analysis, abstraction, and modelling are very limited as well. Hence, any secondary information and value-added knowledge that comes into the hands of clinicians is equally restricted.
• Face-to-face conference meetings and reading the literature are the only means in the hands of clinicians for exchanging experiences or obtaining second opinions on rare or unclear cases.
Figure 1: Health-e-Child conceptual design
Hence, despite the undisputed advances in biomedical informatics, the above restrictions paint a rather bleak picture of the overall state of affairs. None of the current long-term targets of the field, e.g., personalised medical care, distributed medical teams, multidisciplinary biomedical research, etc. can be realized given the present level of technology support.
The last generation of healthcare projects completed in the EC and in the US has demonstrated the viability of data management techniques based on the Grid [1], [2], [3], [4]. They have shown that it is possible for clinicians to share data and processing between institutions and even across national boundaries, but they have not addressed the constraints of data heterogeneity, the linkage between biological and medical data, the use of discovered knowledge, or the handling of data that evolves as patients change (especially growth in children). The recently initiated Health-e-Child [5] project is the first step in filling the gap between current practice and the needs of modern health provision and research. Its goal is to eventually overcome the above constraints of today's systems and empower clinicians to advance their profession. Figure 1 illustrates the overall conception of Health-e-Child: today's ever-advancing medical sensing technologies generate increasing amounts of information about the child population, encapsulating multiple vertical levels of information, from the molecular, cell and tissue levels to the organ, individual and population levels. In particular, the project will develop and test three enabling tool sets for the exploitation of vertically integrated data: disease modelling, decision support systems and services, and knowledge discovery methods and systems. Traditional solutions exist in each of these areas. The novelty of the Health-e-Child platform and enabling tool sets lies in the “vertical” aspect:
• The disease models are integrated, i.e., they take multiple levels of biomedical information as inputs, including genetic information;
• The decision support systems utilize all biomedical information available for the patient;
• The knowledge discovery modules exploit whatever information is present across multiple heterogeneous databases, including not only traditional but also emerging sources of information, such as molecular or epidemiological data.
2. The Health-e-Child Vision
The vision is for the Health-e-Child system to become the universal biomedical knowledge repository and communication conduit for the future, a common vehicle by which all clinicians will access, analyze, evaluate, enhance, and exchange biomedical information of all forms. Clearly, any effort towards this vision requires significant change in the biomedical information management strategies of the past, with respect to functionality, operational environment, and other aspects. Contrary to current practice, the vision requires that the Health-e-Child system be characterized by the following:
1. Universality of information: Health-e-Child should handle “all” relevant medical applications, managing “all” forms of biomedical content. Such breadth should be realized for all dimensions of content type: biomedical abstraction (from genetic, to clinical, to epidemiological), temporal correlation (from current to longitudinal), location origin (any hospital/clinical facility), conceptual abstraction (from data to information to knowledge), and syntactic format (any data storage system, from structured database systems to free-text medical notes to images).
2. Person-centricity of information: Health-e-Child should synthesize all information that is available about each person into a cohesive whole. This should form the basis for personalised treatment of the individual, for comparisons among different individuals, and for identifying different classes of individuals based on their biomedical information profile.
3. Universality of application: Health-e-Child should comprehensively capture “all” aspects of “all” biomedical phenomena, diseases, and human clinical behaviours. This includes growth patterns of healthy or infected organic bodies, correlations of genotype/phenotype under several conditions of health, normal and abnormal evolution of human organs, and others.
4. Multiplicity and variety of biomedical analytics: Health-e-Child should provide a rich and broad collection of sophisticated analysis and modelling techniques to address the great variety of specialized needs of its applications. It should synthesize several suites of disease models, decision trees and rule systems, knowledge discovery and data mining algorithms, biomedical similarity measures, ontology integration mappings, and other analytical tools, so that clinicians may obtain multi-perspective views of the problems of concern.
5. Person-centricity of interaction: The primary concern of any user interaction with Health-e-Child should be the persons involved. This should be realized at three levels at least. First, the system should facilitate clinicians in easily identifying or generating all information that is pertinent to their activity, and should only offer them support for their decision making, not direct decisions. Second, it should protect the privacy of the person whose data is being accessed and manipulated. Third, it should allow biomedical information exchanges and information-based collaborations among clinicians.
6. Globalness of distributed environment: Health-e-Child should be a widely distributed system, through which biomedical information sources across the world become interconnected to exchange and integrate their contents.
7. Genericity of technology: For economy of scale, reusability, extensibility, and maintainability, Health-e-Child should be developed on top of standard, generic infrastructures that provide all the common data and computation management services required. In the same spirit, all integration, search, modelling, and analysis functionality that Health-e-Child itself incorporates should be based on generic methods as much as possible. Any specialized functionality should be developed in a customized fashion on top of them.
The Health-e-Child proposal aims at developing a first version of the vision. It focuses on key instances of the above; the emphasis of the Health-e-Child effort is on “universality of information”, and its cornerstone is the integration of information across biomedical abstraction, whereby all layers of biomedical information (i.e., the genetic, cell, tissue, organ, individual, and population layers) are vertically integrated to provide a unified view of a person's biomedical and clinical condition. This drives the research and technology directions pursued with respect to all other target features of Health-e-Child, in particular, temporal and spatial information integration, information search and optimization, disease modelling, decision support, and knowledge discovery and data mining, all operating in a distributed (Grid-based) environment. Each one of these areas presents novel technical challenges, which the current state of the art cannot meet when the relevant issues are examined in conjunction with vertically integrated biomedical information.
3. Project Aims and Objectives
The general objectives of Health-e-Child are the following:
• To gain a comprehensive view of a child's health by vertically integrating biomedical data, information, and knowledge that spans the entire spectrum from genetic to clinical to epidemiological;
• To develop a biomedical information platform, supported by sophisticated and robust search, optimization, and matching techniques for heterogeneous information, empowered by the Grid;
• To build enabling tools and services on top of the Health-e-Child platform that will lead to innovative and better healthcare solutions in Europe:
  a. integrated disease models exploiting all available information levels;
  b. database-guided biomedical decision support systems provisioning novel clinical practices and personalised healthcare for children;
  c. large-scale, cross-modality, and longitudinal information fusion and data mining for biomedical knowledge discovery.
The project focus will be on individualized disease prevention, screening, early diagnosis, therapy and follow-up of paediatric heart diseases, inflammatory diseases, and brain tumours. The project will build a Grid-enabled European network of leading clinical centres that will share and annotate biomedical data, validate systems clinically, and diffuse clinical excellence across Europe by setting up new technologies, clinical workflows, and standards. Paediatrics adds a temporal dimension along which biomedical information changes at different speeds for the different layers of biomedical abstraction, whose vertical integration therefore faces further challenges. The particular diseases chosen correspond to largely uncharted territories, where significant impact is expected from any major advance in our understanding of them; they also represent a broad spectrum of technology requirements, thus ensuring genericity and broad applicability of the end result.
4. Data Integration Philosophy
One important aim of the Health-e-Child project is to provide an integrated healthcare platform for European paediatrics and this paper outlines first ideas in how this platform will be designed. As stated earlier, this platform will enable the modelling and integration of relevant biomedical sources across different diseases or patient levels and the development of a Grid-based service-oriented environment to manage distributed and shared heterogeneous biomedical data and knowledge sources. It will also enable the use of integrated decision support and knowledge discovery systems but this is beyond the scope of this architecture paper. The main data management effort in Health-e-Child is focused on building a comprehensive data, medical information and knowledge-discovery infrastructure for various higher-level components of the Health-e-Child system as recommended in [6], [7]. The design philosophy is founded on four cornerstones: 1. the Grid middleware, which provides the virtual foundation for flexible, secure, coordinated sharing of distributed resources. 2. the modelling and integration of relevant biomedical data sources for improved medical knowledge discovery and understanding
3. a Grid-enabled service-oriented gateway that is responsible for data access and management of Health-e-Child acquired and integrated data; and
4. a medical query processing environment that provides the necessary indexing, search and processing facilities, in the form of algorithms, methods and metrics, for identifying information, knowledge and data fragments that are relevant to a particular request.
The biomedical data sources referred to in point 2 cover several vertical levels (from cellular information through organ information to patient and population information), and Health-e-Child will develop data and knowledge models integrating across these levels. We refer to this core project effort as "vertical data integration in the medical domain"; it mainly covers point 2 above. Ontologies [8], [9] will be used to formally express the Health-e-Child medical domain, to improve communication of domain concepts among domain components, and to assist in the integration process. Moreover, ontologies provide semantic coherence for the integrated data model, as ontological commitments will be expected from the Health-e-Child components. In the first few months of the project, ontology software and other software technologies are being acquired, tested and evaluated in the context of the project's user requirements, especially with regard to the integration of clinical data. The ontology-guided semantic integration for generating case data will investigate the following research stages: mapping discovery, declarative formal representation of mappings, and reasoning with mappings.
4.1 Mapping Discovery
This stage covers identifying similarities between ontologies in order to determine, (semi-)automatically, which concepts and properties represent similar notions across heterogeneous data samples. Mapping discovery is one of the major bottlenecks in generating viable integrated case data. There exist two major approaches to it:
1. A top-down approach. This approach is applicable to ontologies with a well-defined goal. Such ontologies usually share an upper-level (top) ontology that developers of different applications can agree on; these developers can then extend the upper-level ontology with application-specific terms. Examples of this approach are the Suggested Upper Merged Ontology (SUMO) [10] from the IEEE Standard Upper Ontology Working Group, and DOLCE [11].
2. A heuristics approach. This approach uses lexical and structural components of definitions to find correspondences heuristically. For example, [12] describes a set of heuristics used for semi-automatic alignment of domain ontologies to a large central ontology. PROMPT [13] supports ontology merging, guides users through the process and suggests which classes and properties can be merged. FCA-Merge [14] supports a method for comparing ontologies that have a set of shared instances. IF-Map [15] identifies mappings automatically from the information flow and generates a logic isomorphism [16].
Based on medical ontologies, e.g. [17], [18], [19], Health-e-Child will investigate mapping heuristics for integrated case data. We will evaluate the relative quality of several of these mapping discovery methods for integrated case data, and will then provide an optimal combination of the best methods with respect to accuracy and computation time.
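As a concrete illustration of the heuristics approach, the sketch below pairs concepts from two ontologies by lexical similarity of their normalised names. The ontology fragments and the threshold are invented for illustration; production systems such as those cited above also exploit structural and instance-level evidence, which this toy matcher deliberately omits.

from difflib import SequenceMatcher

def normalise(term: str) -> str:
    # Lower-case and strip common word separators before comparison.
    return term.lower().replace("_", " ").replace("-", " ")

def discover_mappings(onto_a, onto_b, threshold=0.8):
    # Pair concepts whose normalised names are sufficiently similar.
    candidates = []
    for a in onto_a:
        for b in onto_b:
            score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
            if score >= threshold:
                candidates.append((a, b, round(score, 2)))
    return sorted(candidates, key=lambda t: -t[2])

cardiology = ["HeartRate", "Body_Weight", "LeftVentricle"]
paediatrics = ["heart rate", "body weight", "ventricle-left"]
# Finds the first two pairs but misses the reordered third, which is why
# real systems combine lexical cues with structural and instance evidence.
print(discover_mappings(cardiology, paediatrics))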
4.2 Declarative Representation of Mappings
This stage represents the mappings between ontologies in a form that enables reasoning with them. The high expressive power of medical ontology representation languages provides opportunities to represent the mappings in more expressive terms. There are several approaches to mapping representation: instance-based [15] and [20], axiom-based [21] and view-based [22]. In axiom-based mapping representation, the correspondence between two ontologies is expressed as a set of formal logical axioms relating the classes and properties of the two source ontologies. Logical axioms are thus essentially deduction rules for bridging two ontologies, relating concepts from one source ontology to another and using a theorem prover for mapping deduction. In instance-based mapping representation, the correspondences between two ontologies are declaratively represented as transformation functions over instances. Mappings are thus essentially transformation rules for linking source ontologies to targets. In view-based mapping representation, the correspondences are represented as views, similar to database view definitions in information integration: the correspondences between two ontologies are defined as queries in terms of view definitions. We will investigate a logic-based approach to combining these mapping representation methods in a general logic framework.
4.3 Reasoning with Mappings
The final stage considers actions on the mappings once they are defined, and the identification of the types of reasoning that can be developed to generate integrated case data. Many semantic integration tasks based on reasoning have been proposed: ontology projection and extension [21], ontology merging [13] and Description Logics (DL) reasoning [22]. We will investigate ontology linkage to support the generation of suitable integrated case data.
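The sketch below illustrates only the instance-based flavour of mapping representation: a correspondence is a transformation function over instances. The ontology names, units and fields are invented; an axiom-based representation would instead state the same correspondence as a logical bridge rule.

def map_height_record(src: dict) -> dict:
    # Transform an instance of the (invented) concept
    # OntologyA:HeightMeasurement (in cm) into OntologyB:BodyHeight (in m).
    return {
        "concept": "OntologyB:BodyHeight",
        "value": src["value_cm"] / 100.0,
        "unit": "m",
    }

instance = {"concept": "OntologyA:HeightMeasurement", "value_cm": 170}
print(map_height_record(instance))
# {'concept': 'OntologyB:BodyHeight', 'value': 1.7, 'unit': 'm'}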
Figure 2: Initial Health-e-Child Data Architecture
5 Initial Architecture Design
Figure 2 shows a high-level representation of the draft Health-e-Child architecture. The Grid Gateway, the EGEE Grid Middleware and the Grid Data Service Factory (GDSF) represent the main software components of data management. The high-level decision support and knowledge discovery components (DSS and KDD of Figure 1) are served by the Health-e-Child Portal, with the Grid Gateway providing the infrastructure services for accessing and sharing the distributed data and knowledge on the Grid. Data includes Application Models (MDL), Integrated Case Data (ICD) and Pattern and Trend data (PAT), retrieved through a Data Access System (DAS). Currently the EGEE software [23] is envisaged to provide the Grid platform, since it will provide 'standardised' access to components for Grid access, monitoring, job management, security and data management. Health-e-Child will be designed to provide services which interface to generic Grid services, thereby allowing the use of EGEE when it becomes available and providing a migration path to future Grid standards as and when they appear, a philosophy previously shown to be a success in the MammoGrid project [3]. This would also enable the interoperation of the Health-e-Child application with other Grid middleware, as in [6]. Health-e-Child aims to deliver a set of semantically rich models for integrated biomedical data and knowledge. The models produced will be novel in that they will unify semantically remote layers (from genetic through to clinical), thus spanning a domain which has never before been fully captured. This will rely heavily on the expertise of data modellers, computer science experts in data and knowledge modelling, domain experts and clinical partners, and will take advantage of the advances made by the INFOGENMED project [24] as described in section 6. It will provide a set of analysis and design models that facilitate the integration of relevant biomedical sources for improved medical knowledge discovery and understanding. The integration and modelling activities of Health-e-Child will take into consideration issues that are typically addressed by data integration projects (in particular, we shall also investigate the role of 'wrappers' and/or 'translators' in the semantic integration of biomedical information). Foremost is the issue of heterogeneity with respect to the source data schemata and the database management system products hosting the databases. The Health-e-Child project will have several medical institutions contributing diverse biomedical data for the different vertical levels. It is likely that the data sources for each level will have different schemata, use different software packages and apply varying types of access control. A necessary first step towards bringing these disparate sources together is to identify the core entities for each level; an intermediary data model per level will be proposed to capture the entities' structures. This effort will rely on reviewing and adopting existing ontologies of the subdomains and will benefit from work in other current biomedical projects such as MYGRID [25] and in initiatives such as WSMF [26]. The next step is to identify the relevant structures to unify the six levels (i.e. genetic, cell, tissue, organ, individual, and population) into a single data model whose semantics is captured by the integrated ontology.
The integrated data model captures the structural representation of the level entities and their relationships, and provides a coherent view of the integrated domain. In addition to heterogeneity, other issues related to integrating distributed heterogeneous data sources will be addressed. Some of these issues are data-related (e.g. distribution, acquisition, normalisation, aggregation, curation), access-related (e.g.
transaction management, query processing), network-related (e.g. location, fragmentation), or concern privacy and security. Existing integration solutions will be consulted, including data mediators and layered and meta-modelling approaches (including the CRISTAL [27] approach used in the MammoGrid project). Medical standards (e.g. HL7, DICOM, H12) and informatics standards (e.g. UML, XML, SQL, ODMG) will be followed as closely as possible, where applicable and feasible. To ensure the semantic coherence of the integrated data model, the formal concepts will be documented as ontological commitments. The ontological foundation of the analysis model will ensure the consistency of the design model used for the data integration and will form the basis for the medical knowledge management system. The formal conceptualisation of semantics will rely on existing medical ontologies, and the relationship to those will be documented. The emerging framework of semantic models will be the foundation for the knowledge model behind the knowledge sharing and mining infrastructure of the Health-e-Child system. The Grid Gateway is responsible for data management and distribution; it provides access to data mining, knowledge discovery and optimisation algorithms; and it incorporates and integrates raw medical data from various data sources. For client user interfaces and decision support components, the API and services of the Health-e-Child Grid Gateway provide an abstraction of the underlying Grid middleware, data resources and data management mechanisms. The data management layer will comprise a set of services for exploiting the information supplied by different medical information repositories. These services offer the following functionality:
• Information and knowledge extraction. The information includes, amongst others, metadata that can constitute part of the external schema, temporal attributes used for content tracking, and keywords to be used for indexing and content querying;
• Repository content tracking. This component registers both the changes discovered in the stored data items and the modifications of the schemata of the data sources. The tracking mechanism is used to maintain the integrity of the information and the external global schema;
• A security level processor. This is in charge of ensuring the coherence of user privileges and provides mechanisms to define user profiles over the global schema. The restrictions imposed on the global schema are later translated into local data source restrictions, and vice versa. The security mechanism will extend the one provided by the underlying Grid middleware to fit the particularly strict requirements of the medical domain;
• Integration of applications and of different databases such as Oracle, DB2, LDAP directories, etc.;
• Global database management: integrity functions, corruption detection, index rebuilding, etc.;
• A data view which is fully conformant to the model incorporating vertical medical integration developed by WP6.
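A hedged sketch of the kind of facade such a data management layer could present is given below. Every class, method and field name is our own illustrative assumption; the paper does not specify the actual Health-e-Child API.

class GridGateway:
    """Abstracts the Grid middleware, data resources and data management
    mechanisms for portal and decision-support clients (a sketch)."""

    def __init__(self, sources):
        self.sources = sources          # source_id -> list of records
        self.change_log = []            # repository content tracking

    def extract_metadata(self, source_id):
        # Information extraction: derive indexing keywords for a source.
        keywords = set()
        for record in self.sources[source_id]:
            keywords.update(record.keys())
        return sorted(keywords)

    def track_change(self, source_id, description):
        # Content tracking: register a data or schema modification so the
        # external global schema can be kept consistent.
        self.change_log.append((source_id, description))

    def authorise(self, user_profile, concept):
        # Security level processor: check a user profile's global-schema
        # restrictions before exposing a concept.
        return concept not in user_profile.get("denied_concepts", set())

gateway = GridGateway({"hospital-a": [{"rv_volume_ml": 58, "age": 7}]})
gateway.track_change("hospital-a", "added field: bsa_m2")
print(gateway.extract_metadata("hospital-a"))                  # ['age', 'rv_volume_ml']
print(gateway.authorise({"denied_concepts": {"age"}}, "age"))  # False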
6 Related Projects
Initiatives from which Health-e-Child is expected to benefit include the BIRN [28] project in the US, which is enabling large-scale collaborations in biomedical science by
utilizing the capabilities of emerging Grid technologies. BIRN provides federated medical data: it enables a software 'fabric' for the seamless and secure federation of data across the network and facilitates the collaborative use of domain tools and flexible processing/analysis frameworks for the study of Alzheimer's disease. The INFOGENMED initiative [24] has given a lead to projects moving from genomic information to individualized healthcare, and Health-e-Child will build on its findings in vertical data modelling. The IHBIS project [2] has proposed a broker for the integration of heterogeneous information sources, in order to collect, protect and assemble information from electronic records held across distributed healthcare agencies; this philosophy is one that will also be investigated in Health-e-Child. In addition, the FAME-PERMIS project [29] is currently developing a flexible authentication and authorization framework to cope with security issues in a healthcare environment, aspects that are important in the delivery of the Health-e-Child prototypes. Finally, the CDSS project [1], a Grid-implemented system that uses knowledge extracted from clinical practice to provide a classification of patients' illnesses, clearly informs the decision support elements of the Health-e-Child project. Furthermore, the MYGRID project [25] indicates the benefits of an ontological approach to federated data access on the Grid in the bioscience domain. MYGRID uses a web services approach as its underlying distributed systems infrastructure, with the intention of migrating to Globus/OGSA-based solutions at a later date. It uses OWL [30], based on description logic, as its ontology language, as opposed to the recently proposed Web Service Modelling Ontology (WSMO), which is based on first-order logic [31]. WSMO builds on the Web Service Modelling Framework [32] and will enable the realization of true semantic web services, the next step in allowing Grid-based ontology mediation. Such developments are the first step in the provision of autonomous Semantic Grid systems. By adopting an ontology-based solution to unifying genetic/genomic data with patient/clinical data, the Health-e-Child project is expected to take an active role in influencing the future of biomedical ontology-based Grid solutions.
7 Conclusions – The Way Ahead
To reach its goals, Health-e-Child must innovate in diverse scientific areas. In particular, the scientific and technological objectives of Health-e-Child are to advance the state of the art in the following areas:
• Translation, mapping, and matching of biomedical metadata (ontologies and other semantic metadata forms as well as syntactic structures) for vertical data integration;
• Vertically integrated information modelling and knowledge representation;
• Personalised models integrating all information sources about a patient's health and disease biases;
• Multi-objective optimization of biomedical information search;
• Efficient similarity indexing and information retrieval from vertically integrated biomedical databases;
• Autonomous, consistent, unbiased, and validated biomedical data analysis algorithms;
• Cluster analysis, feature sensitivity analysis, data mining and association, and knowledge discovery from vertically integrated biomedical data;
• Robust information fusion algorithms leading to new decision methods from integrated biomedical data;
• Personalised methods for risk assessment, diagnosis, prevention and therapy; robust statistics and multiple hypothesis prediction for assessment of therapeutic response.
The novel techniques resulting from research in the areas above will be implemented and incorporated into the overall Health-e-Child system. The latter will be a distributed data and computation management system based on the Grid architecture. It will be built on top of the middleware developed as part of the EGEE project and, for security reasons, it will operate in an infrastructure that will be private to the project. Primary information will be collected at the sites of the three hospital partners and, after appropriate anonymization and other necessary forms of curation, will be available for manipulation and processing by the rich stack of Health-e-Child modules. At the other end, clinicians will drive the system, using enabling tools to obtain second opinions on particular clinical cases or to identify interesting patterns in the available data while studying particular phenomena as part of biomedical research. In both scenarios, Health-e-Child will be a powerful tool in the hands of the clinician, bridging the gap between the latter's conception of the biomedical problem at hand and the information available to support the various alternatives for its solution. This paper has outlined the challenges facing the Health-e-Child project, identified the project's aims and objectives, and highlighted its design strategy. In addition, it has indicated the first steps being taken in the project to deliver an integrated platform for paediatrics that will become the foundation for future Grid-based biomedical solutions.
Acknowledgements The authors thank the European Commission and their institutes for support and particularly acknowledge the contribution to this paper of the following Health-e-Child consortium members: Alok Gupta (Siemens AG), Alberto Martini and Paolo Toma (IRCCS Giannina Gaslini, Genoa), Younes Boudjemline (Assistance Publique Hopitaux de Paris), Catherine Owens (Great Ormond St Hospital, London), Florida Estrella (CERN), David Manset and Alfonso Rios (Maat GKnowledge), Tamas Hauer and Dmitry Rogulin (UWE), Alessandro Verri (DISI) and Alessandro Sattanino (Lynkeus).
References
[1] I. Blanquer et al., "Clinical Decision Support Systems (CDSS) in GRID Environments". Accepted for publication at the 3rd International HealthGrid Conference, Oxford, April 2005. Eds. T. Solomonides & R. McClatchey, IOS Press Studies in Health Technology and Informatics.
[2] D. Budgen et al., "Managing Healthcare Information: the Role of the Broker". Accepted for publication at the 3rd International HealthGrid Conference, Oxford, April 2005. Eds. T. Solomonides & R. McClatchey, IOS Press Studies in Health Technology and Informatics.
[3] The Information Societies Technology Project: MammoGrid – A European Federated Mammogram Database Implemented on a Grid Infrastructure. EU Contract IST-2001-37614. http://mammogrid.vitamib.com
[4] GEMSS: Grid-Enabled Medical Simulation Services. http://www.gemss.de
[5] The Information Societies Technology Project: Health-e-Child. EU Contract IST-2004-027749.
[6] The HealthGrid White Paper, available from: http://www.healthGrid.org/download.php
[7] S. Nørager & Y. Paindaveine, "The HealthGrid Terms of Reference", EU Report Version 1.0, 2002.
[8] J. Sowa, Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1984.
[9] D. Fensel, "Ontologies: A Silver Bullet for Knowledge Management & Electronic Commerce". Springer-Verlag, 2000.
[10] I. Niles and A. Pease, "Towards a Standard Upper Ontology". Proceedings of the Second International Conference on Formal Ontology in Information Systems, Ogunquit, Maine, 2001.
[11] A. Gangemi, N. Guarino, C. Masolo and A. Oltramari, "Sweetening Wordnet with DOLCE". AI Magazine, 24(3):13–24, 2003.
[12] E.H. Hovy, "Combining and Standardizing Large-Scale, Practical Ontologies for Machine Translation and Other Uses". In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC '98), Granada, Spain, 1998.
[13] N.F. Noy, "Semantic Integration: A Survey of Ontology-Based Approaches". ACM SIGMOD Record, Vol. 33, No. 4, December 2004, Special Issue on Semantic Integration.
[14] G. Stumme and A. Maedche, "FCA-Merge: Bottom-up Merging of Ontologies". In 17th Intl. Joint Conf. on Artificial Intelligence (IJCAI '01), pages 225–230, Seattle, WA, 2001.
[15] M. Crubezy and M.A. Musen, "Ontologies in support of problem solving". In S. Staab and R. Studer, editors, Handbook on Ontologies, pages 321–342. Springer, 2003.
[16] J. Barwise and J. Seligman, Information Flow: The Logic of Distributed Systems. Cambridge University Press, 1997.
[17] UMLS. http://www.nlm.nih.gov/research/umls/
[18] GALEN. http://www.galen.org/
[19] GO. http://www.geneontology.org/
[20] A. Maedche et al., "MAFRA – A Mapping Framework for Distributed Ontologies". In 13th European Conf. on Knowledge Engineering and Knowledge Management (EKAW), Madrid, Spain, 2002.
[21] D. Dou, D. McDermott and P. Qi, "Ontology translation on the semantic web". In International Conference on Ontologies, Databases and Applications of Semantics, 2003.
[22] D. Calvanese, G. Giacomo and M. Lenzerini, "Ontology of integration and integration of ontologies". In Description Logic Workshop (DL 2001), pages 10–19, 2001.
[23] EGEE: Enabling Grids for E-Science in Europe. http://public.eu-egee.org
[24] J.L. Oliveira et al., "DiseaseCard: A Web-Based Tool for the Collaborative Integration of Genetic and Medical Information". Biological and Medical Data Analysis, 5th International Symposium, ISBMDA 2004, Barcelona, Spain, November 18–19, 2004.
[25] The UK e-Science project: "MYGRID – Directly Supporting the E-Scientist". http://mygrid.man.ac.uk/
[26] Web Services Resource Framework (WSRF). http://www.globus.org/wsrf/
[27] F. Estrella, Z. Kovacs, J-M. LeGoff and R. McClatchey, "Metadata Objects as the Basis for System Evolution". LNCS Vol. 2118, pp. 390–399, Springer-Verlag, 2001.
[28] M. Ellisman et al., "Biomedical Informatics Research Network". Accepted for publication at the 3rd International HealthGrid Conference, Oxford, April 2005. Eds. T. Solomonides & R. McClatchey, IOS Press Studies in Health Technology and Informatics.
[29] N. Chang, J. Chin, A. Rector, C. Goble and Y. Li, "Towards an Authentication Middleware to Support Ubiquitous Web Access". Proc. of the 28th Annual Int. Computer Software and Applications Conference (COMPSAC 2004), Vol. 2, IEEE Computer Society, Hong Kong, September 2004.
[30] OWL. http://www.w3.org/TR/owl-features/ and http://www.daml.org/services/owl-s/1.0/
[31] K. Iqbal (Digital Enterprise Research Institute, Ireland), presentation at the Semantic Grid Research Group, Global Grid Forum 13, Korea, March 2005.
[32] D. Fensel and C. Bussler, "WSMF: Web Services Modelling Framework".
Part V Medical Assessment and HealthGrid Applications
Grid Empowered Sharing of Medical Expertise
Martin Kuba a, Ondřej Krajíček a, Petr Lesný b, Jan Vejvalka b and Tomáš Holeček c
a Institute of Computer Science, Masaryk University, Brno, Czech Republic
b ENT Department of Faculty Hospital Motol, Prague, Czech Republic
c Faculty of Humanities, Charles University, Prague, Czech Republic
Abstract. Today, applications for Grids emerge in various scientific fields, each with specific requirements. We present a concept and architecture which enable biomedical experts to collaborate and share resources by encapsulating their knowledge and expertise as grid services with (semi-)formally described semantics. Grid services allow machine processing of the encapsulated knowledge, while their semantic description provides the means for their automated discovery and interaction. This brings new possibilities for building biomedical systems offering machine-driven assistance to biomedical experts.
Keywords. semantic grid, knowledge sharing, expert modules, workflow
1. Introduction
The vision of the Grid was originally motivated by the High Performance Computing community's need to pool resources in order to have more computing power and to be able to solve computationally intensive problems in less time. Over time, however, more diverse communities became interested in the Grid, not because they needed more computational power, but because they needed the sharing of resources across organizational boundaries, which the Grid enables; and these resources were not only processors and disk storage, but also remotely controlled instruments, information and knowledge. That is why grids are sometimes classified into the categories of computational grids (aggregating computational power), data/information/knowledge grids (sharing information) and collaborative grids (creating virtual environments for cooperation among geographically dispersed humans) [1]. The various communities interested in the Grid have different and often contradictory demands. For example, computational physicists need lots of computational power, but they are not very interested in the security of their data, nor do they require short-term constraints on the delivery of results. On the other hand, for using grids in health care, computational power is not that important, but responsiveness of resources and, most importantly, security of data are essential. In this paper we present an architecture of a grid which falls somewhere between knowledge grids and collaborative grids, because it enables biomedical experts to use other experts' knowledge encapsulated as grid services (the knowledge part), while communicating with yet other experts in real time (the collaborative part).
Our project, called MediGrid, brings together experts in biomedicine and experts in grid technology to work on a prototype biomedical grid environment, intended for integrating knowledge and supporting cooperation among experts from various areas of medicine.
2. Sharing biomedical expertise using a grid environment
Our work is motivated by the need of medical experts to share effectively that type of their knowledge which can be expressed as algorithmic assessment of available data, and thus provided as data-processing services to others. Collaboration of medical experts can be performed in two possible ways, single-user or multi-user. The first option is that a single medical expert uses the provided data-processing services without contacting other experts; the collaboration then lies in the provision of the automated services. The second option is that several medical experts use videoconferencing tools and a shared desktop integrated with the grid environment to cooperate while selecting appropriate services and processing data. For example, consider the following scenario. One medical expert may develop a formula which captures some relation among a person's characteristics (like weight and height to skin surface area), a second expert may develop a formula which captures the relation of drug dosages to skin surface area, and a third expert has a patient's weight and height and needs to compute a drug dosage. In the conventional case, the knowledge captured in such formulas must be transferred to the third expert, who must then calculate manually, or write a program to calculate, the results according to the formulas. In a slightly more advanced scenario, the first two experts may provide the formulas as independent calculators on web pages, as Excel spreadsheets or as Windows programs, but the third expert still has to enter the values manually, copy outputs of the first formula as inputs to the second formula and record the final results. And in every case, the third expert must first discover that those formulas exist at all. In our grid, the knowledge captured in such formulas may be expressed as grid services, and the discovery of such services, the feeding of data to the services, the interconnection of the services and the collection of the results can be automated. So, in the grid scenario, the third medical expert uses machine assistance to find and interconnect the services, perhaps discussing the selection or the obtained results with other experts using integrated videoconferencing tools. Data processing is done automatically.
2.1. Semantic grid and workflows
The technologies developed so far for the other types of Grid provide the basic level of guarantees needed for crossing organizational boundaries: web-services based implementation of grid services provides universal compatibility across all platforms, thus providing independence from any particular vendor, hardware, operating system or programming language. Grid security measures – data encryption, digital signatures, public key infrastructures for authentication, personal credential delegation – help in building a secure distributed system. However, the discovery and interconnection of services encapsulating expertise needs advanced grid techniques. One of them is machine processing of the meaning (i.e.
semantics) of data. This is where techniques developed for the Semantic Grid and Semantic Web Services [3] can be utilized. The inputs and outputs of a service can be annotated as having some particular meaning by using references to entities in formally written ontologies (for example using URIs, Uniform Resource Identifiers, referring to ontologies written in the standardized OWL language [4]). Tools can then be built which compare whether some input and output have the same meaning, whether one is a generalization of the other, or whether they are incompatible. When such relationships can be decided, the outputs of some services can be used as inputs of other services, composing more complex workflows. Continuing the previous example, the two services can be composed into a workflow which computes drug dosage from weight and height. Thus the utility of the combined system is significantly greater than the sum of its parts, which is a defining mark of a grid [2].
2.2. Trust management
In a grid, all communicating parties must be authenticated, i.e. must provide their identity. However, just knowing the identity of expertise providers may not be enough for selecting trustworthy services. The scale of the Grid and the possibility of ad hoc cooperation require an additional trust management framework, which helps users to decide how credible the other parties are. For example, authentication establishes that some service is provided by John Doe, employee of Biocyte Pharmaceuticals, but that does not provide enough information about how credible his services may be. That information must be asserted by other parties, such as other users or certification committees, whose credibility must in the final instance be judged by the user. A trust management framework is thus essential for scalability beyond small groups which cooperate regularly.
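The following sketch illustrates the composition step described in section 2.1: each service annotates its inputs and outputs with ontology URIs, and a simple matcher chains services whose produced concept satisfies a required one. The URIs, the use of the Mosteller body-surface formula as a stand-in for the first expert's formula, and the 40 mg/m2 dose rule are all invented for this example; a real matcher would also handle generalization between concepts, not just equality.

# Ontology URIs annotating service inputs/outputs (invented examples).
HEIGHT = "http://example.org/onto#BodyHeight"        # cm
WEIGHT = "http://example.org/onto#BodyWeight"        # kg
SKIN_AREA = "http://example.org/onto#SkinSurfaceArea"  # m^2
DOSAGE = "http://example.org/onto#DrugDosage"        # mg

def mosteller_surface(inputs):
    # Mosteller formula: sqrt(height_cm * weight_kg / 3600), in m^2.
    return (inputs[HEIGHT] * inputs[WEIGHT] / 3600.0) ** 0.5

def dosage_from_surface(inputs):
    return 40.0 * inputs[SKIN_AREA]   # invented rule: 40 mg per m^2

services = [
    {"consumes": {HEIGHT, WEIGHT}, "produces": SKIN_AREA, "run": mosteller_surface},
    {"consumes": {SKIN_AREA}, "produces": DOSAGE, "run": dosage_from_surface},
]

def compose_and_run(goal, facts):
    # Greedily invoke any service whose inputs are all known,
    # until the goal concept has a value.
    facts = dict(facts)
    progress = True
    while goal not in facts and progress:
        progress = False
        for s in services:
            if s["produces"] not in facts and s["consumes"] <= facts.keys():
                facts[s["produces"]] = s["run"](facts)
                progress = True
    return facts.get(goal)

print(compose_and_run(DOSAGE, {HEIGHT: 170, WEIGHT: 72}))  # approx. 73.8 mg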
3. Adapting biomedical information for the grid
3.1. Biomedical ontology
In order to implement algorithmic medicine (e.g. the aforementioned 'formulas') in the grid environment, we need a solid biomedical ontological foundation. Ontology, as a branch of philosophy, is the science of what is: of the meanings and structures of objects, properties, events, processes and relations in every area of life. The term biomedical ontology refers to attempts at structuring biomedical domain knowledge, e.g. disease taxonomies, medical procedures and anatomical terms, in a wide variety of medical terminologies, thesauri and classification systems. The best known of these systems are ICD [10], SNOMED-CT [11], GO [12], MeSH [14] and UMLS [13]. These systems usually consist of a thesaurus of biomedical terms or concepts on one side and a set of relationships between them on the other. The basic hierarchical link between the thesaurus elements in most of these systems is the is_a relationship: if one element is_a another element, then the first element is more specific in meaning than the second. Other widely defined relationships [13] are part_of, result_of, consist_of or associated_with. These relationships are perfectly suitable for describing semantic, functional or topological structures. However, in order to employ the grid in biomedicine, we need an ontological description of the biomedical
workflow. Although the mentioned systems can be utilized (in a very limited way) to classify the elements of data records, their limited set of relationship operators is insufficient for describing the formulas or algorithms utilized in medicine.
3.2. Indicators
The key concept in processing biomedical data workflows and in describing the algorithms or formulas is the perception of data as indicators. Indicators are records which were created by someone (including records created with the help of machines) in order to be read and used by someone. The idea of indication is taken from Edmund Husserl [15]:
A thing is ... properly an indication if and where it in fact serves to indicate something to some thinking being ... a common circumstance [is] the fact, that certain objects or states of affairs of whose reality someone has actual knowledge indicate to him the reality of certain other objects or states of affairs, is the sense that his belief in the reality of the one is experienced (...) as motivating a belief or surmise in the reality of the other. [Italics by E. H.]
E.g. if a medical specialist reads the indicator "allergy to penicillin", it motivates him or her to be careful about antibiotic prescription. All data recorded or exchanged in biomedicine are indicators, and biomedical algorithms can be viewed as transformations of indicators. Relations between indicators are the potentialities of their transformation. If we can transform the indicators "body height is 170 cm" and "body weight is 72 kg" into the indicator "Body Mass Index is 24.9 kg/m2", we can express this as a relation between the three given indicators in a given context.
3.3. Indicator ontology
Indicators can be categorized into indicator classes, where an indicator class is a set of all possible indicators determined by a specific concept. Subsequently we can view biomedical algorithms as relationships between indicator classes (in the above example, the indicator classes of height measurement, weight measurement and BMI determine the relation "calculation of the BMI from body height and weight"). The ontology of indicators (Indicator Ontology, IO) is therefore parallel to traditional systems: where traditional systems manipulate entities, IO uses indicator classes; relationships between entities in traditional systems are parallel to relations between indicator classes in IO. Although both the traditional systems and the IO use concepts, their meaning is significantly different: whereas traditional systems usually implement relations between concepts, IO uses concepts only to determine indicator classes; relations in IO are instead defined between indicator classes.
3.4. Expert Modules
A key building block for our expertise-sharing grid is the expert module. An expert module is an entity which encapsulates a self-contained knowledge element, such as an algorithm or, in the ontological context, a relationship among indicator classes. In a functional sense, a module is a process with input and output, translating its input data to the output. When we use the module to perform some task, the module is said to be invoked.
For example, an expert module may encapsulate the knowledge of how to transform a tuple of indicators, where one is in the indicator class height measurement and the other in the class weight measurement, into an indicator in the class Body Mass Index. Or, to use the example from section 2, two modules can be created, one transforming tuples of indicators in the classes weight measurement and height measurement to indicators in the class skin surface area, and a second transforming tuples of indicators in the classes skin surface area and prescribed drug to indicators in the class drug dosage. An expert module thus represents a piece of biomedical expertise, and can be implemented as a grid service so that it may be shared. A minimal sketch of such a module is given below.
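The sketch implements the first of these modules under our own naming assumptions: the Indicator type and the class labels are invented for illustration, but the transformation matches the BMI example of section 3.2.

from dataclasses import dataclass

@dataclass
class Indicator:
    indicator_class: str
    value: float
    unit: str

def bmi_module(height: Indicator, weight: Indicator) -> Indicator:
    # A relation between the indicator classes 'height measurement',
    # 'weight measurement' and 'Body Mass Index' (cf. section 3.2).
    assert height.indicator_class == "height measurement"
    assert weight.indicator_class == "weight measurement"
    metres = height.value / 100.0
    return Indicator("Body Mass Index",
                     round(weight.value / metres ** 2, 1), "kg/m2")

bmi = bmi_module(Indicator("height measurement", 170, "cm"),
                 Indicator("weight measurement", 72, "kg"))
print(bmi)  # Indicator(indicator_class='Body Mass Index', value=24.9, unit='kg/m2')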
4. Architectural aspects of expert modules
4.1. Module specification
The module is required to provide a specification (i.e. description) of itself. The specification consists mainly of these key components:
• interface syntax (the definition of the structure of input and output data),
• semantic description of input/output data,
• operational semantics (explicit specification of the implemented functionality),
• executional semantics: pre- and postconditions.
This specification, however complex it may seem, has a simple motivation. Each module must clearly define the structure of its input and output data to enable users to invoke it properly. The semantic description of input and output data enhances their defined structure with information about their meaning, by mapping ontologies to individual data items (e.g. this number is a body temperature in degrees Celsius). With this information, data conversions may be implemented automatically (such as converting between different measurement units). The specification of operational and executional semantics means that each module has well-defined (described) functionality and explicitly states what conditions must be met prior to and immediately after its invocation. These conditions may, for instance, place restrictions on the input data (such as: the input body temperature must be a positive floating point number). All these descriptions play an important role in module discovery and selection. The expert is not forced to choose the module he needs based on its name or description alone; this process is aided by the computer, which is able to preselect modules based on queries over these characteristics (such as: select modules which deal with body temperature and accept the temperature in degrees Celsius).
4.2. Module state
Depending on its behavior, a module can be further characterized. This behavioral characterization is important for implementing interaction and composition (described in the next section). The fundamental idea is the notion of module internal state; we distinguish stateless and stateful modules. Stateless modules are modules whose output data (the result of the module invocation) depend solely on the input data and produce no side effects with regard to the
computation.¹ If the module uses any external data, these must be declared in the module specification. Stateful modules are modules which may have external dependencies (such as relational databases or data files). Thus, it is possible to obtain different output data for the same input data. Stateless modules usually implement simple functions which perform computations on the module input data (such as computing BMI). Stateful modules may use accumulated records or statistical data during processing (such as computing the difference from the average BMI in a population, when the population data is stored in a database). The statefulness of a module influences one important property, referred to as reproducibility of computation, which is one of the key requirements of our architecture. The meaning of reproducibility is best demonstrated by an example: assume we have an expert module with a specification, and we invoke the module at a specific time, obtaining some results. Later, we decide that we want to verify the results and invoke the module again, at a later time, with the same input data. Now the possibility exists that we obtain different output data. For stateful modules, this means that the state of the module may change independently between invocations without explicit notification, and the previous state may be lost. This is not a problem with stateless modules; however, the definition of the module may change between invocations, so different outputs may be obtained from the same input even for the same stateless module. Reproducibility for stateless modules must thus be sustained by proper identification of the module. We must be able to identify whether the module definition has changed and retrieve exactly the same module we used for the first computation.
4.3. Module identification
Basically, a module is identified by a unique name. The module name is used to refer to the module (e.g. for the purpose of invocation). This mechanism is extended by versioning. Versioning is used to track changes in module implementation and specification over time. It allows two variants of the same module to be distinguished and provides a remedy for the reproducibility issue described above. Versioning extends the name of an expert module with an identification of its version (such as a version number²), making it an integral part of the module name. A new version of a module is introduced in two cases:
• the specification of the module is changed (extended); the change may or may not break compatibility with the old version;
• the implementation of the module is changed while the specification is retained.
With versioning in effect, we can only refer to a module if we know the exact version we want to refer to. For user convenience, version aliases may be implemented, referring for instance to the most recent version of a module or to the most recent stable version (if we consider experimental versions as well).
¹ Administrative operations, such as storing information about module invocations for troubleshooting, are not considered to be side effects.
² The version numbers are usually increasing positive integers.
With versioning, every invocation of a stateless module may be reproduced with exactly the same version used in the original invocation. With a stateful module, we are able to reproduce the invocation if we can restore the module to the state it was in at the time of the original invocation. We may also introduce a compatibility scheme which explicitly states which versions are compatible (may substitute for each other) and which are not.
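The sketch below illustrates how versioned identification supports reproducibility: modules are registered under a (name, version) pair, an invocation may pin an exact version, and omitting the version falls back to a 'latest' alias. The registry design is our own assumption, not the MediGrid implementation.

class ModuleRegistry:
    def __init__(self):
        self._modules = {}   # (name, version) -> callable
        self._latest = {}    # name -> highest registered version

    def register(self, name, version, func):
        self._modules[(name, version)] = func
        self._latest[name] = max(self._latest.get(name, 0), version)

    def invoke(self, name, inputs, version=None):
        # Pin `version` to reproduce an earlier stateless invocation;
        # omit it to use the 'latest' version alias.
        v = version if version is not None else self._latest[name]
        return self._modules[(name, v)](inputs), v

registry = ModuleRegistry()
registry.register("bmi", 1, lambda d: d["weight_kg"] / d["height_m"] ** 2)
registry.register("bmi", 2, lambda d: round(d["weight_kg"] / d["height_m"] ** 2, 1))

result, used = registry.invoke("bmi", {"weight_kg": 72, "height_m": 1.70})
print(used, result)   # 2 24.9  (latest version)
print(registry.invoke("bmi", {"weight_kg": 72, "height_m": 1.70}, version=1))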
5. Deploying expert modules on the Grid
The notion of expert modules is important for communication with biomedical experts, for designing the biomedical system and for specifying contracts for the implementation of such modules. However, the notion of a module does not inherently integrate into existing Grid technologies (e.g. the GT4-based Grid). Therefore, a simple mapping between the knowledge-oriented expert modules and service-oriented Grids must be defined. It is important to emphasize that this mapping is not naturally a wrapping mechanism; instead, it is a specification which proposes design guidelines for implementing expert modules as software entities. Today, a service-oriented architecture seems to be the prevalent and preferred approach to designing distributed systems, mainly thanks to the loose coupling between the specification and implementation of functionality, encapsulating it as a service. We present a methodology for implementing expert modules as services, integrating them seamlessly into a service-oriented architecture. Our proposed architecture (SEAGRIN, see [6]) is based on Web Services and WSRF [9], but WSRF compliance is neither required by nor relevant to expert modules. For a production-class implementation of the architecture, compliance with some interoperability standard (such as WS-I [7] or WS-I+ [8]) would suffice.
5.1. Interactions in biomedical grid
One of the key questions related to expert modules was whether and how to implement more complex structures with them, for instance whether it is possible to compose an expert module from smaller expert modules. For clarity and simplicity of the design, we have decided that module composition and interaction will not be taken into account on the level of expert modules; rather, they will be implemented on the service level. By abstracting from these issues, the concept of an expert module is kept simple and easy to adopt. However, once the modules are deployed to the biomedical grid as services, they may interact with other modules and be composed into more complex structures, taking their specifications into account. Moreover, the deployed modules may be shared by all users of the biomedical grid (the usual security considerations still apply). For a use case, see Fig. 1. The deployed modules may be composed into complex workflows (refer to section 2.1). These workflows are exposed as grid services and are transparent to their potential users. For a technical description of workflows, refer to [6].
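Purely as an illustration of exposing a module behind a service interface while keeping the module itself a plain function, the sketch below uses HTTP/JSON from the Python standard library to stay self-contained. The MediGrid prototype itself targets Web Services (Apache Axis, optionally WSRF); the URL path and payload fields here are invented.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def bmi_module(data):
    # The expert module stays a plain function; the service layer is a
    # separate, replaceable wrapper (the 'mapping, not wrapping' idea).
    return {"bmi": round(data["weight_kg"] / data["height_m"] ** 2, 1)}

class ModuleHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/modules/bmi/v1":   # invented, version-qualified path
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        inputs = json.loads(self.rfile.read(length))
        body = json.dumps(bmi_module(inputs)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Blocks serving requests; POST {"weight_kg": 72, "height_m": 1.7}
    # to http://localhost:8080/modules/bmi/v1 returns {"bmi": 24.9}.
    HTTPServer(("localhost", 8080), ModuleHandler).serve_forever()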
Figure 1. Interaction in Biomedical Grid (Use Case)
6. Practical experience
Currently, we are experimenting with building workflows of grid services deployed in Apache Axis. So far, we have met three significant problems, which we are currently trying to address:
• security,
• module/service composition,
• incompleteness of biomedical ontologies.
We are designing our architecture to be secure by default. From the technological point of view, the major issue is the lack of production-class security standards for communication with web services. Our experiments compared two distinct approaches: transport layer security and message level security. While message level security allows the implementation of more sophisticated features (such as message routing), systems implementing it have a poor performance profile, since the overhead of encrypting and/or digitally signing individual messages is significant; for more information refer to [16]. We have decided to adopt the second approach, but to further eliminate the drawback of XML digital signatures, we are developing an S/MIME based solution. Another issue, trust management in the biomedical grid, is described in section 2.2. Service composition, aggregation and workflow creation also present complex tasks. Current approaches to workflows usually provide a tool which is oriented towards a single user and uses the notion of a job (job-oriented workflows); for more information see [17] and [18]. Since we encapsulate modules as services and encourage the collaboration of experts, service-oriented collaborative workflows are a natural requirement. For information on the infrastructure design which supports such workflows, refer to [6] and [5]. For semantic annotations of data, we planned to use CUIs (Concept Unique Identifiers) from the established biomedical ontological system UMLS. However, we did not
find CUIs for the concepts we actually needed, like "the amount of fluids received in a dose of milk per kilogram of the recipient", so we are now investigating other solutions.
7. Conclusion
In this paper, we have presented a grid-oriented project which builds on the collaboration of experts from several diverse scientific fields: biomedicine, grid computing and philosophy. Having such broad expertise presents issues beyond technical challenges. To address them, communication, mutual understanding, wide agreement and careful analysis must be emphasised. The MediGrid project is in its second year of existence and has met the presented challenges successfully so far.
8. Acknowledgments
This research is supported by the research intent "Optical Network of National Research and Its New Applications" (MSM6383917201) and by the research project "MediGrid – methods and tools for GRID application in biomedicine" (Czech Academy of Sciences, grant T202090537).
References
[1] I. Blanquer et al., HealthGrid White Paper, chapter 1. http://whitepaper.healthgrid.org/
[2] I. Foster, "What is the Grid? A Three Point Checklist". GRIDToday, July 20, 2002.
[3] J. Cardoso and A. Sheth, "Introduction to Semantic Web Services and Web Process Composition". Lecture Notes in Computer Science, Vol. 3387, Springer-Verlag, ISBN 3-540-24328-3, January 2005.
[4] M. Horridge, Protege OWL Tutorial. http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf
[5] L. Bocchi, O. Krajíček and M. Kuba, "Infrastructure for Adaptive Workflows in Semantic Grids". Proceedings of the First CoreGRID Integration Workshop, pp. 327–336, University of Pisa, Pisa, Italy, 2005.
[6] M. Kuba, O. Krajíček, P. Lesný and T. Holeček, "Semantic Grid Infrastructure for Applications in Biomedicine". DATAKON 2005 – Proceedings of the Annual Database Conference, pp. 335–344, Brno, Czech Republic, 2005. ISBN 80-210-3813-6.
[7] K. Ballinger et al., WS-I Basic Profile Version 1.1 (Final Material). http://www.ws-i.org/Profiles/BasicProfile-1.1.html
[8] M. Atkinson et al., Web Service Grids: An Evolutionary Approach. http://www.nesc.ac.uk/technical_papers/UKeS-2004-05.pdf
[9] T. Banks, Web Services Resource Framework (WSRF) – Primer. http://docs.oasis-open.org/wsrf/wsrf-primer-1.2-primer-cd-01.pdf
[10] ICD. http://www.who.int/classifications/icd/
[11] SNOMED-CT. http://www.snomed.org
[12] GO. http://www.geneontology.org
[13] UMLS. http://www.nlm.nih.gov/research/umls/
[14] MeSH. http://www.nlm.nih.gov/mesh/
[15] E. Husserl, Logical Investigations, Investigation I (translated by J.N. Findlay). Routledge, London, 2001, § 2, 184. Orig. Logische Untersuchungen, M. Niemeyer, 1900/1901.
[16] S. Shirasuna et al., "Performance Comparison of Security Mechanisms for Grid Services". Fifth IEEE/ACM International Workshop on Grid Computing (associated with Supercomputing 2004), Pittsburgh, PA, 2004. ISBN 0-7695-2256-4, ISSN 1550-5510.
[17] Triana. http://www.trianacode.org
[18] P-GRADE. http://www.lpds.sztaki.hu/pgrade/
Mobile Peer-To-Grid Architecture for Paramedical Emergency Operations
Andrew Harrison a, Ian Kelley a,b, Emil Mieilica c, Adina Riposan c,d,1 and Ian Taylor a,b
a School of Computer Science, Cardiff, UK
b Center for Computation and Technology, LSU, USA
c Contact Net Ltd, Bucharest, Romania
d Department of Applied Informatics, Military Technical Academy
1 Email: adina.riposan@contactnet.ro, Web: www.contact-net.ro
Abstract. In this paper we describe a distributed architecture that could be used to link emergency medical centres, hospitals, telephone operators, and ambulances into a hybrid Peer-to-Peer (P2P) and Grid system for the sharing of information and the transport of data. Distributed computing techniques can be used to connect static and mobile systems, bringing the different tools, expertise and databases together to aggregate patient data "on-the-fly" and then integrate it into a situation- and context-specific patient-centred virtual environment. The scenario presented in this paper encapsulates connecting mobile tools and medical devices from ambulances, enabling data transfer to medical centres, and aggregating patient data from numerous sources. The proposed P2G (Peer-to-Grid) framework consolidates Peer-to-Peer and Grid computing research by addressing the mobility of transiently connected devices while supporting interactive configurability of components for dynamic data-driven distributed paramedical scenarios.
Keywords. Grid, P2P, WSRF, Mobile, WSPeer, Medical, Paramedical, Emergency
1. Introduction
Recent advances in Grid [3][14] and Peer-to-Peer (P2P) technologies [10][11] now enable the research community to begin exploring how these systems could be adapted towards improving the current operating procedures of common yet difficult problems that persist in our day-to-day lives. The health care field is one such area which, due to the distributed nature of its hospitals, emergency centres and ambulances, combined with its high security requirements regarding sensitive patient data, matches well with what could be accomplished using a hybrid Grid/P2P infrastructure. By combining Grid and P2P systems [5], an intelligent and context-aware distributed architecture can be developed that would dramatically increase the amount of information available to paramedical teams as they respond to emergencies, as well as improve information exchange with hospitals, emergency dispatch centres, and doctors. This information would range from patient data such as allergies and current medical
conditions to infrastructure-related decisions such as traffic advisories and the relative proximity of hospitals with adequate medical facilities. By forming an effective and well-administered ambulance and fleet-management system that interconnects with medical centres, an emergency institution's capacity to respond to crises in a timely manner is greatly enhanced. In addition, once such a system is in place, it facilitates interoperation and communication with other government entities such as the fire brigade, police force, and public defense institutions. As Grid computing has evolved, the field has moved towards service-based architectures, similar to web services [15], in the form of specifications such as the Open Grid Services Architecture (OGSA) [4]. With the move towards standards that promote service interoperability, like WS-RF [3][9], it is now becoming possible to apply common techniques to different application and technology domains to create hybrid systems. In this paper, we present how such a hybrid framework of Grid and P2P, which we call P2G (Peer-to-Grid), could be used to address the issues of a real paramedical emergency scenario. The move towards P2G represents a shift in paradigm from two perspectives: it addresses the seamless integration of mobile and static distributed resources, incorporating solutions to enable true mobility and transient connectability of such devices; and it supports the run-time integration of components and data through service orientation and dynamic discovery. This paper is organized as follows: Section 2 introduces the medical support system; Section 3 describes ambulance fleet-management; Section 4 outlines the overall system architecture; Section 5 defines the security implications; Section 6 concludes the paper.
2. Paramedical Emergency Support
In this section we introduce the medical infrastructure involved in P2G emergency operations. To show how these infrastructure components could work together, a scenario is presented which details the step-by-step process that would take place from the moment an emergency call is placed until a patient arrives at an emergency centre.
2.1. The Medical Infrastructure:
Here, we describe the different types of participants and core facilities which are available within the scope of our scenario. These include static entities, such as the emergency control centre and the medical units, as well as mobile entities, such as the ambulances and portable medical devices, which need to communicate patient information on-the-fly when necessary. The static entities we have identified are the following:
• The Medical Emergency Control Centre is a paramedical control centre that acts as the management and monitoring authority for emergency operations.
• The Medical Data Resource Centres are the medical units or hospitals that record patient data (keeping health records databases). These centres provide patient data to the infrastructure during an emergency if health records or other patient data are necessary and available for discovery.
• The Medical Emergency Units are the medical units or hospitals that are accepted and acknowledged as emergency medical institutions to which emergency patients can be transported. The selection of a Medical Emergency Unit for a
particular situation is based upon factors such as proximity to the accident, available space, medical expertise and the specific health requirements of the patient.
The Medical Data Resource Centres and the Medical Emergency Units may or may not be the same during a particular paramedical emergency operation. Medical units are assigned initial roles in the infrastructure, but it is expected that many will be able to function as both patient data resource providers and emergency units, making their role in a particular situation case-specific. The Emergency Control Centre itself may play either or both of these roles. Beyond providing basic access to medical data and patient health records, virtually all the medical units can play the role of knowledge providers or resources, giving access to expert opinions and linking to more advanced situation analysis tools. Emergency operations in these scenarios are highly dependent on ensuring that the required information is delivered in real time and supported by expert assessments, with the assistance of teams from the Medical Emergency Control Centre as well as other participating medical units. Therefore, the static entities in the emergency infrastructure ensure adequate management and monitoring functions, as well as providing a data and knowledge access point to mobile entities. The mobile entities within this scenario include:
• The Fleet of Ambulances, managed and monitored by the Emergency Control Centre to maintain both mobility and automatic navigation in emergency operations, as well as communication and data transfer among the different participants within the infrastructure and data transmission between the paramedical mobile devices and the virtual emergency environment.
• Medical Devices, the various types of mobile devices used in ambulances that can potentially record real-time data from patients. Such medical tools and equipment can be used for ad hoc medical data acquisition, initial patient diagnosis and monitoring of the patient's current medical condition during transit to a hospital facility. Specialised software and tools are used for the acquisition and quantification of the patient's vital medical data, using sensors to monitor body functions and parameters; these data are subsequently transmitted to other entities involved in the emergency operation.
• Personnel Devices, mobile devices used to communicate with participating individuals such as paramedics and ambulance drivers.
2.2. A Paramedical Scenario:
In this section we present a scenario of how the interactions between the medical infrastructure entities are managed within real-world situations that integrate the static and mobile entities to potentially automate the medical planning process in an emergency.
2.2.1. Initiation and Monitoring of the Emergency Operation
When an emergency call is received by an Emergency Control Centre operator, he/she performs a preliminary analysis and makes the decision to initiate an emergency operation. A Virtual Emergency Environment (VEE) [1] is created to dynamically integrate the appropriate patient-related data from both the medical units that are currently treating and monitoring the patient (static data resources) and the medical diagnosis tools in the
Ambulances are selected based upon proximity to the emergency and dispatched to the site, with the mobile-grid fleet management system providing directions and automating the necessary navigation-specific functions, thereby offering a more reliable response service over large geographic areas and in high-risk locations (see Section 3). In more complex rescue operations, this system would be able to interface with other intervention units such as the fire brigade or police force.

During the initial patient analysis, raw data is acquired at the scene by the paramedical team and transmitted to the Emergency Control Centre, which monitors operations and maintains communication with the ambulance's paramedical team. In some cases, the emergency operation is supervised and assisted by experts during the initial examination, diagnosis and transportation to the medical emergency unit. This could involve specialists from several hospitals, requiring the interactive communication and data transmission among all the actors and knowledge sources to be integrated within the VEE. By weighing factors such as the initial diagnosis, the distance to travel and available bed space, an appropriate hospital is chosen through a search of available resources. The casualty unit in the selected hospital is notified and given access to the patient's medical records.

2.2.2. On-site Emergency Management

We envisage an emergency scenario for complex and high-risk situations that requires several ambulances to conduct the rescue operation, potentially over large areas and in high-risk environments. For such cases, one must take into account not only the permanent communication channels with the Emergency Control and Data Centres, but also local communication between on-site entities involved in the operation, such as the fire brigade and police force. In the ambulance environment, a set of mobile devices (medical devices and personnel devices) communicate with each other and exchange data locally during the rescue operation. For example, the personnel mobile devices receive notices and automatic messages from the medical tools and sensor devices that continuously monitor the patient's state and perform ad hoc acquisition of vital data. A controller node is situated in the ambulance and acts as a gateway to the ambulance environment, handling data transfer to other ambulances participating in the rescue operation and to the Emergency Control Centre. The controller node could either be integrated into the ambulance itself or added to the vehicle as a separate device (see Section 3). Mobile devices need not communicate via the controller node within the ambulance environment; however, we consider the controller node necessary for communication with other ambulance environments and with the Emergency Control Centre. One important reason is the security and data privacy requirement for patient data that is published into and received from the Virtual Emergency Environment. Although trust relationships may exist between the mobile devices participating in the local ambulance environment, outside communication and data exchange with the VEE carry additional authentication and authorization requirements.
This additional security layer would be handled by the controller node, using the local authentication and remote authorization system of the ambulance (see Section 5), in connection with the communication interface of the vehicle (see Section 3).
The ambulance itself plays an important role in maintaining mobility, coordination and automatic navigation in emergency operations. The fleet management system automatically ensures direct communication between the ambulance's controller node and the Emergency Control Centre, as well as coordinating on-site communication with the other ambulances and rescue participants (fire brigade and police force) (see Section 3). To achieve fault tolerance in the case where a controller node fails or becomes unavailable, we envisage the capability to transfer the coordination and management functions automatically to another mobile entity. In this event, the responsibility for publishing data into, and receiving support from, the VEE would be transferred to other mobile devices in the ambulance environment, such as personnel devices with the required technical capabilities.

2.2.3. Data and Knowledge Access

Within the paramedical emergency infrastructure, data and knowledge resources can be highly distributed. These resources need to be combined into an emergency-centric, Grid-based virtual environment in order to ensure patient-centred rescue operations and medical care. Patient medical data resides in health-record repositories hosted at the different hospitals and medical units where patients are treated and monitored. Mobile units such as ambulances need to discover the Medical Data Resource Centres and locate the available health records using whatever information about the patient is available, in order to integrate the required data and create the patient-centred VEE. Data and knowledge resource discovery can be achieved by performing a decentralised search across the P2G network super-peers, or hubs, thus maintaining both the dynamism and the robustness of the network. The identifying information available can range from simple identification, e.g. a driving licence, to a patient's Electronic Health Smartcard or Health Insurance Smartcard, if available. Such cards can identify the medical units holding health records for the patient in question, thereby narrowing the distributed search. They also provide immediate access to vital data and the medical history of the patient, which can easily be integrated in the patient's virtual emergency space to support the decision-making process. Patient identification through electronic health or health insurance smartcards can be achieved using local authentication to the ambulance, facilitated by the communication interface in the vehicle, and remote authorisation using GSI (see Section 5).
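To make the discovery step concrete, the following minimal sketch shows how a super-peer might resolve a patient identifier to the Medical Data Resource Centres holding that patient's records: it answers from its own group index where possible and otherwise floods the query to neighbouring super-peers with a bounded hop count. All class and method names are hypothetical illustrations, not part of WSPeer, P2PS or the actual P2G implementation.

    // Hypothetical sketch of decentralised record discovery across P2G super-peers.
    import java.util.*;

    class RecordLocation {
        final String dataCentreId;   // a Medical Data Resource Centre holding the records
        RecordLocation(String dataCentreId) { this.dataCentreId = dataCentreId; }
    }

    class SuperPeer {
        private final Map<String, RecordLocation> localIndex = new HashMap<>();
        private final List<SuperPeer> neighbours = new ArrayList<>();

        void register(String patientId, RecordLocation loc) { localIndex.put(patientId, loc); }
        void addNeighbour(SuperPeer peer) { neighbours.add(peer); }

        // Answer from the local group if possible, otherwise forward with a time-to-live.
        List<RecordLocation> discover(String patientId, int ttl, Set<SuperPeer> visited) {
            List<RecordLocation> results = new ArrayList<>();
            visited.add(this);
            RecordLocation local = localIndex.get(patientId);
            if (local != null) results.add(local);
            if (ttl > 0) {
                for (SuperPeer n : neighbours) {
                    if (!visited.contains(n)) {
                        results.addAll(n.discover(patientId, ttl - 1, visited));
                    }
                }
            }
            return results;
        }
    }

In a deployment the recursion would be an asynchronous message exchange rather than direct method calls, and the patient identifier would come from a driving licence or the health smartcard mentioned above; the smartcard's list of record-keeping units would simply narrow the query's target set.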
2.2.4. Creating the Virtual Emergency Environment

At the Emergency Control Centre, a VEE can be constructed automatically. This is where all information needed for the optimum care of the patient is made available (as described in [1]). In our case, such information can be accessed through the standard searching mechanisms offered within the P2G architecture. A first responder (e.g., an ambulance driver) can be granted access rights for the VEE and can communicate with other participants using standard voice over IP (VoIP) or even video technology.

Specialised software for the acquisition and quantification of a patient's vital medical data, together with monitoring sensors for body functions and parameters, can be used as initial diagnosis tools in the ambulance. Extending this with specialised mobile radiology tools, such as mobile MRI or CT, to perform radiological examinations would increase the usefulness of this scenario even further. The patient's current condition during transit to the hospital can be communicated to the VEE. The sensors, and the software operating them, gather the patient's medical and physiological parameters together with biometric and environmental data. This setup allows unobtrusive monitoring of the electrocardiogram (ECG), heart and respiratory rates, blood pressure, blood glucose level, oxygen saturation, skin temperature, etc. Sensors can be integrated via base stations and their data pre-processed locally. Filtered signals and data are transmitted instantly to the Emergency Control Centre and integrated within the VEE. All medical and physiological measurements thus become available via the gateway. Further, as described in [1], a Virtual Emergency Health Record (VEHR) could be created.

3. Ambulance Fleet Management

The fleet management application for the effective automatic administration of the fleet of ambulances can be exposed to end users through a dedicated fleet-management portlet integrated into a Web-based portal. This portal can be customised by any organisation hosting the Medical Emergency Control Centre to administer the services provided by the proposed infrastructure, and it facilitates sharing, distribution, adaptation and evolution beyond the state of the art in fleet management. The system allows the vehicular mobile Grid nodes to monitor their environment, record changes and react intelligently in a dynamically changing environment. A number of proprietary systems have attempted to tackle fleet management, but they have largely ignored the following important areas:
• Integration with onboard equipment and the in-vehicle environment: a pool of information about the internal state of the vehicles is available and can be used for better decision making, both for the driver and for the manager or administrator of the fleet of ambulances. Mobile agents can constantly monitor and record this data and take decisions or generate reports based on these recordings. Further integration with onboard systems, such as the sound system, will provide the basis for a better, more secure way to interact with the driver.
• Interoperability with other systems: new core services and capabilities can take full advantage of the benefits the Grid has to offer: interoperability, dynamic use and discovery of resources, ubiquity, composability of services, security, notification, collaborative working, data handling and remote visualisation.
• Availability over a wider geographical area: intelligent mobile agents can adapt to the changing environment, including variations in the availability and Quality of Service (QoS) of accessible mobile data carriers. Such agents can choose the appropriate means of communication by analysing available bandwidth, predicted latency, cost, etc. (see the sketch after this list).
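To illustrate the last bullet, the sketch below shows one way an agent could rank the currently reachable carriers. The scoring weights are purely illustrative; the paper does not prescribe a selection function, and all names are hypothetical.

    // Hypothetical sketch of an agent choosing among mobile data carriers.
    import java.util.*;

    class Carrier {
        final String name;
        final double bandwidthMbps, latencyMs, costPerMb;
        Carrier(String name, double bandwidthMbps, double latencyMs, double costPerMb) {
            this.name = name;
            this.bandwidthMbps = bandwidthMbps;
            this.latencyMs = latencyMs;
            this.costPerMb = costPerMb;
        }
    }

    class CarrierSelector {
        // Higher score is better: reward bandwidth, penalise latency and cost.
        static double score(Carrier c) {
            return c.bandwidthMbps - 0.05 * c.latencyMs - 2.0 * c.costPerMb;
        }

        static Carrier best(List<Carrier> available) {
            return Collections.max(available, Comparator.comparingDouble(CarrierSelector::score));
        }

        public static void main(String[] args) {
            List<Carrier> reachable = Arrays.asList(
                new Carrier("GPRS", 0.1, 500, 0.5),
                new Carrier("UMTS", 2.0, 150, 0.8),
                new Carrier("Satellite", 1.0, 600, 3.0));
            System.out.println("Selected carrier: " + best(reachable).name);
        }
    }

Such a score could be re-evaluated whenever the QoS of a carrier changes, allowing the agent to switch channels as the ambulance moves.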
Figure 1. The Mobile-To-Grid Architecture: multiple mobile devices (Mobile Grid) connect through a Bridge Layer to hospitals and patient record databases (Static Grid), together forming the Virtual Emergency Environment.
4. System Architecture

The scenario described in Section 2 poses specific challenges for a system architecture. In particular, the integration of static, computationally powerful infrastructures with lightweight, mobile and potentially transient devices must be addressed. A core requirement of the system is that it handles the limitations and heterogeneity of the mobile network and negotiates between this network and the static Grid. Negotiation responsibilities should include translating between the static Grid and the mobile network (for example, between security contexts) and maintaining connectivity and location transparency for mobile nodes. Furthermore, mobile devices require a network infrastructure that allows them to discover other nodes and relay data and messages to them. This infrastructure must support node unreliability, that is, handle changes in transport and dynamic network path configuration.

To enable the integration between static and mobile Grids, we believe two important concepts should be adhered to: first, that the system should be loosely coupled, and second, that this loose coupling should be supported by a common messaging framework. We believe this can be achieved by designing the system around the Open Grid Services Architecture (OGSA) and its current incarnation, the Web Services Resource Framework (WS-RF) [3]. WS-RF allows systems to combine service orientation with standard SOAP message exchange patterns. In particular, WS-RF allows the de-coupling of resources exposed via services (called WS-Resources in WS-RF parlance) from the service that exposes them. In a resource-intensive environment such as the one described, this is essential to maintaining a loosely coupled system.

4.1. Mobile P2P Grids

The requirement of negotiating between networks can be addressed through a three-layered architecture which connects the static Grid with the mobile network via a bridging layer. This bridging layer needs to be able to face two ways: it must be capable of handling the security, service and transport infrastructure both of the static Grid and of the mobile network. Figure 1 depicts this three-layer design. In this architecture, mobile nodes communicate directly with the bridge layer, which in turn communicates with static Grid entities such as hospitals and patient record databases.
Figure 2. The Peer-To-Grid Architecture: multiple ambulance environments (Mobile Grid) connect through a Bridge Layer to hospitals and patient record databases (Static Grid), together forming the Virtual Emergency Environment.
Here the bridge layer acts as an aggregation service for the various devices in the mobile Grid. This architecture does not, however, address certain requirements of the scenario. In particular, it does not allow local discovery and decision making. For example, if patient data is required on site, it must be retrieved from the bridge layer by each node that requires it. This in turn hinders scalability: because all communication must go via the bridge layer, the bridge is in danger of becoming a bottleneck.

The requirement for mobile nodes to communicate directly with one another, as well as the need for the on-site communication model to scale to any number of ambulances and other mobile emergency teams such as the fire brigade, calls for an approach in which nodes act as both service providers and consumers and can discover one another in a decentralized manner, that is, a Peer-To-Peer (P2P) system. Furthermore, the structure of ambulance environments and the security requirement for controlling access to the bridge layer from within these environments suggest a centralized/decentralized system such as a super-peer architecture. Other projects that address Grid and mobility do not directly consider this topology issue. The Mobile OGSI.NET [16] system does not address discovery or topology at all, concentrating solely on exposing mobile devices in an OGSA-compliant manner. Similarly, the Akogrimo [1] project delegates discovery and topology to a variety of protocols. We consider the introduction of a super-peer architecture to handle controlled yet scalable interactions between diverse nodes a significant development in mobile-to-static Grid infrastructures. Figure 2 depicts the design we call Peer-To-Grid (P2G), which extends the architecture depicted in Figure 1 by enabling groups of peers to interact directly with one another within their group (the ambulance environment in our case) and enabling these groups to interact with each other and with the bridge layer via a super peer (the ambulance controller node in our case).

The implementation design of P2G is based on the Web services framework WSPeer [8]. WSPeer is an API focused on the cross-fertilisation of P2P and Web/Grid services. As well as supporting simple Web service interactions, it is capable of interacting with Grid middleware using WS-RF and related specifications.
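The division of labour between a super peer and its group can be sketched in a few lines. In this hypothetical illustration the ambulance controller node delivers intra-group messages directly within the trusted ambulance environment and hands all external traffic to the bridge layer, where the additional security checks of Section 5 would apply.

    // Hypothetical sketch of super-peer routing in an ambulance environment.
    import java.util.*;

    interface Endpoint {
        void deliver(String message);
    }

    class ControllerNode {
        private final Map<String, Endpoint> groupMembers = new HashMap<>(); // devices in this ambulance
        private final Endpoint bridge; // gateway towards the static Grid and other groups

        ControllerNode(Endpoint bridge) { this.bridge = bridge; }

        void join(String logicalAddress, Endpoint device) { groupMembers.put(logicalAddress, device); }

        void route(String destination, String message) {
            Endpoint local = groupMembers.get(destination);
            if (local != null) {
                local.deliver(message); // trusted local exchange, no extra authentication
            } else {
                bridge.deliver(destination + "|" + message); // external traffic is secured by the bridge
            }
        }
    }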
It allows easy integration of different protocols underneath the Web services technology stack and currently supports, among others, P2PS [11], a lightweight, domain-independent framework capable of advertisement, discovery and communication within ad hoc P2P networks. Like JXTA, P2PS uses the 'pipe' abstraction for defining communication channels. Pipes can traverse multiple transports and contain intermediary nodes. P2PS also uses logical addressing, with endpoint resolver services translating logical addresses into network-specific addresses. This is particularly useful in handling host mobility and migration, as it allows the logical address to remain constant while the underlying location or transport changes. A subset of WSPeer functionality is currently being developed which allows mobile devices to communicate with each other using Web service and WS-RF messages over HTTP and P2PS. Using WSPeer allows us to implement the core functionality of system nodes as described in the following sections, in particular the ability to bridge different protocols and topologies and to function on resource-constrained devices while maintaining a standards-compliant service interface.

4.2. Bridge Node Architecture

Structurally, we expect the bridge layer to be static and stable, much as the Grid layer is. The nodes that inhabit the bridge layer create a view of the mobile network that the static Grid can understand, and a view of the static Grid that the P2P network can understand. Internally, the bridge nodes must map between these views. We consider a proxy design pattern the most appropriate. Mobile nodes and groups are presented to the Grid via this proxy, allowing the Grid to treat them as ordinary Grid entities. This architecture requires certain properties of the proxy to enable bridging between networks. Firstly, the proxy should possess agent-like autonomy, as it may have to make decisions or send messages on behalf of a mobile node if that node is unavailable and immediate action is required; this may require caching status information about the node or group in question. Secondly, to facilitate connectivity transparency, the proxy must be able to store messages received from the static Grid and pass them on to mobile nodes when they become available again. This capability also allows optimization of message transfer, enabling messages, for example notifications, to be aggregated into a single message. Finally, the proxy should perform translation between contexts, for example translating from a group-based security infrastructure in the P2P network to a user-based infrastructure in the static Grid.
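The store-and-forward behaviour described above can be sketched as follows. The types are hypothetical; in a WS-RF deployment the queued items would be SOAP messages or notifications rather than plain strings.

    // Hypothetical sketch of a bridge-node proxy caching messages for an offline node.
    import java.util.*;

    class NodeProxy {
        interface NodeLink { void send(String payload); } // channel to the mobile node

        private final Deque<String> pending = new ArrayDeque<>();
        private final NodeLink link;
        private boolean nodeOnline = false;

        NodeProxy(NodeLink link) { this.link = link; }

        void onMessageFromGrid(String message) {
            if (nodeOnline) {
                link.send(message);
            } else {
                pending.add(message); // cache while the node is unreachable
            }
        }

        void onNodeReconnect() {
            nodeOnline = true;
            if (!pending.isEmpty()) {
                link.send(String.join("\n", pending)); // aggregate queued notifications into one message
                pending.clear();
            }
        }

        void onNodeDisconnect() { nodeOnline = false; }
    }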
Figure 3. An overview of a mobile node: an OGSA Service Layer on top of SOAP, Data Channel, Migration, Device and P2P modules, a Protocol module (HTTP, SIP, etc.) and a Transport module (TCP/IP, UDP, Bluetooth, etc.) connecting to the networks.
4.3. Mobile Node Architecture

Based on the requirements of the scenario and the capabilities of WSPeer, we have designed a mobile node architecture, depicted in Figure 3. A mobile node is made up of a number of modules that communicate with one another to provide the functionality required by the scenario:
1. The Service Layer defines the interface to the mobile node, that is, the protocols and messages understood by the node. This is implemented as an OGSA (currently WS-RF) service interface. The services exposed by a node are of three types. Two types are shared by all nodes: services which handle the migration of the service to another node, and services which handle the establishment of data channels. The third type are node-specific services that depend on the nature and function of the node in the network.
2. The SOAP module parses and generates SOAP messages.
3. The Data Channel module accepts and generates mostly binary data flows, such as images or audio/visual streams. How connections to this module are made is defined at the OGSA service level. A typical data channel might be described as accepting (or expecting) a certain MIME type, whether the data is to be streamed, and where, if anywhere, the data should be passed on to (see the sketch after this list). The ability to set the input and output sources for the channel will enable the construction of workflows in which data is streamed directly from node to node without passing through a controller node.
4. The Migration module handles the migration of the service layer and any state or data (WS-Resources) hosted by the node.
5. The P2P module provides information to the service modules about the network neighbourhood and offers capabilities for discovering other mobile nodes and resolving logical addresses. These capabilities are exposed by the service layer as network-compatible services.
6. The Device module is capable of introspecting the device and returning relevant information required by the other modules. For example, the location of the device may need to be exposed as an OGSA service, but may also be required by the P2P module to decide on the best transport for connecting to another node. The device module is also capable of describing the device as a WS-Resource, which is exposed via the OGSA service interface.
7. The Protocol module handles different transfer protocols such as HTTP or the Session Initiation Protocol (SIP). Protocols may be chosen directly by the P2P layer or passed to it from the service layer or the data channel module.
8. The Transport module handles the details of the network transport chosen by the P2P layer in consultation with the device module.
5. Security Infrastructure

The distributed system described here can be divided into two fundamental parts: a static set of Grid resources, and multiple dynamic resource sets created in an ad hoc manner, as needed, using secure P2P protocols [2] and services. This meshing of Grid and P2P systems requires that traditional security infrastructures be enhanced to support both environments in a seamless manner. This could be done either by extending current security mechanisms to support both systems or by using a "bridge" to link the networks together without modifying their native security infrastructures [12][7]. Regardless of the final implementation, the central concerns will be the same: how to ensure correct access rights and the integrity of data. It is therefore fundamental that when such systems are designed, and ultimately deployed, security policies are established that can verify the identity of users and services, protect communications, and make intelligent decisions about what access rights a particular entity is allowed within the network.

The dynamic nature of fleet operations requires that users be able to create services "on the fly" without administrator intervention. These services must be coordinated and able to interact securely with other partner services. The security infrastructure involved in these operations must therefore be able to adapt to new services and users, be configurable for multiple VOs, and provide rich group membership information. In addition, the nature of the mobile fleet requires a security infrastructure that supports disconnected operation. Local authentication and remote authorization must be decoupled, allowing nodes to appear on and disappear from the network while still maintaining a session between connections. This way, a mobile node can operate offline, storing secured tokens that are then reused for update and synchronization operations when the node is online again, without the need for classical authentication to the server.

Due to the mobility, time constraints and stress factors involved in emergency operations, alternative end-user verification methods are required that go beyond traditional input mechanisms such as keyboards, which are error-prone and require a high level of user interaction. To address these issues, more portable and user-friendly technologies such as smartcards or biometrics are needed. In this scenario, certificate files provided by a smartcard device would be used to authorize local operations and, in connection with the communication interface of the vehicle, could be used to authorize the user to remote Grid infrastructure and services, allowing them to publish or receive vital medical information.
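The decoupling of local authentication from remote authorization can be sketched as follows. The token handling here is deliberately simplified and hypothetical; a real deployment would use proper GSI credentials and cryptographic signatures rather than string concatenation.

    // Hypothetical sketch of disconnected operation with cached security tokens.
    import java.util.*;

    class OfflineSession {
        interface RemoteAuthorizer { void submit(String signedUpdate); }

        private final String sessionToken; // obtained while online, e.g. a short-lived credential
        private final List<String> queuedUpdates = new ArrayList<>();

        OfflineSession(String sessionToken) { this.sessionToken = sessionToken; }

        // While offline: bind each update to the cached credential and queue it.
        void recordUpdate(String update) {
            queuedUpdates.add(sessionToken + ":" + update);
        }

        // When connectivity returns: replay the updates for server-side authorization,
        // without a fresh classical authentication to the server.
        void synchronise(RemoteAuthorizer server) {
            for (String update : queuedUpdates) server.submit(update);
            queuedUpdates.clear();
        }
    }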
6. Conclusion

In this paper we presented a combined Grid and Peer-to-Peer architecture for dynamic paramedical scenarios that involve connecting and communicating on the fly across a collection of distributed resources. The resources we described in this scenario range from static entities, such as the emergency control centre and the medical units, to mobile entities, such as the ambulances, which need to communicate patient information on the fly when necessary. We described the real-world interactions between these entities and then presented an infrastructure, called P2G, which addresses the various problems that such a scenario presents.
References
[1] Akogrimo. Project website. Jan 25, 2006. See http://www.mobilegrids.org/
[2] M. Castro, P. Druschel, A. Ganesh, A. Rowstron and D.S. Wallach. "Secure routing for structured peer-to-peer overlay networks." Fifth Symposium on Operating Systems Design and Implementation. 2002.
[3] K. Czajkowski, D.F. Ferguson, I. Foster, J. Frey, S. Graham, I. Sedukhin, D. Snelling, S. Tuecke, W. Vambenepe. "The WS-Resource Framework." March 5, 2004.
[4] I. Foster, C. Kesselman, J. Nick, S. Tuecke. "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration." Open Grid Service Infrastructure WG, Global Grid Forum. June 22, 2002.
[5] I. Foster and A. Iamnitchi. "On death, taxes, and the convergence of peer-to-peer and grid computing." In 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03). 2003.
[6] G. Fox, D. Gannon, and M. Thomas. "A Summary of Grid Computing Environments." Concurrency and Computation: Practice and Experience (Special Issue). 2003.
[7] R. Ghanea-Hercock. "Authentication with P2P Agents." BT Technology Journal, Volume 21, Issue 4, pages 146-152. Dec 2003.
[8] A. Harrison and I. Taylor. "WSPeer – An Interface to Web Service Hosting and Invocation." In HIPS Joint Workshop on High-Performance Grid Computing and High-Level Parallel Programming Models. 2005.
[9] M. Humphrey et al. "State and Events for Web Services: A Comparison of Five WS-Resource Framework and WS-Notification Implementations." 4th IEEE International Symposium on High Performance Distributed Computing (HPDC-14). 2005.
[10] JXTA: 2005, "Project JXTA". See http://www.jxta.org
[11] I. Wang. "P2PS (Peer-to-Peer Simplified)". In Proceedings of the 13th Annual Mardi Gras Conference – Frontiers of Grid Applications and Technologies, pages 54-59. 2005.
[12] J. Novotny, S. Tuecke, V. Welch. "An Online Credential Repository for the Grid: MyProxy." 10th IEEE International Symposium on High Performance Distributed Computing. 2001.
[13] D. de Roure, M.A. Baker, N.R. Jennings and N.R. Shadbolt. "The Evolution of the Grid." In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality, pages 65-100. Wiley & Sons. 2003.
[14] SSDL – The SOAP Service Definition Language. Project website. See http://www.ssdl.org/
[15] W3C Working Group Note. Web Services Architecture. 11 February 2004. See http://www.w3.org/TR/2004/NOTE-ws-arch-20040211/
[16] D. Chu and M. Humphrey, "Mobile OGSI.NET: Grid Computing on Mobile Devices", in 5th IEEE/ACM International Workshop on Grid Computing – GRID2004 (associated with Supercomputing 2004). Nov 8, 2004, Pittsburgh, PA.
Virtual Hospital and Digital Medicine – Why is the GRID needed?

Georgi GRASCHEW a, Theo A. ROELOFS a, Stefan RAKOWSKY a, Peter M. SCHLAG a, Paul HEINZLREITER b, Dieter KRANZLMÜLLER b and Jens VOLKERT b
a Surgical Research Unit OP 2000, Robert-Roessle-Klinik and Max-Delbrueck-Centrum, Charité – University Medicine Berlin, Lindenberger Weg 80, D-13125 Berlin, Germany
b GUP – Institute of Graphics and Parallel Processing, Johannes Kepler University Linz, Altenbergerstrasse 69, A-4040 Linz, Austria

Abstract. The promise of telemedicine to enable equal access to high-level medical care can only be realised through the development of virtual hospitals, digital medicine and a bridging of the digital divide between the various regions of the world. For this, the concept of the Grid should be integrated with other communication networks and platforms. A promising approach is the implementation of service-oriented architectures for an "invisible" grid, hiding complexity from both application developers and end-users. Integrated Grid Tool Suites for application programmers must therefore be developed. Enhancing the development and deployment of Grid applications in this way can bring us closer to the simplicity of use that is desired for grid infrastructures, not only within the medico-clinical field.
Keywords. Health Grid Applications, Grid Middleware, Grid Tools Suite, Telemedicine, Virtual Hospital, Digital Medicine
The EMISPHER Network for Telemedical Applications

The EMISPHER project (Euro-Mediterranean Internet-Satellite Platform for Health, medical Education and Research; http://www.emispher.org), EUMEDIS pilot project 110, co-funded by the EC under the EUMEDIS programme (http://www.eumedis.net), strand 2, sector 1, is dedicated to telemedicine, E-Health and medical E-Learning in the Euro-Mediterranean area. Telemedicine aims at equal access to medical expertise irrespective of the geographical location of the person in need. New developments in Information and Communication Technologies (ICT) have enabled the transmission of medical images in sufficiently high quality to allow a reliable diagnosis by the expert at the receiving site [28-31]. At the same time, these innovative developments in ICT over the last decade carry the risk of creating and amplifying a digital divide in the world, creating a disparity
between the northern and the southern Euro-Mediterranean area. The digital divide in the field of health care has a direct impact on the quality of citizens' daily lives. In recent years, different institutions have launched several Euro-Mediterranean telemedicine projects, all of which aimed to encourage cooperation between the European member states and the Mediterranean countries. During its implementation over the last two years, EMISPHER has deployed and put into operation a dedicated internet-satellite platform currently consisting of 10 sites in 5 MEDA countries (Casablanca, Algiers, Tunis, Cairo and Istanbul) and 5 EU countries (Palermo, Athens, Nicosia, Clermont-Ferrand and Berlin). The EMISPHER network hosts three key applications:
• Medical e-learning: the EMISPHER Virtual Medical University, with courses for undergraduates, graduates, young medical professionals, etc., in real-time and asynchronous modes
• Real-time telemedicine: second opinion, demonstration and spread of new techniques, telementoring, etc.
• eHealth: medical assistance for tourists and expatriates [32]
Figure 1: EMISPHER Network over satellite (phase 1)
From EMISPHER towards the Deployment of a Virtual Euro-Mediterranean Hospital

The EMISPHER network serves as a basis for the development and deployment of a Virtual Hospital for the Euro-Mediterranean region. The Virtual Euro-Mediterranean Hospital aims to facilitate and accelerate the interconnection and interoperability of the various medical applications being developed by different organisations at different sites, by integrating them into a consistent set of services. Activities will include
various real-time telemedicine services to support the implementation of evidence-based medicine, with area-wide coverage including all Mediterranean countries.
Communication Infrastructures for a Virtual Hospital

The communication infrastructure of such a Virtual Euro-Mediterranean Hospital should integrate satellite-based networks like EMISPHER with suitable terrestrial channels such as GÉANT2 and EUMEDCONNECT, as well as wireless channels and capabilities for ad hoc networks. Given the character of the Virtual Euro-Mediterranean Hospital, data and computing resources are distributed over many sites. Grid infrastructures [9] therefore become a useful tool for the successful deployment of medical applications, providing medical personnel with the required information, computation and communication services [33].
Applicability of Grid Systems within the Medico-Clinical Domain

Services such as the acquisition and processing of medical images, data storage, archiving and retrieval, as well as data mining, applied especially in evidence-based medicine, are common requirements within the medical application domain. In addition, simulations and modelling for therapy planning and computer-assisted interventions, and large multi-centre epidemiological studies, are typical clinical services that will profit strongly from the development and implementation of suitable Health Grid environments. According to [10], typical applications of grid technology include:
• Distributed supercomputing
• High-throughput computing
• On-demand computing
• Data-intensive applications
• Collaborative applications
Medico-clinical applications can consequently benefit to a large extent from being executed on the Grid, since many use cases fall into the categories above. By giving access to distributed services in a wide-area network of connected institutions, a Grid-based system can integrate domain knowledge, powerful computing resources for analytical tasks and means of communication with partners and consultants in a trusted and secure system, tailored to the user's requirements. An additional advantage of Grid middleware within a medical context is the ability to share expensive clinical and scientific instruments by enabling secure remote access. Example usage scenarios are presented in [8] and [26]. Another typical application area of Grid computing within the medico-clinical domain is the simulation of the effects of a surgical intervention. Examples of such Grid-based medical simulations, developed within the scope of the Austrian Grid project (http://www.austriangrid.at), are the SEE-GRID system [4], used for the interactive planning of eye surgeries through Grid-based simulation, and the distributed simulation of blood flow in the coronary arteries [21]. A strong point for integrating eHealth services with Grid infrastructures is given
by the Grid expertise within the European research community. Currently, the largest and most important concerted effort in the European Grid field is the EGEE Grid (http://www.eu-egee.org). In phase 1, the EU-funded EGEE project consists of 71 partners, which today provide a 24/7 production grid with more than 180 sites connected world-wide, offering more than 16,000 CPUs to the EGEE grid community; the EC funding share of EGEE from 2004-2006 is about 32M€. For the second phase, currently under negotiation, EGEE is planned to grow to 92 partners with EC funding of 36M€ from 2006-2008.
Standardization and the Web Services Resource Framework

However, the current state of the art of Grids is still far from the fully functional framework envisioned by Foster and Kesselman [9], since commercial and day-to-day Grid-aware applications are still not widely available. This is mainly due to the lack of proper tools to support programmers in the design, development and implementation of Grid-based applications, as well as the lack of a complete set of standards and of specific programming suites and APIs. Standards are being developed and much work has been consolidated, but fully interoperable and secure procedures are still missing. Standardization itself enables the creation of interoperable, portable and reusable components and systems. One approach in this direction is the use of services instead of low-level routines, thus hiding the restrictions of the system from the grid application developer. The Web Services Resource Framework (WS-RF) [6] represents a Grid system architecture based on web services concepts and technologies. The current release of the Globus Toolkit [7] offers a collection of Grid services that follow the architectural principles of WS-RF. The Globus middleware also enables the development of new Grid services which provide higher-level functionality by building them on top of simpler services.
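The layering principle, building higher-level services on top of simpler ones so that applications never touch low-level routines, can be sketched generically as follows. The interfaces are hypothetical and do not correspond to the Globus API.

    // Hypothetical sketch of composing a higher-level Grid service from simpler ones.
    interface StorageService { byte[] fetch(String id); }
    interface ComputeService { byte[] process(byte[] input); }

    // A higher-level "analyse image" service that delegates to two simpler services.
    class ImageAnalysisService {
        private final StorageService storage;
        private final ComputeService compute;

        ImageAnalysisService(StorageService storage, ComputeService compute) {
            this.storage = storage;
            this.compute = compute;
        }

        byte[] analyse(String imageId) {
            byte[] image = storage.fetch(imageId); // delegate data access
            return compute.process(image);         // delegate computation
        }
    }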
Hiding the Complexity of Grid Middleware Toolkits

In the domain of grid-based tools and high-level services there has been a considerable amount of research over recent years. The main target is to hide the complexity of the underlying Grid infrastructure from the application developer by integrating higher-level tools and services for grid application development. Tools designed for developers of grid applications include GridBench [25] for benchmarking Grid applications, Santa-G [15] for network monitoring in grid environments, OCM-G [2] for online application monitoring, and MARMOT [16] for debugging grid-based MPI applications, all of which were developed within the EU Crossgrid project (http://www.eu-crossgrid.org) and used for the development and performance improvement of Crossgrid applications such as the Lattice-Boltzmann blood-flow simulator [1]. Another toolset for supporting application development for the Grid is GrADS [3], which aims at providing an application development environment that insulates the programmer from the complexity of the underlying grid infrastructure. This is especially important since the grid infrastructure is not static but changes over time.
A very important issue is enabling user-friendly grid access. The restrictions of the Grid infrastructure have to be hidden from users, enabling them to work with a specific grid-enabled application without necessarily realising that they are actually using the grid.
Figure 2: The "Invisible" Grid for Health Services: hide the Grid complexity from application developers and end-users
An example of a user GUI used to access a Grid testbed is the Migrating Desktop (MD) [18] from the Crossgrid project, which offers a framework for the integration of Grid applications. The added value of such a tool is the "windows"-like look-and-feel of the graphical user interface, which allows inexperienced users to access Grid resources easily. The MD hides the complexity of the Grid and enables access to resources through a user-friendly, interactive and intuitive graphical environment. The MD is written in Java and is therefore platform-independent; the framework is extensible to new applications via a plugin mechanism. The full power of such an integrated Grid user environment was shown within the Crossgrid project, where all applications and tools developed within the project were integrated, with the MD acting as a common user interface. The MD supports mobile users by providing a personalized Roaming Access Service which is accessible using a web browser. The flooding application [14], which was also developed within the Crossgrid project, provides another good example of a user-friendly grid application. Its main user interface besides the MD is a web portal implemented in Java, which the crisis team can use to invoke simulations predicting the course of a current flooding crisis. The selection of Grid resources for this parameter study is performed automatically by the Crossgrid broker [13].
Integrative Approach: Grid Tool Suite (GTS)

The main reason for the lack of Grid-aware applications appears to be a gap between the Grid infrastructures and their developers and operators on the one side, and the developers and end-users of Grid-based applications on the other. To bridge this gap, a user-driven approach needs to be implemented which includes all stakeholders.
A Grid Tool Suite (GTS) needs to be developed that will facilitate and enhance the development of Grid-aware applications. The architecture of the GTS needs to be service-oriented and based on the needs of both application developers and end-users. Inclusion of already existing tools that fulfil the defined requirements, as well as the development of new tools and extensions of existing ones, will guarantee building on previous achievements without compromising the strict functionality and architecture requirements. The GTS shall be validated through its actual use for the development and implementation of a selected application with high, strictly end-user-defined demands. The GTS will be validated through well-defined, iterative, cyclic feedback between the Grid developers realising the GTS, the programmers developing the applications and the end-users of these applications. The validation process aims to develop a Grid-based system for use in clinical settings, named CARDIOTOOL. It will contain major components that leverage the possibilities provided by the underlying Grid infrastructure:
• High-performance computing and powerful visualisation based on patient-specific simulation models derived from information contained in medical images
• Ubiquitous access to high-performance and user-friendly applications (automated ECG evaluation)
• Intelligent access to distributed data for e-learning and medical consultations
Using the service-oriented approach, the GTS will support day-to-day application development in a generic way.
Grid Technology Application Examples

One example of a promising application of Grid technology within the Virtual Hospital is real-time 3D visualization and manipulation for individualized treatment planning and training purposes. Experience over recent years shows that improved preoperative planning, supported by three-dimensional stereoscopic visualisation and modelling, leads to better medical results, because the therapy can be individualized and matched more closely to the patient. This has been shown especially for the construction and adjustment of implants such as hip joints, and for maxillofacial surgery, radiation planning and neurosurgical interventions. In this regard it is necessary to supply not only a three-dimensional morphological patient model but also functional interrelations such as tissue properties, function and blood circulation. Grid technology comes into consideration for enabling fast 3D visualization and interactive inspection of CT, MRT and US patient data after semi-automatic segmentation and reconstruction. Currently, using the existing View Sphere Rendering software, about 5000 views on an imaginary sphere enclosing the data cube (CT, MRT) can be displayed as a result of off-line pre-calculation of the views [34]. As a result of this pre-calculation of the view sphere, the user is able to rotate the data cube in real time (50 frames per second on each channel) using different interaction devices such as joystick, mouse, keyboard or voice control, to inspect not only the original 2D slices but also recalculated slices for any new orientation of the data set, and to navigate through the data slices inside a region of interest.
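The cited work does not specify how the roughly 5000 pre-calculated views are placed on the sphere; one common scheme for near-uniform placement is a Fibonacci lattice, sketched below with hypothetical names.

    // Hypothetical sketch: near-uniform viewpoints on a sphere via a Fibonacci lattice.
    class ViewSphere {
        static double[][] viewpoints(int n, double radius) {
            double[][] pts = new double[n][3];
            double goldenAngle = Math.PI * (3.0 - Math.sqrt(5.0));
            for (int i = 0; i < n; i++) {
                double y = 1.0 - 2.0 * (i + 0.5) / n;   // latitude sweeps from +1 to -1
                double r = Math.sqrt(1.0 - y * y);      // radius of the latitude circle
                double theta = goldenAngle * i;         // longitude advances by the golden angle
                pts[i][0] = radius * r * Math.cos(theta);
                pts[i][1] = radius * y;
                pts[i][2] = radius * r * Math.sin(theta);
            }
            return pts;
        }

        public static void main(String[] args) {
            double[][] views = viewpoints(5000, 1.0); // the ~5000 views mentioned above
            System.out.printf("first viewpoint: (%.3f, %.3f, %.3f)%n",
                    views[0][0], views[0][1], views[0][2]);
        }
    }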
Using grid resources, an on-line calculation of the view sphere would become possible, thus enhancing operation planning facilities considerably. Grid services like the Grid Visualization Kernel (GVK) [12] can also be used for 3D visualisation.

Another promising example of a Grid-based application is the MammoGrid project (http://MammoGrid.vitamib.com), which developed a Europe-wide database of mammograms to facilitate a set of first-priority healthcare applications. Key aspects here are the standardisation of mammograms, the design of an appropriate clinical workstation for the end-user, and the distribution of data, images and clinical queries across a Grid-based distributed database. Beyond these specific applications, a more generic goal was to explore the potential of the Grid to support effective collaborative work (in particular collaborative medical image analysis) between healthcare professionals located at geographically dispersed sites across Europe. In general, medical applications are expected to profit most from the Grid when they involve large amounts of image data distributed across dispersed sites, whose treatment and analysis is computing-resource intensive and/or can be improved by computer-aided routines. Potential examples are in the fields of teleradiology (see above) and telepathology (virtual microscopy).

Grid-based Medical Visualization Using GVK

The Grid Visualization Kernel (GVK) has been successfully applied for medical visualization on the Grid, as described in [24]. Within the biomedical application developed in the scope of the Crossgrid project, GVK modules are used for generating a blood-flow visualization. The visualization modules run remotely on the Grid and can be steered interactively using the Desktop Virtual Radiology Explorer GUI. The interactivity as well as the data transport within this distributed application is achieved using glogin [22], while the visualization itself is based on VTK (http://www.vtk.org) functionality. The applicability of Grid-based visualization services for stereoscopic Virtual Reality devices such as the CAVE [5] has been discussed in [17]. Another important building block for grid-based visualization services is G-Vid [20], which offers a Grid-based video service enabling the optimized transmission of a remotely rendered visualization to the user's desktop machine; this is achieved by transmitting a compressed video stream over the Grid.

For the use case described above, the rendering of multiple images from different viewpoints is required. Taking into account the typically large size of medical datasets, the rendering techniques have to be highly parallelized to exploit the performance of grid resources to the full extent. If multiple distributed grid resources are used for rendering, the input data has to be distributed onto the resources beforehand. Rendering approaches applicable to medical volume datasets can be separated into direct volume rendering approaches, such as raycasting [11] or splatting [27], and surface-fitting algorithms like isosurface extraction [19]. With direct volume rendering, the images can be generated directly from the volume data, while isosurface extraction requires an additional rendering step for image generation. This rendering step can also be executed in parallel, using an off-screen rendering library and depth-buffer merging for assembling the final images. One would typically execute the rendering step on the same resource as the isosurface extraction, thus saving the
Grid-based Medical Visualization Using GVK The Grid Visualization Kernel GVK has been successfully applied for medical visualization on the Grid as described in [24]. Within the biomedical application which was developed in the scope of the Crossgrid project GVK modules are used for generating a blood flow visualization. The visualization modules are running remotely on the Grid and can be steered interactively using the Desktop Virtual Radiology Explorer GUI. The interactivity as well as the data transport within this distributed application is achieved by using glogin [22] while the visualization itself is based on VTK7 functionality. The applicability of Grid-based visualization services for stereoscopic Virtual Reality devices such as the CAVE [5] has been discussed in [17]. Another important building block for grid-based visualization services is given by G-Vid [20] which offers a Grid-based video service enabling the optimized transmission of a remotely rendered visualization to the users desktop machine. This is achieved by transmitting a compressed video stream over the Grid. For the use case described above the rendering of multiple images using different viewpoints is required. Taking into account the typically large size of medical datasets the rendering techniques have to be highly parallelized to exploit the performance of grid resources to the full extent. If multiple distributed grid resources are used for rendering, the input data has to be distributed onto the resources beforehand. Rendering approaches which can be applied on medical volume datasets can be separated into direct volume rendering approaches such as raycasting [11] or splatting [27] and surface fitting algorithms like isosurface extraction [19]. Using direct volume rendering the images can be generated directly out of the volume data while isosurface extraction requires an additional rendering step for image generation. This rendering step can also be executed in parallel using an off-screen rendering library and depthbuffer merging for assembling the final images. One would typically execute the rendering step on the same resource as the isosurface extraction thus saving the 6 7
http://MammoGrid.vitamib.com http://www.vtk.org
302
G. Graschew et al. / Virtual Hospital and Digital Medicine – Why Is the GRID Needed?
transmission of the intermediate triangle mesh. All these techniques parallelize well and are applicable within a Grid environment, yielding significant performance benefits. In the case of raycasting, each of the parallel rendering processes or threads needs the whole input dataset for image generation, while the other approaches require only one part of the volume data per rendering thread or process. After the parallel rendering has finished, the final image has to be assembled and sent to the user's desktop machine. Besides parallelizing the rendering of a single image, invoking multiple instances of the rendering module will prove beneficial, since multiple images can be rendered concurrently. Which rendering technique proves best for a given application, such as online rendering of images to be mapped onto the view sphere, depends heavily on the type of input data as well as the types of resources available. If a set of distributed resources is available, such as a pool of workstations interconnected over a fast network, the data to be rendered can be distributed beforehand and each workstation can do the rendering for a different viewpoint. If a multiprocessor machine with shared memory is available, the input data need not be replicated and parallelized rendering techniques can be applied without incurring higher latency due to communication costs.
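The "one workstation per viewpoint" case maps naturally onto a worker pool. The following sketch distributes per-viewpoint rendering jobs across a fixed pool; Renderer is a hypothetical stand-in for the actual (e.g. VTK-based) rendering code.

    // Hypothetical sketch of concurrent per-viewpoint rendering on a worker pool.
    import java.util.*;
    import java.util.concurrent.*;

    class ParallelViewRenderer {
        interface Renderer { byte[] render(double[] viewpoint); }

        static List<byte[]> renderAll(double[][] viewpoints, Renderer renderer, int workers)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            try {
                List<Future<byte[]>> futures = new ArrayList<>();
                for (double[] vp : viewpoints) {
                    futures.add(pool.submit(() -> renderer.render(vp))); // one job per viewpoint
                }
                List<byte[]> images = new ArrayList<>();
                for (Future<byte[]> f : futures) {
                    images.add(f.get()); // assemble the final images in viewpoint order
                }
                return images;
            } finally {
                pool.shutdown();
            }
        }
    }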
Conclusions & Perspectives

For the successful deployment of the various medical and clinical services in the Virtual Euro-Mediterranean Hospital, the development and implementation of Health Grid services appears crucial. The Virtual Hospital will foster cross-Mediterranean cooperation between the leading medical centres of the participating countries by establishing a permanent medical and scientific link. Through the deployment and operation of an integrated satellite and terrestrial interactive communication platform, it will provide medical professionals in the whole Euro-Mediterranean area with access to the required quality of medical service, depending on the individual needs of each partner.

The experience of many Grid projects has shown that the development cycle of Grid applications differs from the development of cluster and HPC applications, largely due to the dynamic nature of Grids. Developers of Grid applications face new problems. Currently, the debugging and monitoring of Grid applications is a complex challenge that requires new distributed tools; most existing tools are stand-alone applications that each address one specific problem on the Grid. A short time-to-market cycle requires an integrated development tool, since this would significantly reduce the time and costs of application development. For increased success of Grid technology and further incubation of novel application areas, an integrated framework of cooperating software and middleware modules, including GUI support for Grid application users, Grid application developers, and Grid operators and maintainers, is essential. Subsequently, the implementation of this new approach might trigger a critical evaluation and adaptation or optimisation of the medical workflow and the corresponding decision-making trees.

References
[1] A.M. Artoli, A.G. Hoekstra, P.M.A. Sloot, "Simulation of a Systolic Cycle in a Realistic Artery with the Lattice Boltzmann BGK Method", International Journal of Modern Physics B, Vol. 17, No. 1/2, pp. 95-98, (January 2003).
[2] B. Balis, M. Bubak, M. Radecki, T. Szepieniec, R. Wismüller, "Application Monitoring in Crossgrid and Other Grid Projects", in: Proc. of the 2nd European AcrossGrids Conference (AxGrids 2004), Nicosia, Cyprus, LNCS 3165, Springer Verlag, (January 2004).
[3] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon, R. Wolski, "The GrADS Project: Software Support for High-Level Grid Application Development", International Journal of High Performance Computing Applications, Vol. 15, No. 4, pp. 327-344, (2001).
[4] K. Bosa, W. Schreiner, M. Buchberger, T. Kaltofen, "The Initial Version of SEE-GRID", Technical Report AG-DA1c-1-2005_v1, Research Institute for Symbolic Computation (RISC), Johannes Kepler University Linz, (March 2005).
[5] C. Cruz-Neira, D.J. Sandin, T.A. DeFanti, R.V. Kenyon, J.C. Hart, "The CAVE: Audio Visual Experience Automatic Virtual Environment", Communications of the ACM, Vol. 35, No. 6, pp. 64-72, (June 1992).
[6] K. Czajkowski, D.F. Ferguson, I. Foster, J. Frey, S. Graham, I. Sedukhin, D. Snelling, S. Tuecke, W. Vambenepe, "The WS-Resource Framework. Version 1.0", (March 2004).
[7] I. Foster, C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit", International Journal of Supercomputing Applications, Vol. 11, No. 2, pp. 115-128, (1997).
[8] I. Foster, J. Insley, G. von Laszewski, C. Kesselman, M. Thiebaux, "Distance Visualization: Data Exploration on the Grid", IEEE Computer Magazine, Vol. 32, No. 12, pp. 36-43, (December 1999).
[9] I. Foster, C. Kesselman (Eds.), "The Grid. Blueprint for a New Computing Infrastructure", Morgan Kaufmann Publishers, (1999).
[10] I. Foster, C. Kesselman, "Computational Grids", in [9], pp. 15-51, (1999).
[11] A.S. Glassner (Ed.), "An Introduction to Raytracing", Academic Press, (1989).
[12] P. Heinzlreiter, D. Kranzlmüller, "Visualization Services on the Grid – The Grid Visualization Kernel", Parallel Processing Letters (PPL), Vol. 13, No. 2, pp. 135-148, (June 2003).
[13] E. Heymann, A. Fernandez, M.A. Senar, J. Salt, "The EU-CrossGrid Approach for Grid Application Scheduling", in: Proc. of the 1st European Across Grids Conference (AxGrids '03), Santiago de Compostela, Spain, pp. 17-24, (February 2003).
[14] L. Hluchy, O. Habala, V.D. Tran, B. Simo, J. Astalos, M. Dobrucky, "Infrastructure for Grid-based Virtual Organizations", in: Proc. of the International Conference on Computational Science (ICCS '04), Cracow, Poland, Part III, LNCS 3038, Springer Verlag, pp. 124-131, (June 2004).
[15] S. Kenny, B.A. Coghlan, "Towards a Grid-Wide Intrusion Detection System", in: Proc. of the European Grid Conference (EGC '05), Amsterdam, The Netherlands, LNCS 3470, Springer Verlag, (February 2005).
[16] B. Krammer, M.S. Müller, M.M. Resch, "MPI Application Development Using the Analysis Tool MARMOT", in: Proc. of the International Conference on Computational Science (ICCS '04), Cracow, Poland, LNCS 3038, Springer Verlag, pp. 464-471, (June 2004).
[17] D. Kranzlmüller, H. Rosmanith, P. Heinzlreiter, M. Polak, "Interactive Virtual Reality on the Grid", in: Proc. of the 8th IEEE International Symposium on Distributed Simulation and Real-time Applications (DS-RT '04), Budapest, Hungary, (October 2004).
[18] M. Kupczyk, R. Lichwala, N. Meyer, B. Palak, M. Plociennik, P. Wolniewicz, "Migrating Desktop Interface for Several Grid Infrastructures", in: Proc. of Parallel and Distributed Computing and Networks (PDCN '04), Innsbruck, Austria, (February 2004).
[19] W.E. Lorensen, H.E. Cline, "Marching Cubes: A High Resolution 3D Surface Construction Algorithm", in: Proc. of ACM SIGGRAPH '87, Anaheim, CA, USA, pp. 163-169, (July 1987).
[20] M. Polak, D. Kranzlmüller, J. Volkert, "G-VID – A Dynamic Grid Videoservice for Advanced Visualization", in: Proc. of the Cracow Grid Workshop (CGW '04), Cracow, Poland, pp. 471-477, (December 2004).
[21] B. Quatember, F. Veit, "Simulation Model of the Coronary Artery Flow Dynamics and Its Applicability in the Area of Coronary Surgery", in: Proc. of the 1995 EUROSIM Conference, Vienna, Austria, pp. 945-950, (September 1995).
[22] H. Rosmanith, D. Kranzlmüller, "glogin – A Multifunctional, Interactive Tunnel into the Grid", in: Proc. of the 5th IEEE/ACM International Workshop on Grid Computing (GRID '04), Pittsburgh, PA, USA, pp. 266-272, (November 2004).
[23] P.M.A. Sloot, G.D. van Albada, E.V. Zudilova, P. Heinzlreiter, D. Kranzlmüller, J. Volkert, "Grid-Based Interactive Visualization of Medical Images", in: S. Norager, Proc. HealthGrid, 1st European HealthGrid Conference, Working Document, Lyon, France, pp. 57-66, (January 2003).
[24] A. Tirado-Ramos, H. Ragas, D. Shamonin, H. Rosmanith, D. Kranzlmüller, "Integration of Blood Flow Visualization on the Grid: the FlowFish/GVK Approach", Revised Papers, 2nd European AcrossGrids Conference (AxGrids 2004), Nicosia, Cyprus, LNCS 3165, Springer Verlag, (January 2004).
[25] G. Tsouloupas, M.D. Dikaiakos, "GridBench: A Tool for Benchmarking Grids", in: Proc. of the 4th IEEE/ACM International Workshop on Grid Computing (GRID '03), Phoenix, AZ, USA, (November 2003).
[26] Y. Wang, F. De Carlo, D. Mancini, I. McNulty, B. Tieman, J. Bresnahan, I. Foster, J. Insley, P. Lane, G. von Laszewski, C. Kesselman, M.-H. Su, M. Thiebaux, "A High-Throughput X-Ray Microtomography System at the Advanced Photon Source", Review of Scientific Instruments, Vol. 71, No. 4, pp. 2062-2068, (April 2001).
[27] L.A. Westover, "Splatting: A Parallel, Feed-Forward Volume Rendering Algorithm", Doctoral Thesis, University of North Carolina at Chapel Hill, NC, USA, (January 1992).
[28] G. Graschew, S. Rakowsky, P. Balanou, P.M. Schlag, "Interactive telemedicine in the operating theatre of the future", J. Telemed. Telecare, Vol. 6, Suppl. 2, pp. 20-24, (2000).
[29] R.U. Pande, Y. Patel, C.J. Powers, G. D'Ancona, H.L. Karamanoukian, "The telecommunication revolution in the medical field: present applications and future perspective", Curr. Surg., Vol. 60, pp. 636-640, (2003).
[30] C. Dario, A. Dunbar, F. Feliciani, M. Garcia-Barbero, S. Giovannetti, G. Graschew, A. Güell, A. Horsch, M. Jenssen, L. Kleinebreil, R. Latifi, M.M. Lleo, P. Mancini, M.T.J. Mohr, P. Ortiz García, S. Pedersen, J.M. Pérez-Sastre, A. Rey, "Opportunities and Challenges of eHealth and Telemedicine via Satellite", Eur. J. Med. Res., Vol. 10, Suppl. I, pp. 1-52, (2005).
[31] G. Graschew, T.A. Roelofs, S. Rakowsky, P.M. Schlag, "Broadband Networks for Interactive Telemedical Applications", APOC 2002, Applications of Broadband Optical and Wireless Networks, Shanghai 2002, Proc. of SPIE, Vol. 4912, pp. 1-6, (2002).
[32] G. Graschew, T.A. Roelofs, S. Rakowsky, P.M. Schlag, "Überbrückung der digitalen Teilung in der Euro-Mediterranen Gesundheitsversorgung – das EMISPHER-Projekt" [Bridging the digital divide in Euro-Mediterranean health care – the EMISPHER project], in: A. Jäckel (Ed.), Telemedizinführer Deutschland 2005, Ober-Mörlen, pp. 231-236, (2005).
[33] G. Graschew, T.A. Roelofs, S. Rakowsky, P.M. Schlag, S. Kaiser, S. Albayrak, "Telemedical applications and GRID technology", in: P.M.A. Sloot et al. (Eds.), Advances in Grid Computing – EGC 2005, European GRID Conference, Amsterdam, The Netherlands, 14-16.2.2005, pp. 1-5, (2005).
[34] G. Bellaire, G. Graschew, F. Engel-Murke, M. Krauss, P. Neumann, P.M. Schlag, "Interactive telemedicine in surgery: Fast 3-D visualization of medical volume data", Min. Inv. Med., Vol. 8, pp. 22-26, (1997).
Final Results and Exploitation Plans for MammoGrid

Chiara Del Frate (c), Jose Galvez (a), Tamas Hauer (b), David Manset (d), Richard McClatchey (b), Mohammed Odeh (b), Dmitry Rogulin (b), Tony Solomonides (b), Ruth Warren (e)

(a) CERN, 1211 Geneva 23, Switzerland
(b) CCCS Research Centre, Univ. of the West of England, Frenchay, Bristol BS16 1QY, UK
(c) Istituto di Radiologia, Università degli Studi di Udine, Italy
(d) Maat GKnowledge, Toledo, Spain
(e) Breast Care Unit, Addenbrookes Hospital, Cambridge, UK
Abstract. The MammoGrid project has delivered the first deployed instance of a healthgrid for clinical mammography that spans national boundaries. During the last year, the final MammoGrid prototype has undergone a series of rigorous tests undertaken by radiologists in the UK and Italy, and this paper draws conclusions from those tests for the benefit of the Healthgrid community. In addition, lessons learned during the lifetime of the project are detailed and recommendations drawn for future health applications using grids. Following the completion of the project, plans have been put in place for the commercialisation of the MammoGrid system, and this is also reported in this article. Particular emphasis is placed on the issues surrounding the transition from a collaborative research project to a marketable product. The paper concludes by highlighting some of the potential areas of future development and research.

Keywords. Medical imaging, grid application, deployment, exploitation and commercialisation
1. Introduction
The EU-funded MammoGrid project set out to explore the following conjecture: grid technology and standards have evolved to the point where a prototype federated database of mammograms might be constructed, based on centres in three European countries (UK, Italy and Switzerland). The project was conducted between September 2002 and August 2005 and comprised partners from the universities of Oxford, UWE-Bristol and Cambridge in the UK, CERN in Switzerland, and institutes in Pisa, Sassari and Udine in Italy. The project developed a series of prototypes, and the reader is referred to the papers published by MammoGrid partners [1], [2], [3] for the technical aspects of the project. In the last year of the project, a final prototype was constructed and deployed to the university hospitals in Cambridge and Udine for a proof-of-concept evaluation that would demonstrate the use of a grid-based medical platform in clinical tests. More specifically, MammoGrid set out to explore the following clinical issues through the delivery of the first grid-based mammography platform:
• Image standardisation: The appearance of a mammogram is greatly affected by differences in image acquisition processes (machine type, filter, exposure time, etc.). Such differences can significantly impact radiologists' judgements (presence of microcalcifications, estimation of the proportion of dense breast tissue). Ideally, a given image would be "standardised" by removing such anatomically irrelevant variations prior to adding it to the database. MammoGrid explored the possibility of standardising images using the Standard Mammogram Form™ (SMF) representation developed by Highnam and Brady [4].
• Breast density as a risk factor: It has been suggested, primarily by Boyd [5] and others, that the amount/percentage of dense tissue in the breast is a major (perhaps the major) risk factor for breast cancer (after taking due account of lifetime experiences). The SMF representation provides a number of measures of the amount/proportion of dense tissue, so the MammoGrid clinical partners at the Cambridge and Udine hospitals sought to compare measurements of breast density as provided by SMF with the standard methods of visual assessment [6] and the automated 2D interactive computer programme available from Yaffe and Boyd [7]; a toy percent-density computation is sketched after this list.
• Computer-aided detection of microcalcifications and masses: Prior to MammoGrid, the project partners from Sassari and Pisa Universities had developed a system named CALMA (Computer Aided Library for Mammography) for the detection of lesions and microcalcifications with reportedly good sensitivity and specificity [8]. The aim was to use the database of mammograms generated during the MammoGrid project at Cambridge and Udine to (a) re-assess the sensitivity and specificity of CALMA, and (b) examine whether its performance would be improved on the standardised images generated by SMF.
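To make the notion of a density measure concrete, the sketch below computes a simple percent-density figure from segmented breast images. This is an illustrative stand-in only: the real SMF measures come from Highnam and Brady's physics-based model [4], and all names and inputs here are invented for the example.

```python
import numpy as np

def percent_density(dense_mask: np.ndarray, breast_mask: np.ndarray) -> float:
    """Percent density = 100 * (dense-tissue area) / (total breast area).

    dense_mask and breast_mask are boolean pixel masks; the hard part,
    segmenting dense tissue (which SMF solves with a physics model of
    the imaging process), is assumed to have happened upstream.
    """
    breast_area = int(breast_mask.sum())
    if breast_area == 0:
        raise ValueError("empty breast mask")
    dense_area = int((dense_mask & breast_mask).sum())
    return 100.0 * dense_area / breast_area
```

A measure of this kind is what the clinical partners compared against visual assessment [6] and the interactive programme of [7].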
MammoGrid is one of a number of European Healthgrid projects, e.g. [9], [10] and [11]. We note also that the UK "e-Science" programme has funded the e-Diamond project [12], which also aims at developing a federated database of mammograms; but whereas MammoGrid is based on open-source software, e-Diamond is based on proprietary (IBM) technology and concentrates on two complementary applications, namely teaching and FindOneLikeIt. Also, in the United States, the National Digital Mammography Archive (NDMA) [13] has adopted a radically different approach, using a large centralised archive. In this paper, we report on the outcome of the clinical tests in MammoGrid and identify how far the project progressed towards real clinical use of a healthgrid. The paper identifies both achievements and obstacles in the use of a deployed grid-based healthcare application and highlights the lessons learned from MammoGrid. In addition, the post-project plans for exploitation and commercialisation of the project software are outlined.
2. The MammoGrid Clinical Evaluation
The MammoGrid project has several technical aspects, which are briefly mentioned in this section in order to convey a flavour of the scope and complexity of the challenges that were met. These aspects include: image standardisation using SMF; the development of a workstation on which images can be acquired, annotated and uploaded to the grid; and the distribution of data, images and clinician queries across grid-based databases, while respecting the ethical, legal, confidentiality and security constraints differently applicable in the partner countries of origin. It was not the intention of this project to produce new grid
software. Rather, the aim has been to use, wherever possible, open-source software to provide middleware services that enable radiologists to query patient records across a widely distributed "federated" database of mammographic images and to perform epidemiological analysis and computer-aided detection on the sets of returned images. For example, in the MammoGrid project, radiologists may annotate (i.e. mark out) different regions of a mammogram, which are then subjected to different computer-aided detection algorithms (including CALMA) and compared with stored mammograms in the database. Since any one of these stages may be executed independently or take some time to complete, the process must be controlled in a way that recognises the current state of the computation and ensures that results are meaningfully assembled from the various partial outcomes. To provide for these possibilities, MammoGrid adopted AliEn (Alice Environment) [15], a lightweight grid middleware developed to satisfy the needs of the ALICE experiment at CERN for large-scale distributed computing. The details of AliEn and how it was used in the design of MammoGrid are beyond the scope of this paper (cf. [2]). The MammoGrid project has delivered its final proof-of-concept prototype enabling clinicians to store digitised mammograms along with appropriately anonymized patient metadata (an illustrative anonymisation step is sketched after the list below); the prototype provides controlled access to mammograms both locally and remotely stored. A typical database comprising several thousand mammograms has been created for user tests of clinicians' queries. The prototype comprises:
• a high-quality clinician visualisation workstation (used for data acquisition and inspection);
• an imaging standard-compliant interface to a set of medical services (annotation, security, image analysis, data storage and querying services) residing on a so-called 'Grid-box'; and
• secure access to a network of other Grid-boxes connected through grid middleware.
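The paper does not publish MammoGrid's anonymisation profile, so the following is only a minimal sketch of the kind of step implied above, written with the pydicom library; the tag list and the pseudonym scheme are assumptions.

```python
import pydicom

# Illustrative subset of identifying attributes; a real profile would
# follow the de-identification rules agreed by the project's ethics bodies.
IDENTIFYING_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress",
                    "ReferringPhysicianName", "InstitutionName"]

def anonymise(in_path: str, out_path: str, pseudo_id: str) -> None:
    """Blank direct identifiers and substitute a stable pseudonym, so
    that records remain queryable across sites without revealing
    patient identity."""
    ds = pydicom.dcmread(in_path)
    for keyword in IDENTIFYING_TAGS:
        if keyword in ds:
            ds.data_element(keyword).value = ""
    ds.PatientID = pseudo_id  # same pseudonym for all of a patient's images
    ds.save_as(out_path)
```

Keeping a stable pseudonym per patient is what allows the federated queries described below to group images by (anonymous) patient.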
Figure 1: The MammoGrid Virtual Organisation
To facilitate evaluation of the final prototype at the clinical sites, a MammoGrid Virtual Organisation (MGVO) was established and deployed (as shown in Figure 1). The MGVO is composed of three mammography centres: Addenbrookes Hospital, Udine Hospital, and Oxford University. These centres are autonomous and independent of each other with respect to their local data management and ownership. The Addenbrookes and Udine hospitals have locally managed databases of mammograms, with several thousand cases between them. As part of the MGVO, registered clinicians have access to (suitably anonymized) mammograms, results, diagnoses and imaging software from other centres. Access is coordinated by the MGVO central node at CERN.
The service-oriented architecture (SOA) approach adopted in MammoGrid permits the interconnection of communicating entities, called services, which provide functionality through the exchange of messages. The services are 'orchestrated' in terms of service interactions: how services are discovered, how they are invoked, what can be invoked, the sequence of service invocations, and who can execute them. The MammoGrid Services (MGs) are a set of services for managing mammographic images and associated patient data on the grid. Figure 2 illustrates the services that make up the MGVO (for simplicity, the node at Oxford University is not shown). The MGs are:
(a) Add, for uploading files (DICOM [16] images and structured reports) to the MGVO;
(b) Retrieve, for downloading files from the grid system;
(c) Query, for querying the federated database of mammograms;
(d) AddAlgorithm, for uploading executable code to the grid system;
(e) ExecuteAlgorithm, for executing grid-resident executable code on grid-resident files; and
(f) Authenticate, for logging into the MGVO.
For further details consult [3]; an illustrative client-side view of these services is sketched below.
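To show the shape of this service set, here is a hypothetical client-side interface. The six operation names come from the paper; every signature, parameter and return type is an assumption, since the actual service contracts are not reproduced here.

```python
from abc import ABC, abstractmethod

class MammoGridClient(ABC):
    """Hypothetical client-side view of the six MammoGrid services."""

    @abstractmethod
    def authenticate(self, certificate: bytes) -> str:
        """Log into the MGVO; returns an opaque session token."""

    @abstractmethod
    def add(self, session: str, dicom_image: bytes, report_xml: str) -> str:
        """Upload a DICOM image plus structured report; returns a grid file id."""

    @abstractmethod
    def retrieve(self, session: str, file_id: str) -> bytes:
        """Download a (possibly remote) grid-resident file."""

    @abstractmethod
    def query(self, session: str, formal_query: str) -> list[dict]:
        """Run a query against the federated database of mammograms."""

    @abstractmethod
    def add_algorithm(self, session: str, executable: bytes) -> str:
        """Upload executable code (e.g. a CADe algorithm); returns its id."""

    @abstractmethod
    def execute_algorithm(self, session: str, algorithm_id: str,
                          file_ids: list[str]) -> str:
        """Run grid-resident code on grid-resident files; returns a job id."""
```

The split between AddAlgorithm and ExecuteAlgorithm reflects the design choice discussed above: code is moved to the grid once and then run close to the (large, distributed) image data.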
Figure 2: The MammoGrid Services in the MGVO.
Currently the MGVO encompasses data accessible to senior radiologists at the Addenbrookes and Udine hospitals, as well as researchers at Oxford University. The radiologists have been able to view raw image data from each other's hospitals, to second-read grid-resident mammograms, and each to annotate the images separately for a combined diagnosis. This has demonstrated the viability of distributed image analysis using the grid and shown considerable promise for future grid-based health applications. Despite the anticipated performance limitations that existing grid software and networks impose on system usage, the clinicians have been able to discover new ways to collaborate using the virtual organisation. These include the ability to perform queries over a virtual repository spanning data held in Addenbrookes and Udine.
Figure 3: Clinical Query Handling in MammoGrid.
Clinicians define their mammogram analysis in terms of queries they wish to be resolved across the collection of data repositories. Queries can be categorised into simple queries (mainly against associated data stored in the database as simple attributes) and complex queries, which require derived data to be interrogated or an algorithm to be executed on a (sub-)set of distributed images. The important aspect is that image and data distribution are transparent to radiologists, so that queries can be formulated and executed as if the records were locally resident. Queries are executed at the location where the relevant data reside, i.e. sub-queries are moved to the data rather than large quantities of data being moved to the clinician, which could be prohibitively expensive given the volume of image data in particular. Figure 3 illustrates how queries are handled in MammoGrid. The Query Analyzer takes a formal query representation and decomposes it into (a) a formal query for local processing and (b) a formal query for remote processing. It then forwards these decomposed queries to the Local Query Handler and the appropriate Remote Query Handler for the resolution of the request. The Local Query Handler generates query language statements (e.g. SQL) in the language of the associated local DBMS (e.g. MySQL). The result set is converted to XML and routed to the Result Handler. The Remote Query Handler is a portal for propagating queries and results between sites; it forwards the formal query for remote processing to the Query Analyzer of the remote site. Finally, the remote query result set is converted to XML and routed to the Result Handler. A minimal sketch of this decomposition is given after Table 1. As of writing, the MGVO holds:
Site        Number of   Total Number     Number of   Associated Image   File Storage
            Patients    of Image Files   SMF Files   Data Size          Size
Cambridge   1423        9716             4815        14 MB              260 GB
Udine       1479        17285            8634        23.5 MB            220 GB

Table 1: Virtual repository size of the MammoGrid prototype.
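The following is a minimal sketch of the query path just described: a query is decomposed into per-site sub-queries, each site resolves its part and serialises the result set to XML, and the Result Handler merges the fragments. The formal query representation and all function names are assumptions; only the decompose/execute/merge structure comes from the text above.

```python
import xml.etree.ElementTree as ET

def decompose(query: str, local_site: str, remote_sites: list[str]) -> dict:
    """Query Analyzer: one sub-query per site (local plus each remote)."""
    return {site: query for site in [local_site, *remote_sites]}

def results_to_xml(site: str, rows: list[dict]) -> str:
    """Each query handler converts its result set to XML before routing it."""
    root = ET.Element("results", site=site)
    for row in rows:
        ET.SubElement(root, "row", {k: str(v) for k, v in row.items()})
    return ET.tostring(root, encoding="unicode")

def merge(xml_fragments: list[str]) -> ET.Element:
    """Result Handler: assemble the partial, per-site answers."""
    merged = ET.Element("merged")
    for fragment in xml_fragments:
        merged.append(ET.fromstring(fragment))
    return merged

# Example: resolve one query over the two clinical sites.
subqueries = decompose("age BETWEEN 50 AND 60", "cambridge", ["udine"])
fragments = [results_to_xml(site, [{"patients": 1}]) for site in subqueries]
combined = merge(fragments)
```

Only the small XML result sets cross the network; the images themselves stay where they are stored, which is the point of moving sub-queries to the data.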
The average processing times for the core services are: (1) adding an 8 MB DICOM file takes approximately 7 seconds, (2) retrieving an 8 MB DICOM file from a remote site takes approximately 14 seconds, and (3) the SMF workflow of ExecuteAlgorithm takes around 200 seconds. Table 2 presents examples of queries and their execution results.
Query                                Cambridge (sec)   Udine (sec)   Num images   Num patients
By Id: Cambridge patient             2.654             2.563         8            1
By Id: Udine patient                 2.844             3.225         16           1
All female                           103               91            12571        1510
Age [50,60] and ImageLaterality=L    19.489            22.673        1764         357

Table 2: Data query performance of the MammoGrid prototype.
In the final months of the project, clinicians tested the MammoGrid prototype functionality across two clinical studies. First, the Standard Mammogram Form (SMF) [4] software was used to measure breast density. This clinical phase of the project, designed jointly by Cambridge and Udine, explored the relationship between mammographic density, age, breast size and radiation dose. In this phase, breast density was measured by SMF and compared with standard methods of visual assessment. Heights, weights and body-mass indicators were used in an international comparison, and the output demonstrated that a richer dataset would be needed to study the effects of lifestyle factors, such as diet or HRT use, between the two national populations. Second, the University of Udine led a project to validate the use of SMF in association with Computer-Aided Detection (CADe) from the CALMA project [17], [18]. Cancers and benign lesions were supplied from the clinical services of Udine and Cambridge to provide the benchmarking and the set of test cases. Cancer cases included women whose unaffected breast contributed to the density study, while the mammogram of the affected side provided cases for the CADe analysis. Beyond the relevant clinical results obtained from these studies, the MammoGrid project has shown that these new forms of clinical collaboration can be supported using the grid (see [19] and [20]).
3. Lessons Learned
The nature of the project and its particular constraints (multi-disciplinarity, geographically dispersed development, large discrepancies in participants' domain knowledge, whether of software engineering techniques or of breast cancer screening practice, and the novelty of the grid environment) provide experiences from which other grid-based medical informatics projects can benefit. We summarize below some of the main lessons that can be learned in this context. First, the project was particularly fortunate in its medical partners. In general, the medical environment is highly risk-averse, conservative in nature and reluctant to adopt new technologies without significant evidence of tangible benefit. It is therefore important to identify a suitable user community in which new technologies (such as grids) can be evaluated. In the case of MammoGrid we had real commitment from the radiology community in the project's requirements definition, analysis, implementation and evaluation, and this was crucial to the success of the project. The data samples used were of a sensitive nature and required both ethical clearance from the participating institutions and anonymization of the data; even then, the data were restricted to strictly research use within the project. Many ethical and legal obstacles remain to be tackled before clinicians can share sensitive patient data between institutes, let alone across national boundaries. Second, it has become clear from our experiences that grid middleware technology itself is still evolving, and this suggests a clear need for standardization to enable production-quality systems to be developed. Despite the availability of toolsets such as the Globus Toolkit 4.0 [21], the development of applications that harness the power of the grid at
present requires specialist skills and is thus still costly in terms of manpower. Only with the arrival of stable middleware and packaged grid services will the development of medical applications become viable. Third, the performance of existing middleware is also somewhat suspect; the MammoGrid project had to circumvent some of the delivered grid services to ensure adequate system performance for its prototype evaluation. For example, the database of medical images was completely decoupled from the grid software to provide adequate response times for MammoGrid query handling. The EGEE project [14] is addressing these technological deficiencies, and improved middleware performance should consequently be delivered in the coming years. Fourth, grid technology for medical informatics is still in its infancy and needs proven examples of its applicability; MammoGrid is the first such exemplar in practice. Equally, awareness of grid technology and its potential (and current limitations) must still be raised in the target user communities such as health, biomedicine and, more generally, the life sciences. Fifth, the project has indicated that it is possible to use modelling techniques (such as use cases from UML) in a widely distributed, multi-disciplinary software engineering problem domain, provided a very pragmatic approach is taken, in which the adoption of a particular modelling technique is, to some extent, independent of the software development life cycle model being applied. The MammoGrid project benefited significantly in its coordination, communication and commitment by utilizing the use-case model as the lingua franca during user requirements analysis and system design, rather than following the disciplines of the Rational Unified Process (RUP) to the letter; see [22] for a detailed account of this approach. Furthermore, the transition from MammoGrid's use-case model to its SOA architectural model has demonstrated that the gap between use-case models and grid-based service-oriented architectures can be bridged. Sixth, the evolutionary approach to the system development work packages has mitigated the effects of the project constraints of a highly dynamic, research-oriented environment in which novices and specialists in software engineering worked together even though they were geographically separated. Further areas that might promote the use of rigorous software engineering disciplines in the design of grid-based software services are model-driven engineering [23] and the use of architecture descriptions ([24] and [25]) as the basis for the generation of grid-wide services. These aspects are, however, outside the scope of the current project.
4. Future Exploitation Plans
By using grid computing, the MammoGrid system allows hospitals, healthcare workers and researchers to share data and resources, while benefiting from an augmented overall infrastructure. It supports effective co-working, such as obtaining a second opinion, provides the means for powerful comparative analysis of mammograms, and opens the door to novel, broad-based statistical analysis of the incidence and forms of breast cancer. Through the MammoGrid project, partners have developed a strong collaboration between radiologists active in breast cancer research and academic computer scientists with expertise in the applications of grid computing. The success of the project has led to interest from outside companies and hospitals, with one Spanish company, Maat GKnowledge, looking to deploy a commercial variant of the system in three hospitals of the Extremadura region in Spain.
Maat GKnowledge aims to provide doctors with the ability to verify test results, to obtain a second opinion and to make use of the clinical experience acquired by the hospitals involved in the project. It then aims to scale the system up and to expand it to other areas of Spain and then to Europe. With the inclusion of new hospitals, the database will increase in coverage and the knowledge will increase in relevance and accuracy, enabling larger and more refined epidemiological studies. Clinicians will therefore be provided with a significant data set to better serve their investigations in the domain of cancer prevention, prediction and diagnosis. This will result in improved research quality as well as improved citizen access to the latest healthcare technologies. The MammoGrid prototype has been at the leading edge of the healthgrid revolution and is the first implementation of such a solution for mammogram acquisition and manipulation. The resulting application has reached a level of complexity that now requires continued partnership between academics, clinicians and industry to provide the necessary technology transfer and to enable real commercialisation. In this context, the MammoGrid Technology Transfer and Innovation eXchange (MaTTrIX) project has been proposed as a means to transfer the project's knowledge and expertise from the research to the commercial domain: to make its innovation available to the company Maat GKnowledge in Spain, to carry research findings forward into radiological practice, and to reinforce existing partnerships between networks of clinicians and technologists. To achieve these objectives, the intention is to introduce a service which exploits the findings of the MammoGrid project in practice, while the researchers also learn from the application of their ideas in a real environment. This will create a two-way innovation flow between the academic and commercial worlds, building on existing synergies and collaborations, and improving the overall viability and commercialisation of the MammoGrid system software. To this end, the host organisation Maat GKnowledge, the University of the West of England, Bristol, CIEMAT (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas), Spain, and CERN, together with clinical partners at the university hospitals in Cambridge and Udine and at the hospitals Infanta Cristina, Merida and San Benito in Extremadura, are initiating a transfer of knowledge, research competences and technologies to enable the company to take over future development of the MammoGrid system. This will enable the training of existing Maat GKnowledge staff in MammoGrid technologies and the acquisition of the technology and know-how for commercialisation. This transfer would constitute the mechanism for the creation and development of a durable technology and knowledge transfer partnership. The MaTTrIX technology transfer approach relies on two main cornerstones:
• to promote the mobility of key experienced researchers, to absorb, expand and disseminate the knowledge needed for the MammoGrid system to evolve gradually into a commercial offering, making it available for healthcare across the European Research Area (ERA) as a viable and clinically assessed solution; and
• to create and develop a strategic and durable partnership between new hospitals and partners of the MammoGrid project, providing a sound foundation for a network of excellence in European research.
Having paved the way for potential knowledge discovery in the understanding of breast cancer, the MammoGrid project has also identified important new research paths towards better cancer prediction and diagnosis. The use of a standard format for mammogram images (SMF), and its outcome in the epidemiological investigations, has demonstrated the relevance of new grid-based clinical studies, which may lead to major advances in cancer prediction. In addition, while most computer-aided detection (CADe) systems process raw
and noisy data at the price of accuracy and quality, the solutions implemented in MammoGrid have indicated the value of the joint use of SMF and CADe tools to improve automated cancer diagnosis. Considering the IT contribution of MammoGrid, the distributed technologies used in the project, combined with the recent clinical feedback obtained from the assessment of the proof-of-concept prototype, have highlighted the importance of offering a collaborative platform for healthcare. Not only would such a solution demonstrate the benefits of clinical second opinion, but it would also help to reduce information infrastructure costs by enabling heterogeneous and scalable resource sharing, resulting in improved system access even for less favoured regions and countries in Europe. Emphasising this last point, the hospitals of Infanta Cristina, Merida and San Benito in Extremadura, Spain, will obtain access to the MammoGrid system and expertise at a reduced cost, since the infrastructure is already in place. While accessing the latest technologies and software, these hospitals will share and enrich their clinical experience by interacting with other trained clinicians from the university hospitals in Cambridge and Udine to obtain expertise in the use of SMF and CADe. Academic computer scientists will continue to analyse the ways in which new functionality is exploited by radiologists in their protocols and workflows, keeping the design of the system under review. This new partnership will result in improved processes locally, and also in refined clinical knowledge being made available in a Europe-wide production reference database, enabling new clinical studies in the spirit of improving cancer prediction and detection. It is hoped that this collaboration will provide a significant exemplar for the European Research Area.
5. Conclusions
The MammoGrid project deployed its first prototype and performed a first phase of in-house tests, in which a representative set of mammograms was exchanged and tested between sites in the UK, Switzerland and Italy. In the subsequent phase of testing, clinicians were closely involved in performing tests, and their feedback was used to improve the applicability and performance of the system. In its first two years, the MammoGrid project faced interesting challenges originating from the interplay between the medical and computer sciences, and witnessed the excitement of a user community whose expectations of a new paradigm are understandably high; further challenges were met as the project moved into its final implementation and testing phase. This paper has outlined the MammoGrid application's deployment strategy and experiences, as well as the strategy adopted for migration to the new lightweight middleware gLite [26]. MammoGrid is one of several current projects that aim to harness recent technological advances to achieve the goal of complex data storage in support of medical applications. The approaches vary widely, but at least two projects, e-Diamond in the UK and MammoGrid in Europe, have adopted the grid as their platform of choice for the delivery of improved quality control, more accurate diagnosis and statistically significant epidemiology. Since breast screening in the UK and in Italy has been based on film, mammograms have had to be digitised for use in both e-Diamond and MammoGrid. By contrast, the NDMA project in the United States has opted for centralised storage of direct digital mammograms. The next step for MammoGrid, its application in the Spanish region of Extremadura, will be based on both film and direct digital mammography. The central feature of the MammoGrid project is a geographically distributed, grid-based database of standardised images and associated patient data. The novelty of the
MammoGrid approach lies in the application of grid technology and in the provision of data and tools that enable radiologists to compare new mammograms with existing ones in the grid database, allowing them to make comparative diagnoses as well as judgements about quality. In the longer term, the database has the potential to be populated with provenance-controlled, reliable data from across Europe, offering the prospect of statistically robust epidemiology that allows analysis of 'lifestyle' factors including, for example, diet, exercise and exogenous hormone use; the grid would then also be suitable for storing genetic or pathological image information. The project has attracted attention as a paradigm for grid-based radiology and imaging applications. While it has not yet solved all problems, the project has established an approach and a prototype platform for sharing medical data, especially images, across a grid. In loose collaboration with a number of other European medical grid projects, it is addressing the issues of informed consent and ethical approval, data protection, compliance with institutional, national and European regulations, and security [27], [28]. In conclusion, the MammoGrid project may be considered a major advance in bridging the gap between the grid as an advanced distributed computing infrastructure and the medical domain, and should therefore enable further grid-based projects to benefit from both its main lessons and its results.

Acknowledgements

The authors thank the European Commission and their institutes for support and acknowledge the contribution of the following MammoGrid collaboration members: Mike Brady and Chris Tromans (University of Oxford), Predrag Buncic and Pablo Saiz (CERN/AliEn), Martin Cordell, Tom Reading and Ralph Highnam (Mirada), Piernicola Oliva (University of Sassari), and Evelina Fantacci and Alessandra Retico (University of Pisa). The assistance of the MammoGrid clinical community is warmly acknowledged, especially that of Iqbal Warsi and Jane Ding of Addenbrookes Hospital, Cambridge, UK, and Dr Massimo Bazzocchi of the Istituto di Radiologia at the Università degli Studi di Udine, Italy. Last, but by no means least, the authors are indebted to the former MammoGrid Project Coordinator, Roberto Amendolia, both for his original innovative contribution and for his robust support.

References
[1] T. Hauer et al., "Requirements for Large-Scale Distributed Medical Image Analysis", Proc. of the 1st EU HealthGrid Workshop, Lyon, France, January 2003, pp. 242-249.
[2] F. Estrella et al., "Resolving Clinicians' Queries Across a Grid's Infrastructure", Methods of Information in Medicine, Vol. 44, No. 2, 2005, pp. 149-153. ISSN 0026-1270, Schattauer.
[3] S.R. Amendolia et al., "Deployment of a Grid-based Medical Imaging Application", Studies in Health Technology & Informatics 2005; 112: 59-69. IOS Press, ISBN 1-58603-510-X, ISSN 0926-9630.
[4] R. Highnam & M. Brady, Mammographic Image Analysis, 1st Ed., Dordrecht: Kluwer Academic Publishers; 1999.
[5] N. Boyd et al., "Mammographic Density as a Marker of Susceptibility to Breast Cancer: a Hypothesis", IARC Sci Publ 2001, pp. 163-169.
[6] E. Warner et al., "The Risk of Breast Cancer Associated with Mammographic Parenchymal Patterns: A Meta-Analysis of the Published Literature to Examine the Effect of Method of Classification", Cancer Detect Prev 1992; 16(1), pp. 67-72.
[7] J.W. Byng et al., "Automated Analysis of Mammographic Densities and Breast Carcinoma Risk", Cancer 1997; 80(1), pp. 66-74.
[8] M. Bazzocchi et al., "Application of a Computer-Aided Detection (CADe) System to Digitized Mammograms for Identifying Microcalcifications", Radiol Med (Torino) 2001; 101(5), pp. 334-340.
[9] I. Blanquer et al., "Clinical Decision Support Systems (CDSS) in GRID Environments", Studies in Health Technology & Informatics 2005; 112: 80-89. IOS Press, ISBN 1-58603-510-X, ISSN 0926-9630.
[10] J.L. Oliveira et al., "DiseaseCard: A Web-Based Tool for the Collaborative Integration of Genetic and Medical Information", Biological and Medical Data Analysis, 5th International Symposium, ISBMDA 2004, Barcelona, Spain, November 18-19, 2004.
[11] GEMSS: Grid-Enabled Medical Simulation Services. http://www.gemss.de
[12] M. Brady et al., "eDiamond: A Grid-Enabled Federated Database of Annotated Mammograms", in: F. Berman, G. Fox, T. Hey (Eds.), Grid Computing: Making the Global Infrastructure a Reality, Wiley, 2003.
[13] NDMA: The National Digital Mammography Archive. Contact: Mitchell D. Schnall, M.D., Ph.D., University of Pennsylvania. See http://nscp01.physics.upenn.edu/ndma/projovw.htm
[14] The Information Society Technologies project EU-EGEE, EU Contract IST-2003-508833, 2003. See http://www.eu-egee.org
[15] P. Saiz et al., "AliEn – ALICE environment on the GRID", Nuclear Instruments and Methods A 502 (2003) 437-440, and http://alien.cern.ch
[16] DICOM: Digital Imaging and Communications in Medicine. http://medical.nema.org
[17] I. De Mitri, "The MAGIC-5 Project: Medical Applications on a Grid Infrastructure Connection", Stud Health Technol Inform 2005; 112: 157-166. IOS Press, ISBN 1-58603-510-X, ISSN 0926-9630.
[18] S. Bagnasco et al., "GPCALMA: a Grid Approach to Mammographic Screening", Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 518, Issue 1, 2004, pp. 394-398.
[19] R. Warren et al., "A Prototype Distributed Mammographic Database for Europe", under review at European Radiology, Springer.
[20] R. Warren et al., "A Comparison of Some Anthropometric Parameters Between an Italian and a UK Population: Proof of Principle of a European Project Using MammoGrid", under review at European Radiology, Springer.
[21] The Globus Toolkit 4.0. http://www.globus.org/toolkit/
[22] M. Odeh, T. Hauer, R. McClatchey, A. Solomonides, "Use-Case Driven Approach in Requirements Engineering: the MammoGrid Project", in: M.H. Hamza (Ed.), Proc. of the 7th IASTED Int. Conference on Software Engineering & Applications, Marina del Rey, CA, USA, ACTA Press, November 2003, pp. 562-567.
[23] A. Kleppe, J. Warmer & W. Bast, MDA Explained: The Model Driven Architecture™: Practice and Promise, Addison-Wesley Professional, 2003; and R. Kazman, S.G. Woods, S.J. Carrière, "Requirements for Integrating Software Architecture and Reengineering Models: CORUM II", Software Engineering Institute, Carnegie Mellon University.
[24] F. Oquendo, S. Cimpan & H. Verjus, "The ArchWare ADL: Definition of the Abstract Syntax and Formal Semantics", ARCHWARE European RTD Project IST-2001-32360. See also http://www.arch-ware.org
[25] D. Manset, R. McClatchey, F. Oquendo & H. Verjus, "A Formal Architecture-Centric Model-Driven Approach for the Automatic Generation of Grid Applications", accepted by the 8th Int. Conference on Enterprise Information Systems (ICEIS06), Paphos, Cyprus, May 2006.
[26] gLite: Lightweight Middleware for Grid Computing. http://glite.web.cern.ch/glite
[27] S. Nørager and Y. Paindaveine, "The Healthgrid Terms of Reference", EU Report Version 1.0, 20 September 2002.
[28] T. Solomonides, R. McClatchey, V. Breton, Y. Legré & S. Nørager (Eds.), From Grid to Healthgrid, Studies in Health Technology & Informatics Vol. 112, IOS Press, ISBN 1-58603-510-X, ISSN 0926-9630 (Proceedings of the 3rd Healthgrid International Conference (HG'05), Oxford, UK, April 2005).
Part VI Posters and Short Contributions
Proposing a roadmap for HealthGrids

Vincent Breton (1), Ignacio Blanquer (2), Vicente Hernandez (2), Yannick Legré (1) and Tony Solomonides (3)

(1) LPC, CNRS-IN2P3, Campus des Cézeaux, 63177 Aubière Cedex, France
(2) Universidad Politecnica de Valencia, Spain
(3) University of the West of England, Bristol, Coldharbour Lane, Bristol BS16 1QY, United Kingdom
Abstract. With the steady progress of technology and infrastructures, a growing number of grid applications are being developed and deployed for life science and medical research. At the last HealthGrid conference, in April 2005 in Oxford, many groups described successful use of grids for compute-intensive calculations, and a very large scale deployment of a biomedical application in the area of drug discovery was achieved on EGEE during 2005. On the other hand, apart from a few pioneers, very few data grids have been deployed so far, and knowledge grids are still at a conceptual level. This situation is expected to evolve quickly, as many projects are focused on developing data management services and knowledge management tools relevant to the biomedical sciences. At this stage, it is important to identify the potential bottlenecks and to define a roadmap for the wide adoption of grids for healthcare. This article presents an analysis of the present adoption of grids for biomedical sciences and healthcare in Europe: it identifies bottlenecks and proposes actions that will be further assessed within the framework of the SHARE European project, dedicated to the definition of a roadmap for HealthGrids.
1. Introduction

The emergence of grid technology opens new perspectives for interdisciplinary research at the crossroads of medical informatics, bioinformatics and systems biology, with an impact on healthcare. A HealthGrid is an environment where data of medical interest can be stored, processed and made easily available to the different actors of healthcare: physicians, healthcare centres and administrations and, of course, citizens. If such an infrastructure offers all guarantees in terms of security, respect for ethics and observance of regulations, it allows the association of post-genomic information with medical data and opens up the possibility of individualized healthcare [1]. This enabling integration tool for medical applications also provides the infrastructure for a navigation space. Access to many different sources of medical data, usually geographically distributed, and the availability of computer-based tools that can extract knowledge from these data are key requirements for providing equal healthcare provision of high quality. Born from discussions between grid application developers and medical informaticians, the concept of the HealthGrid is now three years old. The yearly HealthGrid conferences are an opportunity to evaluate the growing usage of grids for life science and medical research; they also allow the obstacles to wider adoption to be identified. In
chapter 2, we illustrate the concept of a HealthGrid with a very simple example in which we highlight key issues related to the deployment of grids for healthcare. In chapter 3, we propose an analysis of the present adoption of grids by the biomedical sciences; recent accomplishments are also critically reviewed. Based on this analysis, we propose in chapter 4 some actions to address the present bottlenecks. In chapter 5, we describe the SHARE project, which aims at proposing a roadmap for HealthGrids. While the SHARE project will address all dimensions of a roadmap, including legal, social and ethical issues, this paper restricts itself to technical issues.
2. Concept of HealthGrid: illustration by an example

One of eHealth's important goals is to allow the transfer of information between hospitals in Europe. A very simple example is a practitioner in Hospital 1 needing to transfer a patient's Electronic Health Record (EHR) to Hospital 2 (Figure 1). For the sake of simplicity, we assume in this use case that there are no legal issues.
Figure 1: Transfer of a patient's EHR from Hospital 1 (EHR system 1) to Hospital 2 (EHR system 2) via a mediator.
To achieve this transfer, a first simple idea is to use a standard file transfer protocol. This will work only if the two hospitals' EHR systems have the same data model; the EHR data model describes the content of each data field. If the data models are different, a mediator is needed to interpret the data coming out of EHR system 1 and to translate it into the format used by EHR system 2 (a minimal sketch of such a mediator is given at the end of this section). The mediator can handle this translation provided the data models used by the two EHR systems are known. The mediator cannot invent information, so if the two EHR systems have different data fields, some fields will not be filled, or some fields may have no counterpart and their data will be lost. This use case illustrates very simply several needs for the transfer of information between healthcare centres in Europe:
• For Hospital 2 to request a patient record, it has to provide an identifier for this patient. This illustrates the need for a unique patient identifier allowing patient records to be queried while preserving their anonymity.
• For the mediator to be able to translate a patient record stored in Hospital 1, the data models of both EHR systems 1 and 2 must be known. Even if the two EHR systems are completely different, the mediator will reorganize information as needed, so EHR data models must be made publicly available.
Even the precise definition of the data fields must be provided in order to allow reliable translation; this requires a common vocabulary to define the data fields.
• EHR system 1 most probably has specific data fields which have no equivalent in EHR system 2, so some data fields will not be filled in the patient record at Hospital 2. However, it is of the utmost importance to have the most important data fields filled. This requires an agreed patient summary, with an agreed vocabulary to describe it.
The HealthGrid is the environment on which the services and resources needed to enable the above picture are provided:
• When Hospital 2 looks for a patient record, it does not necessarily know that Hospital 1 holds the record it is looking for. An information service is needed to provide the localization of patient records in Europe. This critical service must be constantly updated and needs to be replicated in order to avoid being a single point of failure. The information service needs the relevant security features, so that only authorized healthcare professionals are allowed to consult it.
• Another information service is needed to provide the data models of each healthcare centre storing medical patient records. This information service is consulted by the mediator before translating a patient record.
• A network of mediators is needed to address all the requests for patient record transfers in Europe. These mediators must also be updated to follow the evolution of the EHR data models.
This very simple example illustrates the role of a HealthGrid and the bottlenecks towards its deployment, including the interoperability of EHR systems and the definition of a unique patient identifier and an agreed patient summary. These issues are presently being addressed at a European level.
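As an illustration of the mediator role described above, the sketch below maps a record from one data model to another through a common intermediate representation. All field names and mappings are invented for the example; a real mediator would translate between published EHR data models using an agreed vocabulary.

```python
# Hypothetical field mappings between the two EHR data models.
MODEL_1_TO_COMMON = {"pat_dob": "birth_date", "dx": "diagnosis"}
COMMON_TO_MODEL_2 = {"birth_date": "dateOfBirth", "diagnosis": "primaryDiagnosis"}

def mediate(record_from_system_1: dict) -> dict:
    """Translate a record from EHR system 1's model to EHR system 2's.

    Fields with no counterpart are silently dropped, which is exactly
    the information loss discussed in the text above."""
    common = {MODEL_1_TO_COMMON[k]: v
              for k, v in record_from_system_1.items()
              if k in MODEL_1_TO_COMMON}
    return {COMMON_TO_MODEL_2[k]: v
            for k, v in common.items()
            if k in COMMON_TO_MODEL_2}

# mediate({"pat_dob": "1950-04-01", "dx": "C50.9", "local_notes": "..."})
# -> {"dateOfBirth": "1950-04-01", "primaryDiagnosis": "C50.9"}
#    ("local_notes" has no counterpart in model 2 and is lost.)
```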
3. A perspective on the present adoption of grids

Grids benefit from substantial funding from the European Commission and the member states. Among the present projects, the ones relevant to health can be roughly classified into three categories:
• Infrastructure projects aim at offering a stable distributed environment for scientific production. Examples of such infrastructures are EGEE [2] and DEISA [3] in Europe. These infrastructures offer a generic, multidisciplinary environment on which biomedical applications can be deployed.
• Technology projects aim at developing new grid-enabled services and environments relevant to the needs of life science and healthcare. Examples of such projects are SIMDAT [4] and MyGrid [5].
• End-user projects focus on specific life science or healthcare issues and integrate grid technology wherever they feel it relevant. Examples of such projects are MammoGrid [6] and GEMSS [7].
3.1. Adoption of grids for biomedical sciences

The biomedical sciences were identified very early as potential adopters of grid technology. The wealth and complexity of the data produced by the life sciences in the last 10 years require more and more resources and services for their storage and analysis. Medical research is also evolving quickly, with the generalized use of images and the growing integration of molecular biology in the perspective of individualized medicine.

3.1.1. Life science

Molecular biologists are facing a daunting challenge: the relevance of their research requires constant access to the databases containing all the knowledge acquired to date. Comparative analysis is a mandatory step in most molecular biology data analysis workflows. This analysis has to be repeated frequently to keep up with the exponentially growing volume of data stored in the databases. Comparative analysis is often the first step of the complex workflows needed to extract information from the data in genomics, transcriptomics and proteomics. At a basic level, grids can help distribute the databases in order to make them accessible to biologists [11] and can provide the computing resources required by data analysis. Bioinformatics portals like GPS@ [9] are presently under development on top of grid infrastructures. Grid technology is also very promising for addressing biological data complexity. Indeed, recent years have witnessed the development of hundreds of databases providing specific representations of biological data. Interoperability of these databases is a key to the development of the integrated approaches needed to start modelling living organisms. Projects such as Embrace [8] focus on addressing this interoperability issue using grid technology. Other projects, such as MyGrid [5], have been developing tools and environments to ease the design of data analysis workflows for biologists. The next step is to achieve the integration and deployment of these high-level interfaces on grid infrastructures, so as to offer biologists the data and computing resources needed for their analyses.

3.1.2. Medical research

Grid technology's entry points into medical research have most often been related to the need to manipulate large cohorts of medical images. The volume of medical images produced in European hospitals is comparable to the volume of data expected from the CERN Large Hadron Collider, which is of the order of several petabytes per year. Storing these images and running algorithms to extract their features require more and more resources. Attempts to distribute the storage of medical image databases on the grid have been confronted with the very limited data management services available on the grid infrastructures in Europe. Encouraging perspectives are opening up with the addition of data management services on infrastructures like EGEE, but the adoption of grids in medical research depends heavily on the availability and extension of such services. Attempts to use grids to bring together patients' medical and biological data are presently under exploration in several projects presented at this conference. The success of these approaches depends, again, on the capacity of the grid to provide the tools needed to manipulate these data.
3.1.3. Drug Discovery

In silico drug discovery is one of the most promising strategies to speed up the drug development process. Virtual screening is about selecting in silico the best candidate drugs acting on a given target protein. Screening can be done in vitro, but it is very expensive, as there are now millions of chemicals that can be synthesized. If it could be done in silico in a reliable way, one could reduce the number of molecules requiring in vitro and then in vivo testing from a few million to a few hundred. In silico drug discovery should foster collaboration between public and private laboratories. It should also have an important societal impact by lowering the barrier to developing new drugs for rare and neglected diseases. New drugs are needed for neglected diseases like malaria, where parasites keep developing resistance to the existing drugs, and sleeping sickness, for which no new drug has been produced for years. New drugs against tuberculosis are also needed, as the treatment currently takes several months and is therefore hard to manage in developing countries. In silico drug discovery on grids is a growing field. Grids like EGEE are ideally suited to the first step, in which docking probabilities are computed for millions of ligands. The grid's relevance was clearly demonstrated during the summer of 2005 by the WISDOM initiative on malaria [12], in which 46 million ligands were docked for a total of 80 CPU years (about 1 TFlops sustained over 6 weeks); the embarrassingly parallel structure of such a run is sketched below. A foreseeable next step is to enable a complete in silico drug discovery pipeline on the grid; such a pipeline would allow promising compounds to be identified very quickly. The first stage, which will be explored notably within European projects like BioInfoGrid, EGEE and Embrace, is the deployment of a virtual screening platform that would take advantage of the European grid infrastructures for docking and of a supercomputer for molecular dynamics computations.

3.2. Adoption of grids for healthcare

The adoption of grids for healthcare is still in its infancy, for several reasons. A first, obvious reason is that grid technology is still immature: it is neither robust nor secure enough to offer the quality of service required for clinical routine. Another important reason is that all grid infrastructure projects are deployed on National Research and Education Networks, which are separate from the networks used by healthcare structures. A further major obstacle is the legal framework in the EC member states, which has to evolve to allow the transfer of medical data on a European HealthGrid. This has not stopped pioneering projects from exploring and demonstrating the potential impact and relevance of grids for such outstanding healthcare issues as the early diagnosis of breast cancer [6] or the improvement of radiotherapy treatment planning [7]. Grids are expected to bring significant added value to the development of individualized medicine, which requires the exploitation of biological and medical data, but this is still a research field. The adoption of grids for healthcare will follow their adoption for life sciences and medical research, provided the legal and ethical framework of the member states allows their deployment.
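Returning to the virtual screening discussed in Section 3.1.3: what makes docking campaigns such as WISDOM fit grids so well is that a ligand library splits into fully independent tasks. The sketch below shows only that decomposition; the target name, chunk size and task format are invented, and actual job submission (e.g. through the EGEE workload management system) is omitted.

```python
from typing import Iterator

def split_screening(ligand_ids: list[str], target: str,
                    chunk_size: int = 1000) -> Iterator[dict]:
    """Partition a ligand library into independent docking tasks.

    Each task docks one chunk of ligands against the target protein;
    tasks share no state, so they can run on any free grid node."""
    for start in range(0, len(ligand_ids), chunk_size):
        yield {"target": target,
               "ligands": ligand_ids[start:start + chunk_size]}

# Example: a toy library of 10,000 ligands becomes 10 grid jobs.
library = [f"ligand-{i:07d}" for i in range(10_000)]
jobs = list(split_screening(library, target="example-target-protein"))
assert len(jobs) == 10
```

At the WISDOM scale (46 million ligands), the same decomposition yields tens of thousands of such jobs, which is what allowed roughly 80 CPU years of docking to complete in six weeks.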
4. Technical bottlenecks and proposed actions for a wider adoption of grids

The HealthGrid vision relies on the setting up of grid infrastructures for medical research and healthcare. The present bottlenecks on the way to this vision are the following:
• the availability of grid services, most notably for data and knowledge management;
• the deployment of these services on infrastructures involving healthcare centres such as hospitals, medical research laboratories and public health administrations;
• the definition and adoption of international standards and interoperability mechanisms for medical information stored on the HealthGrid.
The HealthGrid vision cannot be achieved without close collaboration between the projects developing grid middleware, deploying grid infrastructures and developing end-user-oriented biomedical grid applications.

4.1. Technical bottlenecks

Two worlds coexist today: the information world, which makes extensive use of web services, and the grid infrastructure world, which is slowly migrating to web services. Existing infrastructures in Europe are not yet based on this agreed standard, because it takes years to develop robust middleware and the migration to web services is a recent evolution of the grid standards.

4.1.1. Lack of grid data management services

The adoption of grids for medical research and clinical routine depends on the capacity of grids to manipulate data in a secure and efficient way. Medical data are complex, highly sensitive and presented in multiple formats. The data management services offered by grid infrastructures must be improved very significantly in order to allow such manipulations. The importance of a large, coordinated effort to achieve this goal must be stressed.

4.1.2. Lack of grid nodes in healthcare centres

Another bottleneck is related to the installation and maintenance of grid nodes in healthcare centres. Such deployment is still in its infancy because the configuration of a grid node is rather complex and requires significant manpower. Moreover, as stressed above, secure services for data management are still under development.

4.1.3. Lack of standards in medical informatics

Chapter 2 of this paper illustrated, with a very simple example, the role of a HealthGrid in exchanging information between two hospitals in Europe. It also highlighted the need for a unique patient identifier allowing patient records to be queried while preserving their anonymity, for publicly available EHR data models, and for an agreed patient summary with an agreed vocabulary to describe it. Work is under way at a European level to address these issues. For the HealthGrid vision to happen, standards must be agreed upon in the medical informatics community; such agreement is a precondition for the development of applications that obey these standards, use the grid services and are available from the grid nodes located in the healthcare centres.
4.2. Organizational bottlenecks

4.2.1. Insufficient technology transfer between EC projects

As a consequence of the technical bottlenecks identified above, very few projects led by biomedical end users are deployed on the European grid infrastructures available today. This is due most notably to the limited data management services offered by the infrastructures, their still user-unfriendly interfaces and the lack of information and training on grids in the biomedical community. Interesting data management services are under development in some technology-oriented projects, but the mechanism by which they will be deployed on existing grid infrastructures is unclear.

4.2.2. Lack of coordinating bodies

We demonstrated in chapter 2 how a European infrastructure such as a HealthGrid depends on the definition of standards. These standards are needed to achieve interoperability of healthcare systems and records, and their development requires coordination. The lack of agreed standards in medical informatics will be an obstacle to any large-scale infrastructure deployment. The absence of a reference body or structure in charge of defining such standards is a clear bottleneck to the development of grid technologies in healthcare.

4.3. First proposed actions

We recommend the creation of a dedicated infrastructure for medical research. From the beginning, this infrastructure should offer services such as database federation, distributed computing and data replication. Nodes of the infrastructure should be located in hospitals and healthcare centres, and it should host pilot medical research applications. A model for such an infrastructure is the BIRN project [13] of the National Institutes of Health's National Center for Research Resources. Launched in 2001, BIRN is prototyping a collaborative environment for biomedical research and clinical information management. The growing BIRN consortium currently involves 30 research sites from 21 universities and hospitals that participate in one or more of three test-bed projects: Morphometry BIRN, Function BIRN, and Mouse BIRN. These projects are centred around structural and/or functional brain imaging of human neurological disorders and associated animal models of disorders including Alzheimer's disease, depression, schizophrenia, multiple sclerosis, attention deficit disorder, brain cancer and Parkinson's disease. BIRN is an end-user-driven project based on robust middleware, and it addresses all dimensions from capacity building to service development. It is important to have projects on the model of BIRN, in which user communities can build grid infrastructures. We also recommend setting up a HealthGrid coordination body with real power to make choices about standards and middleware deployment on this dedicated infrastructure.
5. Proposing a roadmap for HealthGrid: the SHARE project
European leadership in grid deployment is recognized at a world level. This leadership is also internationally acknowledged in the area of HealthGrid. The concept of grids for health was born in Europe in 2002 and has been carried forward through the HealthGrid initiative. This European initiative has published, in collaboration with CISCO, a short version of the white paper setting out for senior decision makers the concept, benefits and opportunities offered by applying newly emerging Grid technologies in a number of different applications in healthcare. Starting from the conclusions of the White Paper, the EU-funded SHARE project aims to identify the important milestones towards the wide deployment and adoption of HealthGrids in Europe. The project will devise a strategy to address the issues identified in the action plan for a European e-Health Area [10]. It will also set up a roadmap for the technological developments needed for successful take-up of HealthGrids in the next 10 years. The widest audience will be solicited for comments and validation during most of the preparation phases. Grid infrastructures are designed at a world level, and the consortium is therefore planning to involve American and Asian participants at a later stage, so that the resulting roadmap has relevance beyond Europe.
The HealthGrid roadmap will cover the domain of RTD and uptake of Grid applications in healthcare comprehensively, including infrastructure, security, legal, financial, economic and other policy issues. Each section of the roadmap will detail the actions to be taken in terms of objectives and possible methods or approaches, as well as recommended milestones for completion, the stakeholders responsible, appropriate methods of coordination, etc. As a first view, the sections of the roadmap will cover the following domains: networks, infrastructure deployment, Grid operating systems, services to end users, standards requirements, security measures, legislative development and economic issues. The conceptual work during the start-up phase of the project will also specify in detail both the general scope and the specific features of the roadmap. The roadmap will focus on identifying requirements for further research and technology development, but it will also sketch a realistic picture of desirable applications/ICT implementations and indicate which technologies may have the potential to make a substantial contribution in this context. This will be supported through the presentation of good practice examples. To ensure that the RTD roadmap ultimately generated will actually yield positive results and the desired impacts, it will be based upon and, wherever possible, justified by empirical evidence from the research domain and a bottom-up assessment involving relevant stakeholders. In a sequential process, relevant research communities and communities of practice at EU, national and global levels will be joined up to enable an iterative refinement and extension of the initial roadmap. The HealthGrid roadmap is to be developed in a three-stage process based on two iterations (roadmaps I & II) and one synthesis, resulting in a full-scale validated and integrated roadmap. The technical roadmap component has to address the different levels relevant to such an infrastructure:
• The network must provide end-to-end high-bandwidth connectivity between the Grid nodes.
The services offered to the HealthGrid users will ultimately depend on the service level agreements between the network providers and the resource providers at each of the HealthGrid sites.
• The Grid infrastructure is made of resources distributed geographically over the different Grid nodes. These resources share the Grid's common operating system, which is the hidden low-level part of the middleware, sometimes called "underware". The services offered to the HealthGrid users depend on the functionalities offered by this operating system and on the amount and nature of the resources made available to the Grid. At this "underware" level, most of the functionalities needed are common to all Grid infrastructures, just as the DOS operating system used for PCs in hospitals is the same as for all other PCs. However, HealthGrid already exceeds e-science requirements at this level in areas such as security features for Access, Authentication and Authorization, performance and quality of service.
• The tools offered to the HealthGrid end users are made available through Grid interfaces. They are specific to medical research and healthcare, and their relevance, usability and performance are key to the HealthGrid's success. User-friendliness of these services requires calling high-level services that take care of knowledge management, which themselves call lower-level Grid services for access to distributed data and resources. Most of these high-level middleware services, sometimes called upperware, are specific to HealthGrids.
In the definition of the roadmap, particular attention must be paid to security and standards in the choice of HealthGrid operating system and technology:
• Security is not a choice but a mandate for HealthGrids. Security is an issue at all technical levels: networks need to provide protocols for secure data transfer, the Grid infrastructure needs to provide secure mechanisms for Access, Authentication and Authorization, and sites need to provide secure data storage. The Grid operating system needs to ensure access control to individual files stored on the Grid. High-level services need to properly manage the legal issues related to the protection of medical data.
• Standards must be respected and promoted on the road to HealthGrids. Standards are needed for Europe-wide compatibility and faster take-up. High-level middleware services dealing with medical data need to conform to Grid standards but also to medical informatics standards such as HL7 or DICOM.
RTD activities to address the issues limiting the full exploitation of HealthGrid technologies across Europe will be structured into a first version of the technology roadmap, to be discussed at the HealthGrid conference in Valencia and submitted to the European Commission in the fall of 2006. The roadmap will identify key short-term (2-5 years) and medium-term (4-10 years) RTD needs to achieve deployment of e-health systems in a Grid environment. It will also analyse unsolved RTD issues arising in the context of realistic approaches to priority clinical and public health settings (reflecting on models of use, benefits expected, concrete application experience and lessons learned, and the relevance of the open source model) and detail actions to be taken for networks, infrastructure deployment, Grid operating systems, services to end users, standards requirements and security measures. This first roadmap will recommend a number of case studies on specific aspects of technology issues requiring further investigation because they are identified as potential bottlenecks.
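On the secure-storage requirement in the list above, one common pattern (offered here only as an illustrative sketch under our own assumptions; the roadmap does not prescribe a specific mechanism) is to encrypt medical files before they leave the healthcare centre, so that grid storage nodes only ever hold ciphertext:

    from cryptography.fernet import Fernet  # third-party 'cryptography' package

    # Encrypting locally means untrusted storage elements never see plaintext.
    key = Fernet.generate_key()   # stays with the data owner, never on the grid
    cipher = Fernet(key)

    with open("scan.dcm", "rb") as f:        # hypothetical image file
        ciphertext = cipher.encrypt(f.read())

    # ciphertext can now be replicated on any storage element;
    # only key holders can recover the original bytes.
    original = cipher.decrypt(ciphertext)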
The first roadmap's recommendations will be validated against several use-case scenarios. As a result of this validation, new technological bottlenecks should be identified, requiring further RTD activities and a revision of the proposed technology roadmap. The revised roadmap will implement a process to present, discuss and validate the identified RTD needs and the resulting roadmap with the relevant RTD community. Actors in Grid development will be asked to validate and prioritise areas of future work on the basis of the highest expected short- and medium-term impact. Their endorsement is critical to the successful achievement of the proposed roadmap at the levels which are hidden from the user: networks, infrastructure deployment and Grid operating systems. Security, too, must be implemented at all levels. The project's technology partners will present and promote the revised roadmap in the different consortia in which they are involved (EGEE, DEISA, UK e-Science, Globus, national Grid initiatives…) to trigger the RTD activities identified.
6. Conclusion
This paper aimed at giving an overall analysis of the present status of HealthGrids in Europe. Through the simple example of the transfer of a patient health record between two hospitals, we have demonstrated the importance of a unique patient identifier allowing patient records to be queried while preserving their anonymity, and the need for publicly available EHR data models, for an agreed patient summary with an agreed vocabulary to describe it, and for interoperability mechanisms. We have also stressed the need for improved data management services on grid infrastructures. Indeed, the last HealthGrid conference witnessed several success stories in the usage of grids for compute-intensive tasks, but data grids are still to come. The analysis started in this document will be further developed and enlarged to social, legal and ethical issues within the framework of the EU-funded SHARE project, in order to produce a roadmap for the adoption of HealthGrids in Europe.
Acknowledgments
Many of the ideas expressed in this document have been further refined in discussions with members of the HealthGrid consortium. We particularly acknowledge fruitful exchanges with Veli Stroetman and Sofie Nørager.
References
[1] V. Breton, K. Dean and T. Solomonides, editors on behalf of the HealthGrid White Paper collaboration, "The HealthGrid White Paper", Proceedings of the HealthGrid conference, IOS Press, Vol. 112, 2005.
[2] F. Gagliardi, B. Jones, F. Grey, M.-E. Bégin and M. Heikkurinen, "Building an infrastructure for scientific Grid computing: status and goals of the EGEE project", Philosophical Transactions: Mathematical, Physical and Engineering Sciences, Vol. 363, No. 1833, 15 August 2005, pp. 1729-1742. DOI:10.1098/rsta.2005.1603.
[3] DEISA, http://www.deisa.org
[4] SIMDAT, http://www.scai.fraunhofer.de/simdat.html
[5] MyGrid, http://www.mygrid.org.uk/
[6] Mammogrid, http://mammogrid.vitamib.com/
[7] GEMSS, http://www.gemss.de/
[8] Embrace, http://www.embracegrid.info
[9] GPS@, C. Blanchet et al., Proceedings of the HealthGrid conference, IOS Press, Vol. 112, 2005. http://gpsa.ibcp.fr/
[10] Action plan for a European e-Health Area, COM(2004) 356, European Commission. http://europa.eu.int/information_society/doc/qualif/health/COM_2004_0356_F_EN_ACTE.pdf
[11] J. Salzemann, V. Breton, N. Jacq and G. Le Mahec, "Replication and update of molecular biology databases in a grid environment", submitted to FGCS, 2006.
[12] N. Jacq, J. Salzemann, Y. Legré, M. Reichstadt, F. Jacq, M. Zimmermann, A. Maas, M. Sridhar, K. Vinodkusam, H. Schwichtenberg, M. Hofmann and V. Breton, "In silico docking on grid infrastructures: the case of WISDOM", submitted to FGCS, 2006. http://wisdom.eu-egee.fr
[13] BIRN, http://www.nbirn.net/
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Remote Radiotherapy Planning: The eIMRT Project
Andrés GÓMEZ a, Carlos FERNÁNDEZ SÁNCHEZ a, José Carlos MOURIÑO GALLEGO a, Francisco J. GONZÁLEZ CASTAÑO b, Daniel RODRÍGUEZ-SILVA b, Javier PENA GARCÍA c, Faustino GÓMEZ RODRÍGUEZ c, Diego GONZÁLEZ CASTAÑO c, Miguel POMBAR CAMEÁN d
a Fundación Centro Tecnológico de Supercomputación de Galicia (CESGA), Santiago de Compostela, Spain {agomez,carlosf,jmourino}@cesga.es
b Departamento de Ingeniería Telemática, University of Vigo, Spain {javier,darguez}@det.uvigo.es
c Departamento de Física de Partículas, University of Santiago de Compostela, Spain {javierpg,faustgr}@usc.es
d Hospital Clínico Universitario de Santiago, Santiago de Compostela, Spain mrpombar@usc.es
Abstract. In this paper, we present the eIMRT project, which is currently carried out by diverse institutions in Galicia (Spain) and the USA. The eIMRT project will offer radiotherapists a set of algorithms to optimize and validate radiotherapy treatments, both CRT- and IMRT-based, hiding the complexity of the computer infrastructure needed to solve the problem using Grid technologies. The new platform is designed to be independent of the medical accelerator models, scalable and open. Having a web portal as client, it is designed in three layers using web services, which will allow users to access the platform directly from any front-end and client. It has three main components: remote characterization of linear accelerators for Monte Carlo and convolution/superposition (C/S) dose-calculation techniques; remote Grid-enabled radiotherapy treatment planning optimization and verification; and a data depository.
Keywords. Radiotherapy, Monte Carlo, treatment plan optimization and verification, CRT, IMRT, gLite
1. Introduction
Current radiotherapy treatment planning is based on local software tools (such as Pinnacle, XiO, Oncentra, Corvus, etc.) running on workstations at hospital premises. Specialized personnel compute treatment plans either employing their previous knowledge and experience, trial-and-error class solutions or, for more complex treatment plans, built-in optimization tools. These software tools, called treatment planning systems (TPS), are subject to very severe constraints on computer power and time to produce practical results, due to hospital workload and limited access to new algorithms. The requirement of a maximum computation time forces TPS tools to perform approximations both in dose calculation engines and in optimization algorithms. The most accurate dose calculation techniques in those codes are based on
convolution/superposition (C/S) [1] techniques, which suffer from certain limitations in high density gradient regions. Usually, the treatment is based on Conformal Radio Therapy (CRT) which, from a preselection of incident angles, fixed beam energy and exposure time, conforms the beam to the shape of the tumour for each angle. The more recent Intensity Modulated Radiation Therapy (IMRT) techniques select the intensity for each incident angle in great detail. State-of-the-art conformal and complex radiotherapy treatments such as IMRT are calculated and optimized employing TPS tools. Treatments are tailored to maximize the dose to the planned target volume (PTV) while minimizing the dose to surrounding tissues, especially the organs at risk (OAR), within the limits specified by the doctors. This is the main issue in the definition of the objective function of the optimization problems involved, which are solved using several well-known techniques such as simulated annealing [2,3], linear programming [4] or mixed integer programming [5]. The limitations and drawbacks of current TPS could be avoided if computationally intensive, user-friendly environments were available. The eIMRT project focuses on that issue by integrating and implementing several tools to help radiotherapists in selecting the best treatment. The project started in summer 2005 and will deliver the first services in 2006. Fully operational service is expected by the end of 2008. The project partners are CESGA, the University of Santiago de Compostela, the University of Vigo and the Complexo Hospitalario Universitario de Santiago (all of them in Galicia, Spain), with the collaboration of the Computer Sciences Department of the University of Wisconsin-Madison. At the end of the project, a single portal will offer radiotherapists several techniques to optimize treatments and verify them, following the new Service Oriented Architecture (SOA) paradigm. At the first stage, we plan to include the following tools:
• Monte Carlo methods [6] for treatment verification. They accurately model the interaction of radiation with matter and are the dominant standard in dose calculation techniques. They achieve more accurate results than convolution/superposition methods, at a higher computational cost. Although forthcoming achievements may render Monte Carlo a valid near-real-time treatment planning alternative, it currently plays an outstanding role as a validation technique. Therefore, the eIMRT platform will use it to verify the results of commercial treatment planning systems for any kind of radiotherapy plan of external photon beams.
• CRT and IMRT optimization algorithms. New web services will implement an open access point to exhaustive yet computationally intensive IMRT and CRT optimization algorithms, based on mixed Monte Carlo and C/S dose computation algorithms, which will produce optimized treatment plans of a quality (in terms of dose conformation and organ sparing) hardly achievable with commercial TPS.
• Finally, there is a joint effort of the groups involved in this project to establish an international public data repository with anonymized CT scans, treatments and other relevant information, which may be useful in the future to reproduce results, mine knowledge and train specialists.
Both Monte Carlo and IMRT optimization methods are computationally intensive and, furthermore, demand that radiotherapists acquire knowledge in fields that are beyond their usual experience, such as mathematical programming.
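To make the mathematical-programming side concrete, the following toy sketch poses fluence-map optimization as a linear program in the spirit of [4]. The dose-influence matrix, voxel labels and prescription values are invented for illustration only; a real system would obtain them from a C/S or Monte Carlo dose engine:

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical dose-influence matrix: D[i, j] = dose to voxel i per
    # unit weight of beamlet j.
    rng = np.random.default_rng(seed=0)
    n_vox, n_beamlets = 40, 8
    D = rng.uniform(0.0, 1.0, size=(n_vox, n_beamlets))
    ptv = np.arange(0, 15)    # planned target volume voxels
    oar = np.arange(15, 40)   # organ-at-risk voxels

    c = D[oar].sum(axis=0)     # objective: total dose delivered to the OAR
    A_ub = -D[ptv]             # -D w <= -d_min  is equivalent to  D w >= d_min
    b_ub = -np.ones(len(ptv))  # prescribed minimum PTV dose (arbitrary units)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    print("optimal beamlet weights:", np.round(res.x, 3))

The solver minimizes the dose to the organs at risk subject to every target voxel receiving at least the prescribed dose, with non-negative beamlet weights; mixed integer formulations add deliverability constraints on top of this.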
A good solution to both problems (computational cost and required expertise) is software decoupling, i.e. leaving the user interface on local machines (at the hospital, in our case) and taking the computing core to institutions that can effectively handle it. Software decoupling for user-friendly high-throughput computing
is not new [7] but, as far as we know, it is an original approach in the field of radiotherapy planning. Nowadays, approximately 50% of cancer patients receive radiotherapy. In 1999, more than 56,000 patients were irradiated in Spain alone [8]. All treatments follow a planning protocol to ensure the quality and effectiveness of the session, and the treatments should be planned in a short period of time (the mean time between the first visit and the beginning of radiotherapy treatment is 18.87 days in Spanish public hospitals [8]). Many radiotherapists have to plan 600 to 1200 patients per year, with a mean value of 925 [9]. This situation puts high pressure on them, raising the need for new optimization tools. The tool we propose may have a significant impact due to the high number of hospitals that may benefit from it. In Spain alone, there are over 115 particle accelerators in 70 hospitals with radiotherapy facilities [10], four of them in Galicia, including the Complexo Hospitalario Universitario de Santiago, which is a partner in the project. Given the extremely long CPU times for optimization and verification, this distributed computing problem is a clear use case for the exploitation of Grid technologies.
2. Architecture
Figure 1 shows a general overview of the eIMRT architecture. Note that the client interface is rather simple (HTML, Flash and Java support). Complexity is completely enclosed at the server side, which accesses high-throughput computing services via web services [11]. We plan to use the gLite middleware [12], since one of the project partners (Centro de Supercomputación de Galicia) is also an EGEE partner. We foresee three types of users, with different privilege levels. Figure 2 summarizes them with the interactions they may perform.
[Figure 1. High-level eIMRT architecture: a web client at the hospital (HTML, Flash and applets) connects over the Internet to the web/SOAP server at CESGA, which calls the web services and, through the computing interface, the Condor and gLite resources and the database.]
[Figure 2. User types and their interactions. User types: global admin, hospital admin and medical physicist. Interactions: accelerator management, algorithm management, hospital management, accelerator selection, accelerator characterization, user management, treatment validation, treatment planning, repository management and repository inquiry.]
Most interactions in Figure 2 are self-explanatory. To validate a treatment, the user requires the system to check the dose distribution he has calculated (for instance, with a local tool) against the dose distribution associated with the same treatment resulting from a more accurate method (Monte Carlo at the current stage). If the radiotherapist provides the system with a dose map, he obtains the gamma maps [13] (which account for local dose differences). In any case, he submits a treatment file (typically DICOM RTplan, RTP CONNECT or MLC), and he is allowed to visualize all input information before proceeding (to avoid possible mistakes in file identification). Regarding treatment optimization, at this moment we are considering CRT enhancement and IMRT planning via integer programming and global optimization algorithms, which are completely transparent to the radiotherapist. However, the platform has been designed so that any optimization algorithm can be implemented in the future. Treatment optimization returns dose data, as well as dose and gamma maps, so the radiotherapist can take decisions from the results. Due to the length of the interactions, the system has session support, to let radiotherapists leave the system and track the progress of their processes. They may enter again at any time to retrieve the results or initiate new interactions. The system server hosts the web server for user access. As shown in Figure 3, the web server calls web service SOAP clients, according to the requests that take place during authenticated web sessions. These SOAP clients do not create an internal session to access the web services, but instead send authenticated messages (using WS-Security) to the corresponding services. After performing the requested operations, each service returns its results. We employ Cocoon [14] to combine the web server with the SOAP client, so that the user may access services directly with XML. Besides, it is straightforward to convert XML responses from web services to HTML and other formats for the end user to visualize them.
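The gamma maps mentioned above follow the evaluation technique of Low et al. [13]. As a minimal illustration (a 1D simplification with hypothetical tolerance values; clinical implementations work on 2D or 3D grids), the index combines a dose-difference criterion with a distance-to-agreement criterion:

    import numpy as np

    def gamma_1d(dose_ref, dose_eval, dx, dose_tol=0.03, dist_tol=3.0):
        # Simplified 1D gamma index: for every reference point, search all
        # evaluated points for the best combined dose-difference /
        # distance-to-agreement score; gamma <= 1 means the point passes.
        d_max = dose_ref.max()
        x = np.arange(len(dose_ref)) * dx           # positions in mm
        gamma = np.empty(len(dose_ref))
        for i, (xi, di) in enumerate(zip(x, dose_ref)):
            dist2 = ((x - xi) / dist_tol) ** 2
            dd2 = ((dose_eval - di) / (dose_tol * d_max)) ** 2
            gamma[i] = np.sqrt((dist2 + dd2).min())
        return gamma

    ref = np.exp(-np.linspace(-3, 3, 61) ** 2)      # synthetic dose profile
    ev = 1.02 * ref                                 # 2% uniform overdose
    print("pass rate:", (gamma_1d(ref, ev, dx=1.0) <= 1.0).mean())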
[Figure 3. Relationships between the elements participating in web service invocation: the eIMRT web client calls the web server, whose SOAP client (like any other WS client) invokes the web services.]
Although it is feasible to place the web server and the web services on the same machine, it is likely that, once published, the web services will be reachable from other machines (with their own SOAP clients). Thus, authentication is mandatory.
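The per-message authentication just described can be illustrated with a short client sketch. The endpoint URL and operation name below are hypothetical (the eIMRT WSDL is not published in this paper); the pattern, though, matches the text: a stateless SOAP call that carries WS-Security credentials in every message, here written with the third-party Python library zeep:

    from zeep import Client
    from zeep.wsse.username import UsernameToken

    # Hypothetical endpoint and operation; WS-Security credentials travel
    # with each message, so no server-side session is needed.
    client = Client("https://eimrt.example.org/services/validation?wsdl",
                    wsse=UsernameToken("physicist01", "secret"))
    with open("plan.dcm", "rb") as f:               # hypothetical treatment file
        job_id = client.service.SubmitValidation(treatment=f.read())
    print("validation job:", job_id)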
3. Conclusion
The eIMRT decoupled architecture is a cost-effective solution to speed up the CPU-intensive processes in advanced radiotherapy planning: accelerator characterization, treatment validation and treatment optimization. As a bonus, it hides implementation details, freeing radiotherapists from acquiring non-essential technical knowledge. Last but not least, it lowers maintenance costs, since eIMRT clients run in standard browsers on low-end machines and operating systems. At the present stage, the server side will run entirely at the CESGA supercomputing facilities, but the system is ready to become fully distributed across heterogeneous networks, since it relies on Grid technology.
Acknowledgment
This research has been funded by the PGDIT05SIN00101CT grant (Xunta de Galicia, Spain) and partially by FSE.
References
[1] T.R. Mackie, J.W. Scrimger and J.J. Batista, "A convolution method of calculating dose for 15-MV x rays", Med. Phys. 12, 188-196 (1985).
[2] G.S. Mageras and R. Mohan, "Application of fast simulated annealing to optimization of conformal radiation treatments", Med. Phys. 20, 639-647 (1993).
[3] S. Webb, "Optimisation of conformal radiotherapy dose distributions by simulated annealing", Phys. Med. Biol. 34, No. 10, 1349-1370 (1989).
[4] I.I. Rosen, R.G. Lane, S.M. Morrill and J.A. Belli, "Treatment plan optimization using linear programming", Med. Phys. 18, 141-152 (1991).
[5] M.C. Ferris, R.R. Meyer and W. D'Souza, "Radiation Treatment Planning: Mixed Integer Programming Formulations and Approaches", Optimization Technical Report, Computer Sciences Department, University of Wisconsin-Madison, 2002.
[6] P. Andreo, "Monte Carlo techniques in medical radiation physics", Phys. Med. Biol. 36, No. 7, 861-920 (1991).
[7] F.J. González-Castaño et al., "A Java-CORBA Virtual Machine Architecture for Remote Execution of Optimization Solvers in Heterogeneous Networks", Software Practice & Experience 31 (2001), pp. 116.
[8] G. López-Abente Ortega et al., "La situación del cáncer en España", Ministerio de Sanidad y Consumo, 2005. ISBN 84-7670-673-1.
[9] A. Iglesias Lago, Planificadores 3D y simulación virtual del tratamiento. Situación en España. Supervivencia asociada a su aplicación. Santiago de Compostela: Servicio Galego de Saúde, Axencia de Avaliación de Tecnologías Sanitarias de Galicia, avalia-t, 2003. Serie Avaliación de tecnoloxías. Investigación avaliativa: IA2003/01.
[10] Catálogo Nacional de Hospitales 2005, Ministerio de Sanidad y Consumo. http://www.msc.es/ciudadanos/prestaciones/centrosServiciosSNS/hospitales/home.htm
[11] Web Services, http://www.w3.org/2002/ws/
[12] gLite, http://glite.web.cern.ch/glite/
[13] D.A. Low, W.B. Harms, S. Mutic and J.A. Purdy, "A technique for the quantitative evaluation of dose distributions", Med. Phys. 25, 656-661 (1998).
[14] The Apache Cocoon Project, http://cocoon.apache.org/
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Designing for e-Health: Recurring Scenarios in Developing Grid-based Medical Imaging Systems
John Geddes a, Clare Mackay a, Sharon Lloyd b, Andrew Simpson b, David Power b, Douglas Russell b, Marina Jirotka b, Mila Katzarova b, Martin Rossor c, Nick Fox c, Jonathon Fletcher c, Derek Hill d, Kate McLeish d, Yu Chen d, Joseph V Hajnal e, Stephen Lawrie f, Dominic Job f, Andrew McIntosh f, Joanna Wardlaw g, Peter Sandercock g, Jeb Palmer g, Dave Perry g, Rob Procter h, Jenny Ure h,1, Mark Hartswood h, Roger Slack h, Alex Voss h, Kate Ho h, Philip Bath i, Wim Clarke i, Graham Watson i
a Department of Psychiatry, University of Oxford; b Computing Laboratory, University of Oxford; c Institute of Neurology, University College London; d Centre for Medical Image Computing (MedIC), University College London; e Imaging Sciences Department, Imperial College London; f Department of Psychiatry, University of Edinburgh; g Department of Clinical Neurosciences, University of Edinburgh; h School of Informatics, University of Edinburgh; i Institute of Neuroscience, University of Nottingham
1 Corresponding Author: Jenny Ure, School of Informatics, University of Edinburgh, Jenny.Ure@ed.ac.uk
Abstract. The paper draws on a number of Grid projects, particularly on the experience of NeuroGrid, a UK project in the neurosciences tasked with developing a Grid-based collaborative research environment to support the sharing of digital images and patient data across multiple distributed sites. It outlines recurrent socio-technical issues, highlighting the challenges of scaling up technological networks in advance of the regulatory networks which normally govern their use in practice.
Keywords. E-Health, medical imaging, neuroscience, problem scenarios in distributed data-sharing, socio-technical system design
1. Introduction
There is an increasing drive within the UK to integrate healthcare data and services. The vision of 'joined-up' healthcare envisages services being delivered to patients through flexible – and perhaps virtual – organisational structures formed around networks of healthcare professionals working within, and across, multiple service units and administrative domains. Similarly, translational medical research focuses on reducing the turn-around time in the cycle that leads from the identification of possible causes of illness (for example, particular genetic and/or environmental factors) to the
investigation of disease mechanisms and the development of treatments, through to clinical trials and practice [1]. The realisation of this agenda is constrained by a range of recurrent issues and problem scenarios that have been given priority as one of the e-Science Grand Challenges. We discuss some of these issues in relation to the development of two Grid projects using digital imaging and patient data across distributed sites.
1.1. Health Services
Healthcare services and research infrastructure in the UK and Europe are in a process of transition. The vision of translational and evidence-based medicine depends on a seamless infrastructure leading from lab-based research results to clinical applications. The reality, however, is a patchwork of disjoint technical, professional and administrative architectures, a diversity of criteria and clinical protocols for data acquisition, a range of coding standards and differing guidelines for clinical practice and trial management. Furthermore, e-Health initiatives to streamline services, such as electronic patient records, are already generating debate over issues associated with the cost, benefits, quality and dependability of these services, the potential implications for patient confidentiality, and the potential risks in clinical applications. The collective consequence of these factors is that even modest levels of system and information integration have proved difficult to achieve in practice in healthcare [2], [3].
2. NeuroGrid
NeuroGrid (www.neurogrid.ac.uk) is a three-year, £2.1M project funded through the UK Medical Research Council to develop a Grid-based collaborative research environment for imaging in large-scale studies of neuropsychiatric disorders in the UK. It will be developed around three component clinical exemplars in stroke, dementia and psychosis, and complex services for quantitative and qualitative image analysis. This project, which started in March 2005, has a project team distributed across 11 sites in the UK, bringing together the work of clinicians, clinical researchers and e-scientists at Oxford, Edinburgh, Nottingham and London, using a Grid-based architecture to address the different needs of each node.
2.1. Objectives
The project aims to enable rapid, reliable and secure data sharing through interoperable databases, with access control management and authentication mechanisms. It will also provide a toolset to facilitate image registration and analysis, normalization, anonymisation, real-time acquisition and error trapping, to improve the reliability of diagnosis, to compensate for scanner differences and to allow quality and consistency checks before the patient leaves the imaging suite. The exemplar teams will use the infrastructure to address issues specific to their own domain of interest, as well as generic issues in the design of distributed Grid-based systems and the aggregation and use of data in multi-site clinical trials.
The requirements of the three clinical exemplar groups (stroke, psychosis and dementia) exploit the potential of the Grid in very different ways – from the creation of enhanced datasets for rare conditions, to the use of Grid-enabled tools for image acquisition, archiving and analysis, and to the analysis of variance in the technical and human processes associated with data collection, curation, processing and uploading. This provides a range of opportunities for evaluating the potential of Grid-based applications in the neurosciences, as well as more generally in e-Health and e-Science [4]. Many of the issues addressed in the paper have been mirrored in other UK HealthGrid projects, most notably eDiaMoND (www.ediamond.ox.ac.uk) [5, 6, 7, 8], which was a flagship pilot UK e-Science project on medical imaging in the context of breast screening, funded through EPSRC/DTI and IBM SUR grants to build a grid-enabled, federated database of annotated, digitised mammograms and patient information, intended to aid research into, and the detection and treatment of, breast cancer.
3. Recurring Scenarios
We will discuss a range of issues which appear to be significant hurdles for the vision of e-Health, including translational research and large-scale clinical trials, using a number of prototype grid-enabled applications to exemplify recurrent problem scenarios. Many of the issues are arguably evident in other distributed networked systems in e-Business and e-Learning, for example, where the scaling up of technical architectures has not been matched by a corresponding alignment of the local coordination and governance structures in heterogeneous and distributed local communities. Although we will draw on other projects, the focus is on those issues which have been most prominent in NeuroGrid in the first year:
• Aggregating data collected in different ways, for different purposes, from very diverse and distributed contexts
• Representing this data in ways which are meaningful and useful to communities with very different aims and frames of reference
• Managing clinical trials and associated ethical permissions and protocols across multiple communities, and for multiple purposes
• Aligning local aims and requirements with collective ones
• Aligning technical and human networks to advantage
• Integrating the technical work of system building with the socio-political work of generating collective structures and agreements for the governance of the new risks and opportunities generated
4. Issues in Grid-based Medical Imaging
Radiological imaging in large-scale clinical trials promises substantial benefits in the diagnosis and assessment of specific treatment effects on pathological processes. The Grid offers a mechanism for further extending the size of the datasets available for analysis, and for enhancing the speed and quality of the analysis that can be performed on
them. Researchers use innovative imaging techniques to detect features that can refine a diagnosis, classify cases, track normal or often subtle patho-physiological changes over time and improve understanding of the structural correlates of clinical features. Some of the variance is attributable to a complex variety of procedures involved in image acquisition, transfer and storage, and it is crucial, but difficult, for true disease-related effects to be separated from those which are artifacts of the process. There are two basic approaches to the extraction of detailed information from imaging data, each invoking a different set of challenges:
• Automated and computationally intensive image analysis algorithms for quantification and localization of signal differences have particular value in longitudinal imaging studies of change over time. This is particularly useful in identifying changes associated with the onset of psychosis, dementia or Alzheimer's disease, but particularly challenging in the harmonization of technical processing – as in the use of different scanners, for example.
• Assessment by healthcare professionals, as in large randomised controlled trials or observational studies, uses imaging to distinguish between different underlying causes (e.g., stroke or psychosis can both be associated with similar behavioural presentation), to assess severity, progression or response to treatment, and may require collection, storage and dissemination of data from hundreds of centres. A particular challenge here is the intra- and inter-site variance across raters.
Imaging research has traditionally been carried out in small studies in single research centres, where much of the knowledge about provenance, reliability and use is grounded in shared local knowledge, aims and contexts. Researchers and clinicians share an intimate understanding of the potential risks of combining local datasets for clinical purposes, based on a knowledge of the protocols and processes that could have contributed to the outcomes – which scanner, which control group, which protocol, etc. Scaling up technical systems has, in practice, been easier than scaling up the socio-technical and socio-political processes governing the collection, analysis, representation and use of data outwith its context of origin [9]. As with the introduction of networked technology in education, new possibilities and new responsibilities associated with governance and use in practice have led to a reconsideration of the nature of the processes and purposes of e-Health and e-Science systems, and of the roles and responsibilities of the stake-holding entities within them [10]. The realization of a sustainable and reliable system will depend on bridging the gap between the vision of seamless integration and the more disjoint reality on the ground highlighted in the recent HealthGrid White Paper [11].
5. Data Quality Issues
The large-scale aggregation of diverse datasets offers both potential benefits and risks, particularly if the outputs are to be used with patients in a clinical context. Thus aggregating data is a key issue for e-Health, yet data is not independent of the context in which it is generated. Within small communities of practice, a degree of shared and updated knowledge and experience allows judicious use of resources whose provenance is known and whose weaknesses are often already transparent. The same is
not true of aggregated data from multiple sources, where the process of deriving and coding may vary in both explicit and less obvious ways, even within communities of practice. One approach is to make early use of prototypes to provide a 'sandpit' for promoting both technical and inter-community dialogue and engagement, and to start the process of identifying, sharing and updating knowledge of emerging issues. The approach in NeuroGrid has been to focus early on trials with known datasets, to generate an awareness of the types of variance that can arise and of the ways in which it might be minimized, harmonized, or made transparent to users, given the ethical implications of use in the clinical domain. This will include technical differences between scanners, differences in the use of protocols or in data input, differences in the rating of images where these are not automated, and differences in the administration of psychiatric tests such as the PANSS (the Positive and Negative Syndrome Scale, a 30-item assessment of positive and negative psychiatric symptoms).
5.1. Data Collection: the Dementia and Stroke Exemplars
Multi-site clinical trials add additional complexity with the need to coordinate such issues as naming conventions for files, patient clinical trial ID management and acquisition parameters. The NeuroGrid dementia exemplar group involves researchers from the Institute of Neurology in London, and from University College London, who aim to use the Grid infrastructure to collect a new dataset and to develop methods of measuring image quality whilst the patient is still in the scanner, such that adjustments can be made in real time while the patient is still available, thus cutting the cost of, and delay in, re-scanning and optimizing the reliability of the dataset. The data being collected include baseline demographic data (age and gender, but no identifiable data), digital scans for each of the time-steps, and outcome information about these cases associated with each time-step. Data curation involves documenting the acquisition, processing, archiving, retrieval, aggregation and use of this medical data. Working across sites and databases has highlighted how differences and mismatches occur, for example in matching patient data to images or in labeling sequences of scans, and in some cases where staff fill in forms incorrectly. While there are regulatory requirements for good clinical practice and elaborations on these (e.g., the EU Clinical Trials Directive 2001/20/EC), the ways in which a particular trial can tackle these problems remain to be worked out, and there is considerable uncertainty as to how regulatory requirements can be effective when translated into practice. Aggregating multiple datasets in the e-Health context thus has implications both for accuracy and for clinical diagnosis and treatment. In practice, a number of small-scale responses are beginning to emerge in different nodes:
• One group has adopted the use of tablet PCs for clinical staff to input data, using a wireless link to the relevant database, so that mismatches can be rectified at the point of input, using the functionality of Microsoft InfoPath to highlight mismatches when cross-referenced against the database.
• The Stroke exemplar group in Edinburgh and Nottingham are developing error-management software that uses multiple measures to triangulate on patient data and query mismatches between images and patient records from the multiple acquisition sites.
• The Psychosis exemplar group have generated harmonisation software for differences between scanners, and Grid-enabling these algorithms will allow it to be shared across sites. Studies will also be done on the interpretation and use of clinical tests, to identify a measure of the variance that can be expected as a result of differences within and between clinicians in the diagnosis of psychosis.
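The error-trapping described in the second bullet can be illustrated with a small sketch. The registry fields, file names and checks below are hypothetical (the NeuroGrid software itself is not published here); cross-checking DICOM headers against trial records is simply one plausible realisation of the idea:

    import pydicom  # third-party DICOM parser, assumed available

    # Hypothetical trial registry: trial ID -> expected acquisition parameters.
    REGISTRY = {"STROKE-0042": {"Modality": "CT", "SliceThickness": 5.0}}

    def check_scan(path, trial_id):
        # Flag mismatches between a scan's DICOM header and the registry,
        # in the spirit of the error-management software described above.
        ds = pydicom.dcmread(path)
        problems = []
        if ds.PatientID != trial_id:
            problems.append("PatientID %r does not match trial ID %r"
                            % (ds.PatientID, trial_id))
        for tag, expected in REGISTRY[trial_id].items():
            actual = getattr(ds, tag, None)
            if actual != expected:
                problems.append("%s: got %r, expected %r"
                                % (tag, actual, expected))
        return problems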
Part of the benefit of an early prototype is the opportunity to run trials to identify the parameters of variation across sites under different constraints and conditions, and to use test datasets to evaluate the quality, validity and reliability of aggregated datasets. The ability to distinguish clinically significant differences in images from those that are artefactual is critical for the success of NeuroGrid, and early testing of the prototype will allow early engagement with this issue. The psychosis exemplar group in Edinburgh and Oxford will be testing software for harmonization across scanners at different sites. The dementia exemplar group will be evaluating the quality and speed of processing using a Grid-enabled toolkit, and the stroke exemplar group in Edinburgh and Nottingham will utilise existing datasets to test Grid-enabled software for use with images from CT and MR scans, as well as gathering measures of variance in the rating of CT scans within and between sites. As problems have arisen, it has become increasingly clear that many are common to other e-Health and e-Science projects, and a range of emerging solutions and practical workarounds is being shared through an informal brokerage between active players within the Grid community.
5.2. Variance across nodes
As indicated earlier, 'joined-up' systems face a range of known and unknown or unanticipated sources of variance in technical equipment, data acquisition, processing and curation, and also in the human rating of images and of patient symptoms. Within NeuroGrid, the psychosis exemplar provided numerous opportunities to observe what Duguid and Brown [12] have called the 'social life of information', but it has also provided interesting insights into the ways in which the technical and the human contributions to variance in data often become evident only in discussion of known datasets in real trials.
5.3. The role of informal dialogue
The use of ethnographic studies of clinical research practice has been a key part of our approach to understanding NeuroGrid requirements. Our findings suggest that awareness of data quality issues often only comes as a result of real community interaction within a co-located community [13]. This is hard to emulate in transient virtual organizations of the kind envisaged in e-Health and e-Science, yet incidental observations suggest that many key observations on data quality were dependent on informal exchanges between different experts, where knowledge from different domains came into play in relation to a specific problem. Informal conversations between researchers in one example generated an awareness that the same protocol on the same image set had resulted in different
outcomes. Further discussions narrowed this down to differences in the interpretation of a protocol, where tracing inside or outside of a line resulted in volume differences. Discussion of a known dataset, in a known context, appears to help foreground anomalies and improve data quality in ways that are hard to scale up. It also became evident from similar face-to-face discussions, again focusing multiple specialists on a shared problem, that aggregating data from sites with different demographic profiles was another source of variance, since brain shape is known to vary across ethnic groups, adding another dimension of variation within aggregated scan sets.
5.4. Involving stake-holding users
There is a push to improve data quality throughout the UK National Health Service (NHS) and, specifically, to improve the quality of data for auditing. Auditors routinely access various sources of data, then combine and triangulate them to improve the quality of the data that they extract. Similarly, researchers make use of data extraction forms designed specifically to capture the data needed for epidemiological studies, and research nurses exercise considerable skill in ensuring that the data they gather is fit for the intended purposes. On the scale of aggregation entailed in Grid-based systems, there is arguably a need for a wider awareness of the issues in aggregating from multiple sources, and an emphasis on strategies that can be adopted at different stages in acquisition, mining and use, so as to safeguard quality and reliability for use in clinical contexts by frontline staff. One interesting development in this regard is the potential for more active engagement of patients themselves as stakeholders in the use and updating of their medical records [14]. The leverage of end-user communities as stakeholders in maintaining the accuracy or currency of the process is associated with real benefits and cost savings in e-Business [15] and may have some application in the context of medical informatics. In terms of system design, the work of Reddy et al. [9] and of Dourish and Bellotti [16] suggests that clinical staff using e-Health systems can make sense of, and coordinate, work better if the system affords some degree of transparency about the activities of other users, and provides a context for coordinating information and planning across a distributed group.
5.5. Making sense of distributed data
The potential volume of data that can be aggregated via HealthGrids not only has implications for curation and quality but also for its interpretation by both humans and machines. Nonaka [17] highlights the importance of the early articulation of shared frames of reference and situated contexts for envisaging and structuring the process collectively, by providing real or virtual opportunities for dialogue and exchange. In the more distributed context of the Grid, linking social and technical networks on an exceptionally large scale, there is increasing interest in the use of metadata and ontologies to formalise some elements of these shared frames of reference in human- and machine-readable form [18, 19]. Part of the motivation for this is that it affords automation of resource discovery and analysis, but the question remains as to whether formal descriptions can be sufficiently rich and expressive to model relationships between data providers and users.
In this there is a trade-off between, on the one hand, the benefits of share-ability and knowledge discovery across multiple datasets and, on the other, the setting in stone of concepts and relationships which are constantly evolving. Our ability to anticipate the sorts of uses which might be made of data in the future, or other
ontologies with which they may be related, is time-limited. As in many other contexts, there is a trade-off between speed, accuracy, validity and usability for particular purposes. As with the aggregation of multiple datasets discussed earlier, there are also aggregations of artefactual differences whose implications may be invisible to the user but represent a potential risk in clinical use. As diverse medical datasets come online in related domains and at different scales, the alignment of ontologies becomes a challenge. In the context of neuroscience, for example, there are datasets at different levels of granularity as well as in different modalities. The work of Sporns [20] highlights the extent to which imaging can be done at very different levels of granularity, and that the value of much of the research now ongoing will be in the integration of cross-referenced data that can elucidate the structure and dynamics of the brain at very different levels of granularity, such as:
• MR images of structural changes in the brain using CT, PET or SPECT scans
• diffusion tensor imaging studies on the micro-structural development of white matter in the brain underpinning activation patterns detected in MR imaging
• genetic datasets associated with susceptibility to these disorders
The Human Brain Project [21] addressed this issue early on in the context of collaboration with multiple groups, generating a reference ontology based on a Foundational Model of Anatomy (FMA) which allows diverse datasets, at different levels of granularity, to be aligned in a meaningful context for different purposes.
5.6. Aligning Competing Requirements
Many of the most intractable issues in integrated systems reflect the locally grounded nature of coordination and governance structures. Ethics and security requirements were among the most recurrent issues encountered in NeuroGrid and eDiaMoND, and they are one of a wide range of areas where there has been a tension between the requirements of distributed local groups.
5.6.1. Security vs access requirements
Common to all NeuroGrid exemplars is the need to determine secure and effective ways to aggregate and manage clinical trials data. The data take the form of medical images and coded or descriptive information from patients who have consented to take part in trials and whose records contain material that is often highly sensitive in nature. The retrieval of and access to these data require new architectures to support the secure sharing of the data, both records and medical images. In the case of NeuroGrid, this also includes issues of anonymisation of faces in brain scans, given the potential, in some formats, for reconstruction of facial volumes. Within NeuroGrid, the exemplar groups need to run algorithms on other datasets that they do not own, and retrieve the results of this analysis; however, they do not receive the original data. In the case of scans of patients at risk of early-onset psychosis, direct access to the images is regarded as too sensitive, and the agreed solution is to provide parametric statistical mappings of the original image data on which algorithms can be run, rather than the original. This adds some complexity to the workflows and the design as a whole, but it aligns the competing requirements of the different stake-holding groups in a way which could be replicated to resolve this issue
elsewhere. Given the long-term aims of translational medicine as a sustainable enterprise and the participation of commercial partners in clinical trials, both the architecture and the perception of security in Grid systems remain a critical issue [22].
5.6.2. Ethical Requirements
Issues such as ethical consent, IPR, and the development and implementation of shared protocols and administrative processes challenge the local structures and the in situ realization of coordination in distributed communities. Scaling up these less tangible architectures is a design issue of a more socio-political nature, which has implications for whether and how the e-Science vision can be implemented. While distributed, networked projects increasingly acknowledge the impact of human factors, the extent to which they can impede project realisation, and the extent to which project work revolves around them, is often under-estimated at the outset. By way of example, a recurrent barrier to the vision of e-Health is the difficulty of achieving agreement on ethical consent for the use and/or re-use of patient data: neither NeuroGrid nor eDiaMoND is an exception to this. The eDiaMoND project was required to demonstrate the use of a grid-enabled digital mammography system. To prove the concept, it was necessary to consider the use of real data in real breast screening units, hospitals and research environments. This entailed managing an intricate arrangement of policies governing the use of patient data (e.g., research ethics review). In addition to delays, constraints and complications, data generated from research and re-used for subsequent clinical work does not have clear ownership. In addition, there are often constraints on linkage between research and clinical infrastructures, including links between healthcare service and university networks [7]. The vision of translational research is to quicken the process between bench science and the delivery of healthcare to patients. In practice, however, transient virtual collaborations of the kind envisaged in e-Science lack either the formal infrastructure of contractual agreements evident in business supply chains, or the established norms and agreements that are generated in well-established communities of practice. It may be that technical infrastructures scale up more easily than the socio-political and administrative infrastructures of the communities in which they must be embedded and used.
5.6.3. Aligning technical vs user criteria and requirements
Aligning requirements between distributed exemplar groups within a Grid project is one challenge; it is also the case, however, that the stakeholders have competing aims and criteria. As the scale and scope of systems in the extended enterprise has grown, the difficulties of aligning aims and understanding across interdependent communities have become more critical, the interdependence of social and technical knowledge has become more apparent, and the tension between local and global requirements has become more problematic. A recent overview of system design in business contexts [6] suggests that technologists' criteria for success are early closure on requirements and adherence to time and cost constraints, with a robust design, while business managers' criteria were, conversely, in favour of an evolving process that met a range of changing needs in a flexible way, with little concern about the cost, timescale or design issues from a technological point of view. It is easy to see in this context how outcomes
satisfactory to one team might not meet the criteria of other stake-holding groups. This pattern was also evident in the eDiaMoND project, where the very different criteria and aims of the technical and user communities significantly shaped the way this played out. The approach of the NeuroGrid team is to foster, where possible, a collaborative and participatory approach to design [23, 24], based on evolution from a very early prototype around which system design could evolve in stages, starting from the basic need to share images, which is core to all the exemplar groups.
6. Conclusions
We have discussed a range of scenarios that can be found across the HealthGrid community, ranging from the issue of aggregating heterogeneous, distributed datasets to the issues of scaling up local processes, protocols, and coordination and consent structures. The most intractable of these have their roots in the coupled, socio-technical nature of infrastructural systems, and in the difficulties inherent in scaling up information and communication networks in the absence of a corresponding architecture for coordination at a social, organisational, professional and political level.
6.1. Working up socio-technical arrangements
The concept of the collaboratory is central to the e-Science vision, yet there has been limited concern with the generation of the community and coordination infrastructures which will coordinate and sustain it. The experience of virtual business organisations in the context of the business supply chain suggests that the explicit management of the socio-technical whole is central to the success (or the failure) of collaboration. The e-Health vision – particularly in relation to translational medicine – embodies much of the supply chain concept and appears to be facing some of the same socio-technical challenges [8], [1]. NeuroGrid, like eDiaMoND, brings together disparate groups of clinicians, technologists and researchers with no prior working experience of large-scale collaborative research or with the other project members. The technical work of system building is paralleled by the need to facilitate the generation of new structures and agreements for the governance of the new risks and opportunities generated when data is aggregated in this way. The creation of real and virtual 'shared spaces' [17] in NeuroGrid included an early prototype as a 'sandpit' for engagement in areas of shared professional concern, as a means of supporting this new hybrid community in developing its own rules of engagement and starting to make collective sense of local knowledge and requirements in relation to common project goals. Common challenges are coming into focus across the exemplar groups, and further collaboration will be encouraged through the use of workshops and special interest activity to resolve common issues in areas such as data quality, security, data ownership, confidentiality, IPR, ethics, and the management of clinical trials.
6.2. Dealing with Data Quality
In organic communities, the process of generating collaboration, coordination and control structures happens as a matter of course, played out in shared contexts where
aims, terms and frames of reference are already well established. NeuroGrid is employing a simple early prototype to generate engagement and dialogue between partners, enabling earlier discussion of requirements for more complex services, compute capability and workflows, as well as of data quality and configurational issues. In addition to ameliorating the recurring issue of requirements 'creep' late in the design process, it allows the disparate groups to discuss issues and possible actions in relation to a shared context. Given that, in reality, many Grid-based collaborations are transient, and often led by funding considerations rather than a clear consonance of aims across participating groups, system design and management will increasingly rely on the creation of coordinating infrastructures: social, legal, ethical and professional. The recurring nature of problem scenarios in HealthGrid projects suggests that community building strategies such as early prototyping will be increasingly central to the realisation of the e-Health vision, and that further research is needed both (a) to identify the ways in which some of this may be integrated into the process of co-designing such systems, and (b) to share strategies for designing technical information and communication systems more effectively around human ones [25]. Virtual organizations (VOs) such as these require a strategy for negotiating shared terms, processes, costs, risks and benefits, as well as for defining those which are to remain local and the nature of the alignment between the two [26]. Collaboration across communities of interest depends heavily on finding practical ways of ensuring early engagement and dialogue in areas of shared concern, so that the negotiation of diverse aims and requirements can inform the design process as early as possible [27].
7. Acknowledgements
The authors wish to thank the other members of the NeuroGrid consortium for their input into, and comments on, the work described in this paper, and in particular the scientific collaborators for their role in determining the requirements for this project. The authors also wish to acknowledge the support of the funders of the NeuroGrid project, the MRC (ref. no. GO600623, ID number 77729), and of the UK e-Science programme.
References
[1] UK Clinical Research Collaboration Report, www.ukcrc.org
[2] Hartswood, M., Procter, R., Rouncefield, M., Slack, R., Soutter, J. and Voss, A. (2003). 'Repairing' the Machine: A Case Study of Evaluating Computer-Aided Tools in Breast Screening. In Kuutti et al. (Eds.), Proceedings of the Eighth European Conference on Computer-Supported Cooperative Work, pp. 375-394.
[3] Ellingsen, G. and Monteiro, E. (2001). A Patchwork Planet: The Heterogeneity of Electronic Patient Record Systems in Hospitals. In Proceedings of the Information Systems Research Seminar in Scandinavia, Sweden, August.
[4] Geddes, J., et al. (2005). NeuroGrid: Using Grid Technology to Advance Neuroscience. 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05), pp. 570-572.
[5] Brady, M., Gavaghan, D., Simpson, A., Highnam, R. and Mulet, M. (2002). eDiaMoND: A Grid-Enabled Federated Database of Annotated Mammograms. Chapter 39 of Grid Computing: Making the Global Infrastructure a Reality, Wiley.
[6] Lloyd, A., Ure, J., Cranmore, A., Dewar, R. and Pooley, R. (2002). Designing Enterprise Systems. In Jardim-Goncalves, R., Roy, R. and Steiger Garcao, A. (Eds.), Advances in Concurrent Engineering. Swets and Zeitlinger, Lisse.
[7] Hartswood, M., Jirotka, M., Procter, R., Slack, R., Voss, A. and Lloyd, S. (2005). Working IT Out in e-Science: Experiences of Requirements Capture in a HealthGrid Project. In Solomonides, T., McClatchey, R., Breton, V., Legré, Y. and Nørager, S. (Eds.), From Grid to Healthgrid. IOS Press. ISSN 0926-9630.
[8] Jirotka, M., Procter, R., Hartswood, M., Slack, R., Coopmans, C., Hinds, C. and Voss, A. (2005). Collaboration and Trust in Healthcare Innovation: The eDiaMoND Case Study. Journal of Computer-Supported Cooperative Work, 14(4), pp. 369-389.
[9] Reddy, M., Pratt, W., Dourish, P. and Shabot, M.M. (2003). Sociotechnical Requirements Analysis in Clinical Systems. Methods Inf Med, 42, pp. 437-444.
[10] Buetow, K.H. (2005). Cyberinfrastructure: Empowering a 'Third Way' in Biomedical Research. Science, 308 (6 May), pp. 821-824.
[11] Breton, V., Dean, K. and Solomonides, T. (2005). The HealthGrid White Paper. In Solomonides, T., McClatchey, R., Breton, V., Legré, Y. and Nørager, S. (Eds.), From Grid to Healthgrid. IOS Press. ISSN 0926-9630.
[12] Duguid, P. and Brown, J.S. (2000). The Social Life of Information. Harvard Business School Press, Boston, MA.
[13] Hartswood, M., Procter, R., Rouncefield, M. and Slack, R. (2003). Making a Case in Medical Work: Implications for the Electronic Medical Record. Journal of Computer-Supported Cooperative Work, 12(3), pp. 241-266.
[14] Pagliari, C., Donnan, P., Morrison, J., Ricketts, I., Gregor, P. and Sullivan, F. (2005). Adoption and Perception of Electronic Clinical Communications in Scotland. Informatics in Primary Care, 13(2).
[15] Sawhney, M. and Parikh, D. (2001). Where Value Lives in a Networked World. Harvard Business Review, January, pp. 175-198.
[16] Dourish, P. and Bellotti, V. (1992). Awareness and Coordination in Shared Workspaces. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work, pp. 107-114.
[17] Nonaka, I. and Nishiguchi, T. (Eds.) (2001). Knowledge Emergence: Social, Technical and Evolutionary Dimensions of Knowledge Creation. Oxford University Press, Oxford.
[18] De Roure, D., Jennings, N.R. and Shadbolt, N.R. (2001). Research Agenda for the Semantic Grid: A Future e-Science Infrastructure. Technical Report for the e-Science Core Programme.
[19] Bechhofer, S.K., Rector, A.L. and Goble, C.A. (2003). Building Ontologies in DAML+OIL. Comparative and Functional Genomics, 4, pp. 133-141. John Wiley. ISSN 1531-6912.
[20] Sporns, O., Tononi, G. and Kötter, R. (2005). The Human Connectome: A Structural Description of the Human Brain. PLoS Comput Biol, 1(4): e42.
[21] Rosse, C., Kumar, A., Mejino Jr., J.L.V., Cook, D.L., Detwiler, L.T. and Smith, B. (2005). A Strategy for Improving and Integrating Biomedical Ontologies. In Proceedings of AMIA Symposium 2005, Washington, D.C., pp. 639-643.
[22] Breton, V., Dean, K. and Solomonides, T. (2005). The HealthGrid White Paper. In Solomonides, T., McClatchey, R., Breton, V., Legré, Y. and Nørager, S. (Eds.), From Grid to Healthgrid. IOS Press. ISSN 0926-9630.
[23] Buscher, M., Shapiro, D., Hartswood, M., Procter, R., Slack, R., Voss, A. and Mogensen, P. (2002). Promises, Premises and Risks: Sharing Responsibilities, Working Up Trust and Sustaining Commitment in Participatory Design Projects. In Proceedings of the Participatory Design Conference, Malmö, June, pp. 183-192.
[24] Hartswood, M., Procter, R., Rouchy, P., Rouncefield, M., Slack, R. and Voss, A. (2002). Co-realisation: Towards a Principled Synthesis of Ethnomethodology and Participatory Design. In Berg, M., Henriksen, D., Pors, J. and Winthereik, B. (Eds.), Special Issue on Challenging Practice: Reflections on the Appropriateness of Fieldwork as Research Method in Information Systems Research, Scandinavian Journal of Information Systems, 14(2), pp. 9-30.
[25] Bijker, W.E., Hughes, T.P. and Pinch, T.F. (1989). The Social Construction of Technological Systems: New Directions. In Bijker, W. and Law, J. (Eds.), Shaping Technology, Building Society: Studies in Sociotechnical Change. MIT Press, Cambridge, MA.
[26] Von Krogh, G., Nonaka, I. and Nishiguchi, T. (2000). Knowledge Creation. Macmillan, London.
[27] Wenger, E. and Snyder, W. (2002). Communities of Practice: The Organizational Frontier. Harvard Business Review, Jan/Feb, pp. 139-145.
Design and implementation of security in a data collection system for epidemiology
John Ainsworth1, Robert Harper, Ismael Juma, Iain Buchan
School of Medicine, The University of Manchester, UK
Abstract. Health informatics can benefit greatly from the e-Science approach, which is characterised by large-scale distributed resource sharing and collaboration. Ensuring the privacy and confidentiality of data has always been the first requirement of health informatics systems. The PsyGrid data collection system addresses both, providing secure distributed data collection for epidemiology. We have used Grid-computing approaches and technologies to address this problem. We describe the architecture and implementation of the security sub-system in detail.
Keywords: Security, data collection, authentication, authorization, epidemiology
Introduction
PsyGrid is an e-Science [1] project with the objective of using Grid-computing techniques and technologies [2] to develop a system that can be used for epidemiology. The process of epidemiological medical research has three phases: the establishment and characterisation of a large, representative cohort from a geographically distributed population; the integration of the cohort data with other data sources to provide additional characterisation; and the formulation of a hypothesis and generation of the corresponding predictions. For the establishment and characterisation of a cohort, many epidemiological studies use paper-based data collection systems. A computer-based data collection system that enables geographically distributed data collection would alleviate much of the labour, tedium and error inherent in paper-based data collection. Such a system is required to store personal, confidential medical data, and this data must be sent across a network from remote data entry clients to a data repository server. The storage of personal medical data requires an access control system that can enforce restrictions on who can access and operate on the data. The transfer of this data between two computer systems requires communications protocols that encrypt the data and are resistant to interception [3]. In this paper we report on the design and implementation of the security sub-system in the PsyGrid Data Collection System (DCS). We show how Grid-computing technologies and techniques can be applied to address the problem of security in a distributed data collection system, and report on the application of the system to psychiatry and its use to establish a cohort for first episode psychosis (FEP).
___________________________________________________
1 Mr John Ainsworth, ISBE, Stopford Building, University of Manchester, Manchester, M13 9PT, UK. john.ainsworth@manchester.ac.uk
1. The PsyGrid Project
E-Science is concerned with the application of information technologies to address scientific research questions. The aims of e-Science are not just to raise the scientist above boring, mundane and error-prone tasks, but also to enable new lines of scientific inquiry [4]. The e-Science approach usually involves one or more of the following: large-scale, geographically distributed collaboration; the pooling and sharing of resources; the composition of small reusable tasks to perform complex processing; and the automation of mundane repetitive tasks. For computer scientists, the challenge of e-Science is to create systems that not only perform these functions, but can also be easily used, re-used and re-applied to address a range of research questions from different domains, with minimal, ideally no, further intervention from the computer scientist. Existing systems for epidemiology [5] are often created in an ad hoc fashion, leading to systems that are single-use and isolated; consequently they are difficult to re-apply to other domains. Grid-computing tools and techniques have co-evolved with the e-Science paradigm. Grid-computing middleware provides the system developer with the tools necessary to develop systems where the problem to be solved requires secure, scalable, distributed sharing of resources. An e-Science system for hypothesis-driven epidemiology requires just such an architecture. Consider the following use case: to investigate the aetiology and outcomes of psychosis, a research project is established to record the psychological assessments of subjects at regular intervals from their first episode of psychosis (FEP). As this is medical data, confidentiality and privacy are paramount. Data is to be collected from eight different geographic areas in England and held in a data repository. Each of the eight areas corresponds to a hub of the Mental Health Research Network (MHRN); the hubs are autonomous and have ownership of the data they collect. The data could be stored locally in a repository that is owned and operated by a single hub; however, the full potential of this data is only realised when analysis is performed on the combined data set, covering all hubs, and it is integrated with other data sets that characterise the sample population in some other way, such as census data. The above is the primary use case of the PsyGrid project. The goals of the PsyGrid project are to develop systems and tools to facilitate hypothesis-driven epidemiology research and multi-centre clinical trials in psychiatry. In the first phase of the project we will develop a secure data collection system and apply it to the establishment of the FEP cohort. In the second phase we will integrate other sources of data, such as socioeconomic data and medical imaging. The next phase will see the provision of workflow tools for the automation of statistical analysis, which can be shared and re-used. All of the software developed by the PsyGrid project is open-source and will be freely available for use by any study requiring distributed data collection capability. PsyGrid is part of the UK e-Science programme, is funded by the Medical Research Council and the UK government Department of Health, and is supported by the Mental Health Research Network (MHRN).
2. Overview of the PsyGrid Data Collection System
The PsyGrid data collection system has four principal components, which are shown in Figure 1.
The first of these is a data repository capable of storing multiple
independent data sets. The data schema for a particular data set is user-definable, so that the repository can be re-used for other clinical research data collection projects. The second is the Data Collection Client Application (DCCA), which enables remote collection of data from multiple sites. The data collection client automatically generates its graphical interface from a data set's schema definition. The third is the security sub-system, described in detail in this paper, which performs authentication and access control. The definition of a data set also includes the definition of the security policy associated with it. This security policy defines the roles that PsyGrid users have for the data set and the actions that a particular role can perform on it. The fourth PsyGrid DCS component will be a data set creation and deployment tool that will guide end users through the process of creating and deploying the components required for the collection of a new data set. This will include an extensible schema editor for the repository, a security policy editor, and a deployment manager. These tools are presented to the user together as the Project Manager Application. The security policy editor will allow the end user to assign users to roles and roles to permissions. Once the data set is defined and the security policy created, the deployment manager will guide the user through the deployment process. PsyGrid, like many Grid-based systems, has a service-oriented architecture (SOA) [6], and users access the system through a rich client application.
3. Design Goals
Any distributed, computer-based medical system holding patient data must address four different aspects of security. Firstly, the computers that host the system must be safe from attack. As this is largely determined by the operating system and the way in which the system is administered, it has no impact on the design of the DCS security sub-system, so we do not discuss it further. Secondly, data communications between the machines that host the system must be protected from interception. Thirdly, a user must be identifiable and able to access only the parts of the system that they are entitled to. Finally, we must ensure that identity cannot be recovered from depersonalised data through statistical analysis. In this paper we are concerned with communications security and access control to the data collection system. We intend to report on statistical disclosure control in a future paper. To accommodate the widest possible range of usage scenarios, the data collection system needs to support the most stringent of security requirements. The challenge we set for the PsyGrid DCS was for it to be deployable on the public internet, as opposed to a closed private network, to be accessible from any point on the internet, and to be capable of storing patient-identifiable data.
3.1. Ease of use and operation
The primary requirement of the PsyGrid DCS is that it should be simple to use, operate and re-apply to many different epidemiological studies. Adoption of a system, regardless of features, will not occur if its usage requires technical knowledge that the target user community does not possess and is unwilling to acquire. The implications of this for the security sub-system are two-fold. From an end-user's perspective, security details must be hidden as much as possible: most clinicians are not interested in managing X.509 [8] certificates, for example. From a system administrator's
perspective, it must be easy to add users, assign them privileges, and modify policies. Most system administrators are not interested in editing XML. For each of these tasks the system should provide graphical tools that hide the underlying complexity.
3.2. Support for multiple projects on one system
The PsyGrid DCS has the requirement that a single deployment can support multiple independent data collection projects. Associated with each of these must be an access control policy, which defines the members of the user group and their access rights. In PsyGrid, project groups need to be isolated from each other; a user cannot see that a data set exists unless they are a member of the owning group. A user can, however, be a member of multiple groups. In effect, a single deployment of the system needs to be able to support multiple closed user groups and give each end user the appearance of exclusive use of the system for their projects.
3.3. Scaling and flexible deployment
For the data collection system to reach as wide a community of users as possible, it needs to be scalable through the addition of compute and storage resources. This requires the system architecture to be highly modularised, with loose coupling between the components, such that components can be flexibly deployed on the available hardware. Role-based access control schemes have the property that the access control policy remains unchanged when users are added to or deleted from the system, and so can help ease the administrative burden. It must also be possible to deploy the system with either a centralised data repository or multiple federated data repositories. In the federated scenario, multiple instances of the PsyGrid DCS, each operated by a different autonomous organisation with its own user directory and security policy, can be used collaboratively, such that a user from one DCS in the federation can access another DCS without repeated authentication.
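The property that makes role-based schemes administratively cheap can be seen in a minimal Java sketch; the class and role names below are illustrative assumptions, not PsyGrid's actual implementation. The policy refers only to roles, so adding or removing a user changes the assignment table but never the policy itself.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Minimal RBAC model: the policy maps roles to permissions and never
// mentions individual users, so adding or removing a user touches only
// the user-to-role assignment table, leaving the policy unchanged.
public class RbacSketch {
    // role -> set of permissions (e.g. "record:write"); the fixed policy
    static final Map<String, Set<String>> POLICY = Map.of(
        "ClinicalResearcher", Set.of("record:read", "record:write"),
        "DataManager",        Set.of("record:read", "schema:edit"));

    // user -> set of roles; the only table that changes as staff join or leave
    static final Map<String, Set<String>> ASSIGNMENTS = new HashMap<>();

    static boolean isPermitted(String user, String permission) {
        for (String role : ASSIGNMENTS.getOrDefault(user, Set.of())) {
            if (POLICY.getOrDefault(role, Set.of()).contains(permission)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ASSIGNMENTS.put("alice", Set.of("ClinicalResearcher")); // new user: no policy change
        System.out.println(isPermitted("alice", "record:write")); // true
        System.out.println(isPermitted("alice", "schema:edit"));  // false
    }
}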
4. Architecture
The architecture of the PsyGrid Data Collection System is shown in Figure 1. Each server-side component in the PsyGrid system has a well-defined interface and can be remote from any other component, as follows from the service-oriented architecture. Consequently, communication between any two components could be occurring over the public internet. To ensure confidentiality and privacy, all components in the PsyGrid system communicate over secure, encrypted communications links. We have chosen to use Transport Layer Security (TLS) [9] in conjunction with HTTP as the transport for our SOAP-based web services. We use TLS in mutual authentication mode to ensure that both server and client can be sure of the identity of the other party. We chose TLS over Message Level Security (MLS) [10]: TLS is widely available and mature, and TLS implementations appear to be much faster than Java-based WS-Security. Where LDAP is used for authentication, we use encrypted LDAPS to ensure that user identities and passwords are not sent in the clear.
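For illustration, the following sketch shows one way a mutually authenticated TLS connection can be configured with the standard Java JSSE API; the file names, passwords and host are hypothetical, and this is a generic example rather than the PsyGrid code itself.

import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.*;

// Mutual TLS with standard JSSE: the client presents its own certificate
// (from a key store) and verifies the server against a trust store, so
// both parties are authenticated before any application data flows.
public class MutualTlsClient {
    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray(); // hypothetical

        KeyStore clientKeys = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream("client-credential.p12")) {
            clientKeys.load(in, password);
        }
        KeyManagerFactory kmf =
            KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(clientKeys, password);

        KeyStore trusted = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream("trusted-cas.jks")) {
            trusted.load(in, password);
        }
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trusted);

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);

        // Any HTTPS transport (e.g. for SOAP calls) can use this socket factory;
        // the server side requests the client certificate to complete mutual auth.
        SSLSocketFactory factory = ctx.getSocketFactory();
        try (SSLSocket socket = (SSLSocket) factory.createSocket("repository.example.org", 443)) {
            socket.startHandshake(); // fails unless both identities verify
            System.out.println("Negotiated: " + socket.getSession().getCipherSuite());
        }
    }
}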
Based on our end-user community, we ruled out the possibility of using file-system-based PKI. It is unrealistic to expect clinical users to deal with the complexity of managing certificates and keys themselves, and it would deter use of the system, which was not our aim. This left the choice of using a user name and password for authentication, combined with a credential repository that can act as an online certificate authority (CA) to issue temporary PKI credentials, or using a hardware token approach. The latter, although it is the more secure solution, was rejected for the initial deployment of the data collection system because of the additional operational effort required. However, nothing in the architecture presented here precludes this approach as a future alternative, and it is our long-term aim to adopt it.
Figure 1. PsyGrid System Architecture
4.1. Role-based Access Control
PsyGrid employs Role-Based Access Control (RBAC) [11]. In RBAC, users are assigned privileges, and authorisation decisions are based upon possession of the required privilege. There are three components in RBAC. The first is a privilege manager, which maps a user to their privileges. The second is a policy decision point (PDP), which is used to control access to a resource. The third is a policy decision function (PDF), which is used to decide whether a user has sufficient privileges to access a resource. In PsyGrid, the Attribute Authority (AA) provides the privilege management function. It stores the list of projects that are active on the system. For each project it records the name, a unique identifier, a list of the sub-groups of the project, and the roles that users can take on in the project. It also maintains a registry of users and their privileges. A user's privileges are maintained on a per-project basis: for each project the user is a member of, the privileges granted to the user (role or group membership) are listed. The AA issues Security Assertion Markup Language (SAML)
[12] tokens, which bind a user's identity to their privileges in a project. The AA signs [13] these statements, which guarantees their authenticity; any entity that trusts the AA can accept its assertions about a user's privileges. The Policy Authority (PA) maintains the security policy. It stores multiple policies, such that each data collection project can have a unique policy. A policy consists of statements, and each statement has an action, a target and a rule. The rule is a Boolean logic expression composed of operators (AND, OR, NOT) and privileges, and may contain many sub-expressions. The action is the operation the user wishes to perform, and the target is the resource on which they want to perform it. The Access Enforcement Function (AEF) provides the policy decision function functionality. It is a client-side API for the PA, which can be invoked from any web service that protects a resource. The AEF requires the caller to supply the target and action, and either the user's identity or a signed SAML assertion that can be verified.
4.2. Presenting Identity vs. Asserting Attributes
Whether a user's attributes are pushed to an AEF by the requestor or pulled by the AEF from the attribute authority is a common design choice in RBAC systems. The following information must be present in any service invocation:
• The identity of the user performing the operation.
• The identity of the client invoking the operation, which may not match the identity of the user performing the operation when delegation is being used.
• The target to be operated on, if multiple targets are accessed through a single operation.
From this information, the user's roles in the group that owns the target can be retrieved from the attribute authority (pull). Alternatively, the user's roles in the target's group can be pushed along with the target and user identity information (push). In either mode, the information binding a user's identity to their privileges is transferred using a SAML assertion signed by the attribute authority. The PsyGrid DCS currently uses the push mode, though nothing in the architecture precludes the pull mode.
4.3. Source of Authority
The credential repository is used to issue short-lived user credentials. However, an alternative source of authority, the offline PsyGrid CA, is used to issue long-lived credentials. This gives us two levels of trust, determined by the authority that issued the end entity's credentials. Those in possession of a credential from the PsyGrid CA, typically servers, are able to invoke services on behalf of other users; in this case, the identity presented during authentication and the identity of the subject in the SAML assertion need not be the same. The basic interaction between the client, the data repository and the security system proceeds as follows:
1. The user launches the data collection client application and logs in with their user name and password. The DCCA passes the login information to the credential repository.
2. The credential repository tries to authenticate the user with the directory; if successful, it uses its online certificate authority to generate a temporary PKI credential for the user.
3. The DCCA contacts the attribute authority, which issues a signed SAML assertion containing the user's privileges for the data set.
4. The DCCA invokes an operation on a web service, passing the signed SAML assertion.
5. The web service calls the policy decision function, supplying the operation, target and SAML assertion.
6. Based on the user's privileges and the stored policy, the policy decision function will either grant or deny access.
7. If access is granted, the operation is performed by the web service and the results returned.
Because signed SAML assertions identify a user's roles and access control is role-based, federating multiple data collection systems only requires the policy decision function to accept SAML assertions signed by the other attribute authorities participating in the federation. In the current implementation, the PA is configured with a list of AAs that it trusts. This is not a scalable solution in the long term, should a federation grow beyond tens of systems.
4.4. Security Policy Management
One of the goals of PsyGrid is to enable its end users to manage projects and the associated security policies. This requires that security policy management is simple and intuitive. The basic requirement is to enable project managers to include users in a project and to allocate users roles within that project. Both the PA and the AA provide query and administration interfaces. Access control on the query interfaces is based solely on the X.509 certificate presented during SSL mutual authentication and an access control list. The access control list can contain the distinguished names of the users that are allowed to access the service, and also a list of the distinguished names of certificate issuers who are trusted, such that any user presenting a certificate issued by one of these authorities is granted access. Access rights are further sub-divided into three classes: administration, proxy and query. The administration class allows the user to invoke any operation on the administration interface and to operate on any project. This class is reserved for the system administrator, and should only be used during initial system configuration. Once the system has the minimal configuration installed, the PA and the AA themselves should be used for access control on their administrative interfaces. The proxy and query classes can only grant access to the query interface. Users in the proxy class may make queries over all the data in the PA and the AA. They may also request a signed SAML assertion for any user, hence the terminology for this class. Users in the query class may only query data about themselves and their privileges, and request SAML assertions about their own privileges.
4.5. Boot-strapping system security policy
When the system is installed, the PA's and the AA's access control lists must be edited to include the DN of one user who will have administration privileges. The first task of the system administrator is to create the System project in the AA and the System policy in the PA. Two roles are defined: SystemAdministrator and ProjectManager. The System policy grants full access to the resources protected
by the policy. The resources protected by the System policy are in fact the policies and privileges themselves, together with the ability to create a new project and policy. The ProjectManager role is allowed to create new projects. The initial configuration of the AA requires users to be added to the System project and assigned the appropriate role. Once this is in place, authorisation for editing the AA and PA can be determined by the AA and PA themselves, and so users can begin to self-manage the system.
4.6. Creating a new project
When a ProjectManager creates a new project, they are automatically added to that project with the role ProjectAdministrator. The corresponding policy is installed in the PA, automatically giving the ProjectAdministrator full rights over the project. From this point they can add users, assign them privileges and define the details of the security policy, using the Project Management client application provided by PsyGrid.
5. Implementation
The security system is composed of a user directory, a credential repository, an attribute authority (AA) and a policy authority (PA). All except the user directory provide web service interfaces. For the user directory we have used OpenLDAP [14], though any LDAP directory could have been used. The credential repository is MyProxy [15], configured to act as an online certificate authority (CA). It will issue an end user a public-private key pair and a certificate confirming their identity, as determined by their LDAP distinguished name, if their user name and password can be used to bind to the LDAP directory. The attribute authority stores information on data collection projects and on users' privileges. It is implemented as a set of persistable Java classes mapped to a relational database. The definition of a project includes the project name, a list of the sub-groups that exist for the project, and a list of the valid roles that a user can be assigned. A user's role or group membership is termed a "privilege". The AA knows each user by their distinguished name, and stores the list of projects and privileges for each user. The AA provides a web service interface with operations to query project information and user privileges. The other major function of the AA is the issuing of signed SAML assertions, which bind a user's identity to their privileges; we have used the OpenSAML [16] implementation for this. The Policy Authority makes access control decisions based on the stored policy and the user privileges supplied in the SAML assertion. Policies are implemented as sets of persistable Java objects, and any number of policies may be stored; typically there will be one policy for each data collection project. A policy consists of statements, and each statement has an action, a target and a rule. The PA verifies the validity and integrity of a SAML assertion and confirms that it comes from a trusted AA. It then checks the policy to determine whether the supplied privileges are sufficient for the requested target and action. In this context a target corresponds to an object or object group in the data repository, and the action is the operation to be performed.
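As an illustration of this policy structure, the following is a minimal sketch in plain Java; the class names and the example rule are ours, for exposition, and do not reproduce PsyGrid's actual data model.

import java.util.List;
import java.util.Set;

// A statement pairs a target and an action with a Boolean rule over
// privileges; rules are built from AND, OR, NOT and privilege tests.
interface Rule {
    boolean evaluate(Set<String> privileges);

    static Rule privilege(String name) { return p -> p.contains(name); }
    static Rule and(Rule a, Rule b)    { return p -> a.evaluate(p) && b.evaluate(p); }
    static Rule or(Rule a, Rule b)     { return p -> a.evaluate(p) || b.evaluate(p); }
    static Rule not(Rule a)            { return p -> !a.evaluate(p); }
}

record Statement(String target, String action, Rule rule) {}

class PolicySketch {
    public static void main(String[] args) {
        // Hypothetical statement: exporting the FEP data set requires the
        // ProjectAdministrator role, or a Researcher who is not External.
        Statement export = new Statement("dataset/FEP", "export",
            Rule.or(Rule.privilege("role:ProjectAdministrator"),
                    Rule.and(Rule.privilege("role:Researcher"),
                             Rule.not(Rule.privilege("group:External")))));

        List<Statement> policy = List.of(export);
        Set<String> userPrivileges = Set.of("role:Researcher"); // from a SAML assertion

        boolean granted = policy.stream()
            .filter(s -> s.target().equals("dataset/FEP") && s.action().equals("export"))
            .anyMatch(s -> s.rule().evaluate(userPrivileges));
        System.out.println(granted); // true: Researcher and not External
    }
}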
Client-side APIs have been developed for both the PA and the AA to hide the details of accessing the web services. This means that services using the access control functionality need only call the makePolicyDecision function provided by the API.
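A service-side check might then look like the following sketch; the makePolicyDecision name comes from the text above, but its signature, the surrounding types and the error handling shown here are assumptions made for illustration.

// Hypothetical shape of the authorisation check inside a protected web
// service method, using a client-side API that wraps the PA web service.
public class RecordService {
    private final PolicyAuthorityClient pa;

    public RecordService(PolicyAuthorityClient pa) { this.pa = pa; }

    public PatientRecord getRecord(String recordId, String samlAssertion) {
        // target = the protected object, action = the operation requested;
        // the signed SAML assertion carries the caller's privileges.
        boolean granted = pa.makePolicyDecision("record/" + recordId, "read", samlAssertion);
        if (!granted) {
            throw new SecurityException("Access denied by policy for record " + recordId);
        }
        return loadFromRepository(recordId);
    }

    private PatientRecord loadFromRepository(String recordId) {
        return new PatientRecord(recordId); // repository lookup elided
    }
}

// Supporting types, sketched for completeness.
interface PolicyAuthorityClient {
    boolean makePolicyDecision(String target, String action, String samlAssertion);
}
record PatientRecord(String id) {}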
6. Future Work
The Project Manager Client Application, which comprises the Data Set Designer and the Security Manager, has yet to be implemented. Until this component is available, the vision of a system that end users administer and operate cannot be realised. Integration with the Grid Security Infrastructure [17] will be required in the near future, but initially only as a consumer of GSI-protected services. We would like to move to a standard policy language such as XACML [18] for the policy authority, or to a complete privilege management system such as PERMIS [19]. Whilst the security system has been designed to operate in a federated mode, this has not yet been tested, and operational procedures have yet to be defined that will enable a single certificate authority to be the root of trust for all attribute authorities. We plan to investigate standardised federation using Shibboleth [20]. The integration of two-factor authentication using a hardware token will enhance the security system and eliminate the need for the credential repository, though it will come with the operational overhead of certificate management. Integration of FAME [21] will provide flexible authentication and the ability to feed the authentication Level of Assurance into the access control function.
7. Conclusion
The PsyGrid Data Collection System has been deployed to establish a cohort of first episode psychosis patients for early intervention research in psychiatry. We have deployed the system within the United Kingdom National Health Service (NHS) network, which is a closed network accessible only to NHS personnel. This has provided us with a safe environment in which to gain operational experience of the system whilst minimising the risk of data disclosure. If this trial data collection period is successful, we intend to make the system publicly available.
Acknowledgements The PsyGrid project is funded by the UK Medical Research Council and the Department of Health as part of the e-Science programme. This work has been shaped by discussions and suggestions from many people and we would particularly like to express our gratitude to Professor Shôn Lewis and Professor Carole Goble.
References
[1] T. Hey and A. Trefethen. The UK e-Science Core Programme and the Grid. Future Generation Computer Systems 18 (2002), 1017-1031.
[2] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1998.
[3] B. Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, Inc., second edition, 1996.
[4] C.A. Goble, S. Pettifer and R. Stevens. Knowledge Integration: In silico Experiments in Bioinformatics. In The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 2003.
[5] P. O'Carroll, W. Yasnoff, E. Ward, L. Ripp and E. Martin. Public Health Informatics and Information Systems. Springer, New York, 2003.
[6] The PsyGrid project, http://www.psygrid.org
[7] M. Atkinson et al. Web Service Grids: An Evolutionary Approach. UK e-Science Technical Report UKeS-2004-05.
[8] ITU-T Rec. X.509 | ISO/IEC 9594-8, The Directory: Authentication Framework, 2000.
[9] The TLS Protocol Version 1.0, IETF RFC 2246. Available online at http://www.ietf.org/rfc/rfc2246.txt
[10] OASIS Web Services Security 1.0, Organization for the Advancement of Structured Information Standards. Available online at http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wss
[11] R.S. Sandhu, E.J. Coyne, H.L. Feinstein and C.E. Youman. Role-Based Access Control Models. IEEE Computer 29(2) (Feb 1996), 38-43.
[12] Security Assertion Markup Language, OASIS. Available online at http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=security
[13] XML Digital Signature Syntax and Processing, IETF RFC 3275. Available online at http://www.ietf.org/rfc/rfc3275.txt
[14] OpenLDAP, http://www.openldap.org
[15] J. Basney, M. Humphrey and V. Welch. The MyProxy Online Credential Repository. Software: Practice and Experience 35(9) (July 2005), 801-816.
[16] OpenSAML, http://www.opensaml.org
[17] V. Welch, F. Siebenlist, I. Foster, J. Bresnahan, K. Czajkowski, J. Gawor, C. Kesselman, S. Meder, L. Pearlman and S. Tuecke. Security for Grid Services. In Proceedings of the Twelfth International Symposium on High Performance Distributed Computing (HPDC-12), IEEE Press, June 2003.
[18] eXtensible Access Control Markup Language (XACML), http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml
[19] D.W. Chadwick and A. Otenko. The PERMIS X.509 Role Based Privilege Management Infrastructure. In Proceedings of the Seventh ACM Symposium on Access Control Models and Technologies (SACMAT '02), Monterey, California, USA, June 2002. ACM Press, New York, 135-140.
[20] Shibboleth, http://shibboleth.internet2.edu/
[21] J.S. Chin, M. Parkin, N. Zhang, A. Nenadic and J.M. Brooke. An Authentication Strength Linked Access Control Middleware for the Grid. International Journal of Information Technology 11(4) (2005), 1-12.
Architecture of Authorization Mechanism for Medical Data Sharing on the Grid
Takahito Tashiro, Susumu Date, Shingo Takeda, Ichiro Hasegawa and Shinji Shimojo
Osaka JGN II Research Center, National Institute of Information and Communications Technology; Graduate School of Information Science and Technology, Osaka University; Cybermedia Center, Osaka University
Abstract. Data security is becoming increasingly important as the Grid matures. The advances of the Grid have allowed scientists and researchers to build a data grid where they can share and exchange research-related data and information. In reality, however, these specialists do not benefit enough from this data grid, because the current Grid does not have sufficiently robust and flexible data security. We investigate a medical data-sharing environment where medical doctors and scientists can securely share clinical and medical research data. We show medical data sharing that takes advantage of PERMIS, an RBAC-based authorization system that achieves XML element level access control. We also describe the lessons learnt in designing the environment, as well as a comparison with other existing authorization mechanisms.
Keywords. data security on the Grid, PERMIS, RBAC, XML element level access control
1. Introduction
Demand for the Grid [1] has been increasing, and it has been aggressively developed in the hope that it will provide a platform for next-generation scientific research and development. The Grid allows us to dynamically share computational resources and scientific measurement devices geographically distributed among multiple organizations, such as universities and scientific institutions. Data and information used for collaboration on the Internet are easily integrated on the Grid, which can dynamically aggregate diverse computational resources and share and exchange data and information. These capabilities have been increasing the interest of scientists and researchers in a new research platform that will improve the efficiency of their research. A security solution that can protect data and services against malicious users has become increasingly important as the demand on the Grid has increased. Grid technology allows us to share and exchange the data, information, and services necessary for our research among multiple administrative domains. As a result, a robust and flexible
1 Cybermedia Center, Osaka University, Mihogaoka 5-1, Ibaraki, Osaka, 567-0047 Japan. Tel: +81 6 6879 3865; Fax: +81 6 6879 3864; E-mail: tashiro@ais.cmc.osaka-u.ac.jp
access control mechanism that allows administrators to grant access privileges to the right people is essential for data security. However, in reality, even the Globus Toolkit [2], the de facto standard middleware for constructing Grid environments, does not provide such a data access control mechanism by default. The Globus Toolkit offers an excellent authentication infrastructure, the Grid Security Infrastructure (GSI) [3]. GSI is the de facto standard authentication system on the Grid, and it provides a robust and consistent authentication infrastructure through the use of X.509 Public Key Certificates (PKCs). With this infrastructure, users can perform single sign-on authentication on the Grid. Nonetheless, GSI cannot offer access control that adequately satisfies the requirements for data security and service protection. In this paper, we describe our exploration and investigation of existing authorization mechanisms on the Grid. We consider a medical data-sharing environment that requires access control to services and data. In particular, this paper focuses on a Role Based Access Control (RBAC) [4] implementation, the Privilege and Role Management Infrastructure Standards Validation (PERMIS) [5] system, in designing a medical data-sharing environment. Through the consideration and design of the environment, we compare PERMIS with related authorization mechanisms. This paper is structured as follows. Section 2 describes the medical data-sharing model we consider and then presents the requirements for an authorization mechanism applicable to our model. Section 3 compares and discusses various existing authorization mechanisms with respect to our requirements. Our access control mechanism using PERMIS is detailed in Section 4. A brief summary is given in Section 5, along with concluding remarks and ideas for future work.
2. Our Scenario: Medical Data Sharing Environment
In this section, we introduce a simplified medical data-sharing environment that requires a flexible access control mechanism. We then explain the necessity for an authorization mechanism to achieve such access control, using a use-case scenario in our medical data-sharing environment.
2.1. A Use-case Scenario in Our Medical Data Sharing Environment
Figure 1 shows the overview of a use-case scenario in our medical data-sharing environment. In this scenario, the medical database is accessed and shared by medical doctors and by researchers such as biologists and computer engineers. The concept behind the scenario is that the medical database is used not only for practical medical care but also for research. Electronic medical charts were introduced in IT-advanced hospitals to record the history of medical care electronically. In general, these charts are only stored and used locally, and these hospitals mostly use electronic charts only for practical medical care. The challenge we face in this scenario is to achieve a medical data-sharing environment where clinical data and related research data are managed, together with a robust and flexible access control mechanism for data security. We conducted this research to make medical research and clinical practice more efficient. By making use of the data-sharing environment described, every research institution can utilize this data for its own research and return feedback based on its research findings. We expect that this cycle will encourage sharing of advanced medical knowledge and enable greater improvements to be made in medical technology. We herein define a medical chart that is specialized for this purpose as an "analysis chart." The details of the analysis chart are provided in section 2.2. However, to achieve this data-sharing environment, we strongly need a robust and flexible data access control mechanism that satisfies a variety of user requirements for data security on the Grid, based on user identities and attributes. Examples of such user attributes include the organization to which users belong and/or their duties, responsibilities, and qualifications in the organization.
Figure 1. Overview of Use-case Scenario in Our Medical Data Sharing Environment
2.2. Details of Our Use-case Scenario
In our scenario, the analysis chart is assumed to be written in an XML file format. Figure 2 shows an example of the analysis chart. The analysis chart is composed of the following items: "name," "ID," "age" and "sex" of the patient, "name of attending doctor," "symptoms," "orders" and "results" of medical tests, and "comments on the results." The people involved in our scenario have one of the following attributes: "doctor," "trainee doctor," "nurse," and "clinical technologist" in the hospital, and "researcher" in another research institution. These attributes may be reassigned and/or revoked upon promotion or personnel change. For example, a person who is a "trainee doctor" may be promoted to "doctor," and a person who is a "doctor" may transfer to another research institution as a "researcher." Table 1 shows who is granted the privilege to perform which actions on which items of the analysis chart. We call these rules the "authorization policy." This policy specifies that the "doctor" can read and write all items, except for writing the "results of a medical test," but the "trainee doctor" is not granted the privilege to submit "orders for medical tests" or to write "comments on the tests' results." A "nurse" cannot write any items. Only a "clinical technologist" is allowed to write the "results of a medical test." A "researcher"
<AnalysisChart>
  <PatientName>Takahito Tashiro</PatientName>
  <PatientID>01234567</PatientID>
  <Age>26</Age>
  <Sex>Male</Sex>
  <BloodType>O</BloodType>
  <DoctorName>Susumu Date</DoctorName>
  <Symptoms>xxxxxxxxxxxx</Symptoms>
  <Order>xxxxxxxxxx
    <Date>20060113</Date>
    <Time>1400</Time>
  </Order>
  <Result>xxxxxxxxxxxxx
    <Writer>Susumu Date</Writer>
    <Date>20060115</Date>
    <Time>0927</Time>
  </Result>
  <Comment>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    <Writer>Shinji Shimojo</Writer>
    <Date>20060118</Date>
    <Time>2235</Time>
  </Comment>
</AnalysisChart>
Figure 2. Example of Analysis Chart
Table 1. Example of Authorization Policy. "r" means the "User" is granted the privilege to read the "Item," and "w" means the "User" is granted the privilege to write the "Item."

Item \ User                 Doctor   Trainee Doctor   Nurse   Clinical Technologist   Researcher
Patient's Name, ID          rw       rw               r-      r-                      --
Patient's Sex, Age          rw       rw               r-      r-                      r-
Doctor's Name               rw       rw               r-      r-                      --
Symptoms                    rw       rw               r-      r-                      r-
Orders for Medical Tests    rw       r-               r-      r-                      r-
Results of Medical Tests    r-       r-               r-      rw                      r-
Comments                    rw       r-               --      --                      rw
cannot read information that specifies who the patients are. However, he/she can read a patient's physical features (in this case, age and sex), the symptoms, and the items related to medical tests, for use during research, and he/she can write his/her opinions and findings based on that research in the "comments" field.
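Read as data, Table 1 is simply a two-level lookup from role and item to an access string. The following minimal Java sketch makes the policy executable; the shorthand item keys are ours, not part of the system described in this paper.

import java.util.Map;

// Table 1 as data: each (role, item) pair maps to "rw", "r-" or "--".
public class ChartPolicy {
    static final Map<String, Map<String, String>> POLICY = Map.of(
        "Doctor", Map.of("PatientNameId", "rw", "SexAge", "rw", "DoctorName", "rw",
                         "Symptoms", "rw", "Orders", "rw", "Results", "r-", "Comments", "rw"),
        "TraineeDoctor", Map.of("PatientNameId", "rw", "SexAge", "rw", "DoctorName", "rw",
                         "Symptoms", "rw", "Orders", "r-", "Results", "r-", "Comments", "r-"),
        "Nurse", Map.of("PatientNameId", "r-", "SexAge", "r-", "DoctorName", "r-",
                         "Symptoms", "r-", "Orders", "r-", "Results", "r-", "Comments", "--"),
        "ClinicalTechnologist", Map.of("PatientNameId", "r-", "SexAge", "r-", "DoctorName", "r-",
                         "Symptoms", "r-", "Orders", "r-", "Results", "rw", "Comments", "--"),
        "Researcher", Map.of("PatientNameId", "--", "SexAge", "r-", "DoctorName", "--",
                         "Symptoms", "r-", "Orders", "r-", "Results", "r-", "Comments", "rw"));

    static boolean canRead(String role, String item)  { return POLICY.get(role).get(item).charAt(0) == 'r'; }
    static boolean canWrite(String role, String item) { return POLICY.get(role).get(item).charAt(1) == 'w'; }

    public static void main(String[] args) {
        System.out.println(canRead("Researcher", "PatientNameId"));      // false: identity hidden
        System.out.println(canWrite("ClinicalTechnologist", "Results")); // true
    }
}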
2.3. Requirements for Authorization Mechanism
To satisfy the security requirements in our scenario, we stipulated that the authorization mechanism should have the following features.
1. Privilege assignment based on users' attributes: it can grant appropriate access privileges to users according to their attributes.
2. Access control at fine granularity: it can control access to arbitrary parts of the data.
3. Manageability: the system can be easily introduced and managed.
The details of each of these are given next. The first feature enables granting appropriate access privileges to users based on their attributes. Discretionary Access Control (DAC) [6] is one of the most common and conventional authorization mechanisms; Unix systems have adopted this mechanism. DAC manages access based on the identity of users and/or the groups to which they belong. However, DAC cannot bind the privileges of multiple groups to a target resource, and our scenario needs to control access based on multiple user attributes; thus, DAC is insufficient to meet our requirement. Another common authorization mechanism is Mandatory Access Control (MAC) [6]. In MAC, every user and target resource is assigned sensitivity labels, which specify hierarchical classification levels, and MAC makes authorization decisions based on these labels. MAC is suitable for an authorization model based on a hierarchical structure, such as that of the military. But our users' privilege model cannot be described simply by a hierarchical structure; thus, MAC cannot be applied to our model. RBAC is a more flexible authorization mechanism. In RBAC, access privileges are granted to roles, not to users directly. Users are assigned appropriate roles based on the organization to which they belong and/or their responsibilities and qualifications in the organization. To adopt RBAC for our scenario, we only have to regard user attributes as roles and assign appropriate privileges to each role. RBAC is naturally suited to our model. The second feature enables controlling access to arbitrary parts of a data file. In an analysis chart, each piece of information, such as age and patient name, is individually written as an XML element. Thus, we require fine-grained access control at the level of the XML element. To perform authorization at the XML element level, we must bind an authorization policy to every element in an XML file. Two ways of binding can be used: specifying the policies in the data files themselves, or keeping the policies in a separate file. The latter requires additional information to bind the policies to the data, but the former does not. However, if authorization policies need to be changed, the former requires modifying the policies in all of the relevant files, whereas the latter requires modifying only the files in which the policies are written. The third feature is the ease of introduction and management of the authorization system. The workload for introducing the authorization system, e.g., modifying existing systems and resources to adapt to it, should be kept as small as possible. Managing the authorization system, such as modifying
authorization policies and adding users, should also be easy. In particular, authorization policies increase in complexity as the granularity of access control becomes finer. In addition, the users and organizations that form the Grid environment change dynamically and frequently. The cost of system management needs to be kept as low as possible so that these situations can be handled flexibly.
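To make the second feature concrete, the sketch below filters an analysis chart at the XML element level with the standard Java DOM API, removing the elements a role may not read before the document is returned. It illustrates the requirement only; it is not the mechanism adopted later in the paper.

import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

// Element-level access control: drop the top-level elements the
// requester's role may not read before the chart is handed back.
public class ElementFilter {
    public static Document filterForRole(String xml, Set<String> readable) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        NodeList children = root.getChildNodes();
        // Walk backwards so removals do not disturb the iteration indices.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && !readable.contains(child.getNodeName())) {
                root.removeChild(child);
            }
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        String chart = "<AnalysisChart><PatientName>T. Tashiro</PatientName>"
                     + "<Symptoms>...</Symptoms></AnalysisChart>";
        // A researcher may read symptoms but not the patient's name.
        Document visible = filterForRole(chart, Set.of("Symptoms"));
        System.out.println(visible.getDocumentElement().getChildNodes().getLength()); // 1
    }
}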
3. Comparison and Discussion of Authorization Systems
In this section, we compare and discuss some major authorization systems against the three requirements given in Section 2.3.
3.1. GSI
GSI provides a simple authorization mechanism called the "grid-mapfile." A grid-mapfile allows resource administrators to map Grid users to local users. The grid-mapfile is located at each host that provides resources to the Grid, and contains mappings between the Grid-wide identities of users and local user accounts on the host. The final authorization decision using GSI on a resource depends on what kind of local authorization mechanism is in place on the resource. In other words, if traditional file access control is used on the host, the information on users' privileges bound to each file or directory is used; if the host is running an RBAC-compatible OS, RBAC is utilized for access control to the data resources. This indicates that the final authorization decision cannot be controlled from GSI, and thus we can neither enforce RBAC authorization on the data resources nor control the granularity of authorization.
3.2. CAS
The Community Authorization Service (CAS) [7] leverages the OS-inherent file permission system to control Grid users' privileges. Technically, the functions of GSI are used inside CAS. As stated earlier, CAS is therefore not capable of supporting RBAC or XML element level access control. In GSI, only an administrator at each resource in the Grid environment can restrict Grid users' actions on the resource, whereas CAS allows the Grid administrator to restrict users' actions. In CAS, all users are at first granted the same privileges by the resource administrators; after that, the Grid administrator adds restrictions to each user's actions according to the policies of the Grid environment. Central servers in the Grid environment manage these policies.
3.3. PRIMA
PRIMA [8] also performs authorization by mapping users to local accounts on the target resources, in a manner similar to GSI and CAS. Consequently, PRIMA does not support RBAC or XML element level access control. PRIMA differs from GSI and CAS in its dynamic creation of local accounts. This functionality saves resource administrators the trouble of having to create accounts and modify their privileges each time the authorization policies change.
3.4. VOMS
The Virtual Organization Membership Service (VOMS) [9] provides the functionality of assigning attributes (roles or group memberships) to Grid users, which means VOMS is capable of supporting RBAC. Users are independently assigned privileges at each resource based on these attributes. However, VOMS does not provide access enforcement functionality, that is, a Policy Enforcement Point (PEP); it needs to interact with other authorization mechanisms and a PEP to perform authorization.
3.5. PERMIS
PERMIS is an implementation of the RBAC mechanism based on the use of X.509 Attribute Certificates (ACs). ACs are used to store authorization policies, which specify which roles are assigned to which users, which roles are granted to perform which actions on which targets, and so on. Privileges are assigned to roles, not users; roles are assigned to Grid-wide identities of users. Targets are specified as the URL of a web service or the distinguished name (DN) of an LDAP subtree. Parts of a file, such as elements of an XML file, cannot be specified as targets. Actions indicate the actions or methods that users are allowed to perform on the targets; an action is the smallest granularity of access to a target. In PERMIS, authorization decisions, that is, whether access is granted or not, are made system-internally and independently of target resources. This means that no modification of resources is required to introduce PERMIS. Although PERMIS cannot control access at the level of the XML element, it allows control over the actions performed on resources. Thus, if the services have methods to access every element of the target XML files individually, PERMIS can perform access control at the XML element level. However, this approach requires creating services that depend on the format of the target data files.
3.6. Akenti
Akenti [10] provides an authorization mechanism based on three types of XML certificates. Akenti uses these certificates to restrict users' actions and to assign attributes, such as roles, to users; that is, Akenti supports RBAC. Akenti makes authorization decisions resource-independently by using the certificates relevant to the access requesters and target resources. Although Akenti cannot specify a part of a certain file as a target, it does allow administrators to specify the actions that users are permitted to perform on the resources. By making use of this feature, Akenti can employ the same approach as PERMIS to perform XML element level access control. In Akenti, however, every target resource, such as a file or service, must have one or more certificates to bind authorization policies. Thus, many certificates are needed to perform access control at finer granularity.
3.7. Discussion
The exploration of authorization systems in the previous subsections showed that VOMS, PERMIS, and Akenti support the RBAC mechanism; these three systems therefore fulfil our first requirement. However, no system directly meets our second requirement, access control at the level of XML elements. Two approaches can achieve element level access control. The first extracts every element from an XML file and stores each element in an individual file; this approach is applicable to GSI, CAS, and PRIMA. The second provides services with methods to access every element of the XML file individually; this approach is applicable to PERMIS and Akenti. One obvious disadvantage of the first approach is that it requires substantial effort to extract elements from a large number of XML documents. Moreover, it involves a lot of work whenever authorization policies are modified: for example, if a user's privilege to access a particular type of element changes, every file that holds this type of element needs to be changed. In contrast, the second approach requires only changing the relevant authorization policies. For these reasons, PERMIS and Akenti are suitable candidates for the authorization mechanism of our data sharing system. We then compared PERMIS and Akenti in terms of ease of authorization policy management, which is our third requirement. In PERMIS, all authorization policies can be contained in a single AC, whereas in Akenti every resource has one or more individual certificates binding its relevant authorization policies, so the number of certificates must be equal to or greater than the number of resources. Consequently, Akenti requires substantially more administrative effort for managing authorization policies than PERMIS. Therefore, PERMIS is the most suitable solution for our scenario. In the next section, we describe the details of applying PERMIS to our scenario.
4. Architecture Design using PERMIS
This section presents the architecture design of our medical database sharing system with fine-grained authorization using PERMIS.
4.1. Overview of the Architecture
The medical database sharing system is designed to provide two functions: RBAC and fine-grained access control. Reading and writing actions on the analysis chart are restricted based on user roles, and these actions are performed at the level of XML elements so that users can access different parts of a single chart depending on their roles. The interaction between the system and a user is outlined in Figure 3. The user submits an access request for an analysis chart to the data providing service via the portal service. The data providing service forwards this request to the data acquisition service, which asks the PERMIS authorization service whether the request is granted or denied. The PERMIS authorization service retrieves the ACs relevant to the request from the policy repository and makes the decision based on these ACs.
Figure 3. Architecture of Our Data Sharing System Using PERMIS
The data acquisition service then transfers the target elements to which the user is allowed access back to the data providing service. Finally, the data providing service constructs an XML document from these elements and returns it to the user via the portal service.
4.2. Authorization Flow
In this subsection, we describe the authorization flow in detail and show how XML element level access control is performed. The authorization flow in our system can be divided into three steps: request, decision, and data presentation. The access request from the user to the portal service is forwarded to the data providing service, which then sends an access request with the user's identity to the data acquisition service. The data acquisition service provides a read/write interface for every element in the analysis chart; access to these methods is permitted only if the PERMIS authorization service allows it for the user's role. The second step is the access control decision. The PERMIS authorization service retrieves the ACs relevant to the access request from the AC repository, which is a set of local directories and/or LDAP directories. These ACs are used to evaluate which role can be assigned to the user and whether the requested access can be granted to that role. If the access is not permitted, or if any AC necessary to make the decision is missing, the PERMIS authorization service returns a denial message. The final step is data presentation. If the request is permitted, the PERMIS authorization service returns “Granted” to the data acquisition service, which then retrieves the requested analysis chart from the data repository and extracts the element targeted by the permitted method. The data providing service invokes the methods of the data acquisition service multiple times to gather the elements of the analysis chart, and finally creates an XML document consisting of the gathered elements. This document is forwarded to the portal service and displayed to the user.
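The per-element guard at the heart of this flow can be summarised in a short, self-contained Python sketch. The class and method names below are our own illustration of the idea, not the authors' implementation:

# Hypothetical stand-in for the authorization decision point and the
# per-element read loop of the data acquisition service.
class ToyDecisionPoint:
    def __init__(self, policy):
        self.policy = policy                      # {(role, action, element): True}

    def decide(self, role, action, target):
        return "Granted" if self.policy.get((role, action, target)) else "Denied"

def read_chart(role, chart, pdp):
    """One read method per chart element, each guarded by the decision point."""
    granted = {}
    for element, value in chart.items():
        if pdp.decide(role, "read", element) == "Granted":
            granted[element] = value
    return granted                                # later assembled into an XML document

chart = {"diagnosis": "...", "lab_results": "...", "billing": "..."}
pdp = ToyDecisionPoint({("doctor", "read", "diagnosis"): True,
                        ("doctor", "read", "lab_results"): True})
print(read_chart("doctor", chart, pdp))           # the billing element is withheld

In the real system the decision point is queried remotely with the user's Grid identity, and the gathered elements are serialised back into an XML document by the data providing service.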
User authentication for each service is performed through GSI's inherent proxy certificate propagation [3].
5. Conclusion
In this paper, we investigated existing authorization mechanisms against three requirements for building a medical data-sharing environment in which medical doctors and scientists can securely share clinical data and medical research data. The architecture of this environment takes advantage of PERMIS to achieve RBAC and XML element level access control for the analysis chart. However, because our architecture is specialised for a single data format, we will work on making it flexible enough to handle a variety of data formats. Furthermore, we have found that, in the absence of a standard mechanism for granting authorization on the Grid, there are still too many alternative ways to implement Grid authorization. We will pursue research on Grid authorization in the future.
Acknowledgements This work was supported in part by the Osaka JGN II Research Center at the National Institute of Information and Communications Technology (NICT).
References
[1] I. Foster, et al.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International Journal of Supercomputer Applications, 15 (3), January 2001.
[2] The Globus Alliance: The Globus Toolkit, http://www.globus.org/toolkit/.
[3] R. Butler, et al.: A national-scale authentication infrastructure, IEEE Computer, 33 (12), 60–66, 2000.
[4] R.S. Sandhu, et al.: Role Based Access Control Models, IEEE Computer, 29 (2), 38–43, 1996.
[5] D.W. Chadwick, et al.: The PERMIS X.509 Role Based Privilege Management Infrastructure, Proc. of the Seventh ACM Symposium on Access Control Models and Technologies, 135–140, 2002.
[6] Department of Defense: Trusted Computer Security Evaluation Criteria, DoD 5200.28-STD, 1985.
[7] L. Pearlman, et al.: The Community Authorization Service: Status and Future, Proc. of the 2003 Computing in High Energy and Nuclear Physics, CoRR cs.SE/0306082, 2003.
[8] M. Lorch, et al.: The PRIMA System for Privilege Management, Authorization and Enforcement in Grid Environments, Proc. of the Fourth International Workshop on Grid Computing (Grid 2003), 109–116, 2003.
[9] R. Alfieri, et al.: VOMS, an Authorization System for Virtual Organizations, Proc. of the First European Across Grids Conference, 33–40, 2003.
[10] M.R. Thompson, et al.: Certificate-Based Authorization Policy in a PKI Environment, ACM Trans. on Information and System Security, 6 (4), 566–588, 2003.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
Database Integration for Predisposition Genes Discovery
François-Marie Colonna, Yacine Sam and Omar Boucelma
LSIS, UMR 6168, Université Paul Cézanne Aix-Marseille 3
Abstract. Lightweight and flexible data integration is nowadays a great challenge in biology. Database sizes have increased exponentially, and biologists need tools to extract and merge interesting data coming from several heterogeneous sources distributed over the Internet. Furthermore, there is a great need for data mediation on grid computing architectures. This paper describes a demonstration whose goal is to show how mediation, a data integration technique, can help in building such bioinformatics tools. We focus on the study of gene functions, especially of genes involved in resistance to malaria or genes that may be considered predisposition genes.
Keywords. integration, databases, biology, malaria
1. Introduction
Scientists today spend significant time and effort querying multiple remote or local heterogeneous data sources and integrating the results, either manually or with the aid of data integration tools, so that the results may be further manipulated using advanced data analysis and visualization tools. Biological resources are either publicly available on the Web or local and private; they may also number in the thousands. Biologists often need to merge information extracted from these data sources, spread over the Internet. Even though time-consuming computations such as those led by the Folding@Home project (see [Folding, 2006]) are distributed over a worldwide computing network, there is a great need for data mediation on the grid to exploit those results and crosscheck them with data extracted from other sources. Given the size of the entities and the variety of representation formats, data models, syntax (see Figure 1) and semantics, this work is nowadays becoming infeasible by hand. In computer science, gathering such data is framed as a data integration problem, which remains a great challenge in the biological domain. The commonly used approaches for database integration are warehousing and mediation: the first is based on data rewriting and the construction of a data warehouse, while the second concentrates on query rewriting based on the sources' capabilities. We developed and tested a generic tool based on relational algebra for integrating data from structured or semi-structured genomic sources. Our approach uses the HTTP protocol as the framework for the computing platform. We compute a score and obtain lists of genes involved in resistance or predisposition to a particular disease. In this demonstration, we focus particularly on malaria.
1 Correspondence to: colonnaf@lsis.org
Figure 1. Various names of gene IL12B (SwissProt: P29460; EMBL accession numbers: M65272, M65290, AY008847, AF180563, AF512686; GeneW: HGNC:5970, IL12B; GeneCard: IL12B; Prosite: PS50853, PS01354, PS50835).
Figure 2. Area of interest on chromosome 5: region 5q31-q33, containing about 400 genes, from which a priority list of genes for study is derived.
2. Illustrating Example with Malaria
Malaria is caused by protozoan parasites of the genus Plasmodium: Plasmodium falciparum (the only lethal one), Plasmodium vivax, Plasmodium malariae and Plasmodium ovale. Subjects infected by the parasite may present no symptoms, or develop a simple form of the disease (paludal crisis) that can evolve into a serious condition (severe anaemia, respiratory distress, or neuropaludism). Malaria depends on different factors, such as inadequate health structures and poor socio-economic conditions, the environment, and the genetic characteristics of the host and the parasite: it is a multifactorial disease. A clinical genetic linkage study performed in Burkina Faso identified two groups of genes in two chromosomal areas possibly involved in the disease. The first is area 5q31-q33 on chromosome 5 [Rihet et al., 1998], and the second is area 6p21-p23 on chromosome 6 [Rihet et al., 1998]. These regions contain respectively 400 and 700 genes (known or not). The areas of interest on the chromosomes have been confined, but there are too many data sources to predict "by hand" which gene is more likely involved in malaria's evolution. We need to establish a priority list by interoperating data from several sources and then computing a score to guide geneticists' gene studies. Since malaria is a multifactorial disease, numerous sources could be integrated; in this paper, we mainly focus on data sources dealing with human biological factors. Understanding the global processes of vital phenomena requires automatic processing as far as possible. This is why data integration has become a major research field within the bioinformatics community.
3. Application Scenario
We adopted a mediation approach to perform queries over a set of data sources linking chromosomal regions and malaria. The process is summarized in Figure 3. Given a list of genes, we query several data sources and extract pertinent attributes, which are passed to data-mining operators to perform gene scoring. With the help and input of biologists, we focused on seven publicly available databases:
Proteins: SwissProt
Mutations: dbSNP, HGVBase, SNPper
Bibliography: PubMed
Gene: UCSC, LocusLink
We also integrated partial references from OMIM [OMIM, 2005]: identifiers used during the scoring phase. In doing so, we had to face the registration process described in [DiscoveryLink, 2001]: for each source, we identified its capabilities: location, data structure, models, schemas, and query engines. As we deal with few sources and want an easy rewriting step, we adopted a Global-As-View (GAV) mediation approach [Levy, 1999]. Local and global schemas are based on the relational model, and queries are expressed in SQL, allowing the expression of complex queries using joins, unions, etc. The biological mediator is coupled with a small local warehouse, used during the prediction phase.
Figure 3. Calculation of a score for each gene. (The workflow extracts the list of genes in the interval from UCSC, articles from PubMed, SNPs from HGVBase, dbSNP and SNPper, and GO, SwissProt and InterPro annotations via LocusLink; after identifier translation, a score is computed to produce the prioritized list of genes for study.)
3.1. Sources Mapping
Building our own mapping and mediation engine is still work in progress, especially due to the complexity of query rewriting and data mapping. For the purpose of our prototype, we used Medience Server, based on research conducted at INRIA. Medience provides a user-friendly interface for mapping heterogeneous sources to a relational schema. Queries can be addressed to local or global tables using SQL, and relational algebra provides powerful join and union operators. Furthermore, the correspondence manager GUI is an advantage: mappings can be specified directly, while adaptive mapping is used for advanced changes within the data structure. Let L1 and L2 be two local relations, and A1 and A2 their respective attributes. To create a global attribute protein_name depending on the local attribute values, we simply write:
IF (L1.A1 LIKE ’P01%’) AND (L2.A2 LIKE ’%HUMAN’) THEN protein_name = CONCAT(L1.A1,L2.A2)
(% matches any number of characters.) This kind of data transformation can also homogenize numerical values such as SNP frequencies, which some databases express as percentages and others as values in an interval (a sketch of such a rule is given after Figure 4). A manual description of mapping rules is feasible only with a few sources; it becomes a huge amount of work for large-scale integration. We think that the work done in [Manoah et al., 2004], dedicated to GIS sources, can be adapted to the biological domain by changing the corpus used by the execution engine. This could result in semi-automated mapping, although some critical cases are not yet solved automatically. Data can then be extracted from the global schema with standard SQL, for example:
SELECT DISTINCT ISOFORM.iso_id, SEQUENCE.title
FROM ISOFORM, SEQUENCE
WHERE ISOFORM.seq_id = SEQUENCE.ic_acckey
AND ISOFORM.location LIKE 'mrna-utr'
Figure 4. Data extraction from global schema
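As an illustration of the numerical homogenization mentioned above, a mapping rule in the same style as the protein_name rule could rescale percentage-valued frequencies into a common interval. The relation and attribute names (L3, FREQ) and the threshold heuristic are hypothetical:

IF (L3.FREQ > 1) THEN snp_frequency = L3.FREQ / 100
ELSE snp_frequency = L3.FREQ

A real integration would rely on each source's documented units rather than on such a value-based heuristic.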
Once integrated, the data need to be analysed. The local warehouse contains a set of genes, and each of these genes needs to be associated or not with malaria. The resolution method is based on associations between terms describing the disease; the relationships between terms are identified and scored using a method based on fuzzy sets theory described in [Perez et al., 2004].
3.2. Computation on Fuzzy Sets
Basically, if we have two sets of terms $T_1$ and $T_2$, and two terms $t_1 \in T_1$, $t_2 \in T_2$, we can define a fuzzy binary relation $R$ in $T_1 \times T_2$ to measure their degree of association. This association is computed using a subset of abstracts from a bibliographic bank (formulas in Figure 5). The composition of two membership functions is used to associate terms of three or more sets, as shown in Figure 5. Strength of the terms' association:
$$\mu_R(t_1, t_2) = \frac{|A_{t_1 \wedge t_2}|}{|A_{t_1 \vee t_2}|}$$
(where $A_{t_1 \wedge t_2}$ is the subset of articles containing both $t_1$ and $t_2$, and $A_{t_1 \vee t_2}$ is the subset of articles containing $t_1$ or $t_2$). Max-product composition:
$$\mu_{R_2 \circ R_1}(t_1, t_3) = \max_{t_2 \in T_2}\left(\mu_{R_1}(t_1, t_2) \cdot \mu_{R_2}(t_2, t_3)\right)$$
Figure 5. Calculation of a score for each gene
3.3. Results Discussion
For the predictions, we used OMIM, MeshC and GO terms. We first choose the entry in OMIM that best describes the studied disease; in our case, "Malaria, intensity of infection", with id 248310. We then define the relations linking the OMIM entry to MeshC terms and MeshC terms to GO terms, together with their composition, retaining the subset of most frequent MeshC terms from the first relation.
The set of annotated genes is stored in our warehouse. With these relations, we link the OMIM entry to the GO terms annotating our genes (column "GO terms" in Table 1). The next step is to estimate the degree of association between the OMIM term describing malaria and the whole set of GO terms annotating each gene (column "Annotated GO terms" in Table 1). A frequently annotated gene is potentially more interesting than others relative to the total number of annotations available. The final list is ordered by the ratio of the two columns (Table 1).

Table 1. Genes potentially involved in malaria

Gene     GO terms   Annotated GO terms   Ratio
CYFIP2   2          3                    66%
IL13     2          3                    66%
CD14     5          8                    62.5%
ETF1     3          5                    60%
NRG2     3          6                    50%
IL5      3          7                    43%
G3BP     3          15                   20%
4. Concluding Remarks
This paper describes a data integration system for genomic data and its use in identifying genes that may explain why some people respond differently to malaria. A comparison between experimental results obtained by hand and those obtained via the mediation engine prototype showed that the tool is quite effective: genes in the list returned by our system also appeared in the list generated by hand using two different methods, as detailed in [Rogier, 2004]. There is a slight difference: the manual process returns a longer list, because the prototype does not integrate partially available data; an entity with several NULL attributes is not considered, whereas it could be manually. Feedback from these experiments has also highlighted future research areas, especially the algorithmic aspects of query execution and lightweight data integration using XML. Query processing currently consists of splitting a query posed against the global schema; it needs to be extended to process complex workflow schedules, mimicking the steps biologists follow while crawling the Web to extract information.
References
[Rogier, 2004] Rogier, O. (2004) Recherche de gènes candidats pour le paludisme. Mémoire de DEA BBSG.
[Rihet et al., 1998] Rihet, P., Traore, Y., Abel, L., Aucan, C., Traore-Leroux, T. (1998) Malaria in humans: Plasmodium falciparum blood infection levels are linked to chromosome 5q31-q33. American Journal of Human Genetics, 63, 498–505.
[Rihet et al., 1998] Rihet, P., Abel, L., Traore, Y., Traore-Leroux, T., Aucan, C. (1998) Human malaria: segregation analysis of blood infection levels in a suburban area and a rural area in Burkina Faso. Genetic Epidemiology, 15, 435–450.
[Perez et al., 2004] Perez-Iratxeta, C., Bork, P., Andrade, M.A. (2002) Association of genes to genetically inherited diseases using data mining. Nature Genetics, 31, 316–319.
[OMIM, 2005] OMIM (2005) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
[DiscoveryLink, 2001] Haas, L., Schwarz, P., Kodali, E., Kotlar, E., Rice, J., Swope, W. (2001) DiscoveryLink: A System for Integrating Life Sciences Data. IBM Systems Journal, 40.
[Levy, 1999] Levy, A. (1999) Logic-Based Techniques in Data Integration. Workshop on Logic-Based Artificial Intelligence, University of Washington.
[Folding, 2006] Stanford University (2006) Folding@Home Project (http://folding.stanford.edu/), Pande Group, Chemistry Department, Stanford University.
[Manoah et al., 2004] Manoah, S., Boucelma, O., Lassoued, Y. (2004) Schema Matching in GIS. AIMSA 2004, 11th International Conference, Bulgaria, 3192.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
High Performance GRID Based Implementation for Genomics and Protein Analysis
L. Milanesi and I. Merelli
CNR-ITB Institute of Biomedical Technologies, Via Fratelli Cervi 93, 20090 Segrate (Milano), Italy
Abstract. Starting from genomic and proteomic sequence data, a complex computational infrastructure has been established with the objective of developing a GRID based system to automate the analysis, prediction and annotation of genomic DNA. To support this type of analysis, several algorithms have been used to recognize the biological signals involved in the identification of genes and proteins. The implemented system can be used to analyse the content of large numbers of genomic sequences. For this reason, the system relies on a computational architecture specifically designed for intensive computing, based on GRID technologies developed throughout the BIOINFOGRID European project. We developed a GRID based workflow to correlate different kinds of bioinformatics data, going from the genomic nucleotide sequence to the protein sequence. The first step of the workflow consists of submitting a nucleotide sequence, which is processed by specific gene prediction software; in particular, this tool searches the nucleotide sequence for the key components of a gene. The predicted gene is then translated into the corresponding protein sequence. Based on the protein sequence, it is then possible to identify the domains that characterize the protein's functionality using specific domain prediction tools. Protein domain classification is very important in the analysis of macromolecular functionality. Analysing a whole protein family from the large genomes of various organisms means processing a large amount of data, which requires huge computational resources. To analyse all these data we propose the use of a high performance platform based on grid technology. We have implemented our applications on a wide area grid platform for scientific applications [http://www.grid.it and http://grid-it.cnaf.infn.it] composed of about 1000 CPUs. The grid infrastructure consists of a collection of computing elements and storage elements that jointly define a platform for high performance elaboration. In this study, a grid based application is presented to compute the protein domain analysis in a distributed way. This approach performs well because the protein domains are checked with different software in parallel at different grid sites.
1. Introduction
In sequence-based proteomics it is very important to understand the functionality of a protein. The typical approach is to perform a domain analysis of the protein sequence: discovering similarities between the input sequence and the domain databases allows us to determine the protein's domain composition [1].
This homology analysis is useful for understanding the protein structure and, moreover, for defining its functionality and its three-dimensional configuration (Fig. 1). Several databases provide classifications of proteins at different levels [2]. These databases include information about super-families, families, domains, motifs and sites, and provide annotations and links to the members of other databases. The member databases use different methodologies and varying degrees of biological information on well-characterized proteins to derive protein signatures. The most important protein domain databases are PROSITE [3], PRINTS [4], Pfam [5], ProDom [6], SMART [7], TIGRFAMs [8], PIR superfamily [9] and SBASE [10]. A typical analysis can be performed using InterProScan [11], a tool that searches for protein domain patterns in several of these databases. This software usually takes a long time to complete the analysis, and when this kind of analysis is extended to whole-genome protein families the input files become very large and heavy to process [12]. For this reason, parallel implementations of the computation have been developed using the C-MPI standard library on local clusters [13]. With the C-MPI solution the elaboration time can be reduced in proportion to the cluster size, but when the input data are at genome scale, the grid platform becomes the more efficient solution. In order to use these analysis tools with grid technology, a classic distributed-computation approach has been adopted: the various domain analysis tasks are distributed to different computing elements on the grid, and the output data are then collected and integrated. For each analysis tool and database, a different implementation for the grid infrastructure has been chosen in order to maximize job efficiency; this choice is strongly influenced by the size of the I/O and of the database.
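The collect-and-integrate step mentioned above can be illustrated with a minimal Python sketch; the data layout (per-tool dictionaries keyed by sequence identifier) is our own assumption, not the authors' format:

def integrate(results_by_tool):
    """Merge per-tool domain hits into one record per protein sequence."""
    merged = {}
    for tool, hits in results_by_tool.items():   # hits: {seq_id: [domain names]}
        for seq_id, domains in hits.items():
            merged.setdefault(seq_id, {})[tool] = domains
    return merged

results = {"HMMPfam":     {"seq1": ["domain-A"]},
           "ProfileScan": {"seq1": ["pattern-B"]}}
print(integrate(results))   # {'seq1': {'HMMPfam': [...], 'ProfileScan': [...]}}

A merged record of this kind is what the result database described later in the paper would hold for each analysed sequence.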
Figure 1. The three-dimensional structure of protein C, in which the structural domains are highlighted (1AUT from PDB).
2. The Software Analysis
The databases and tools considered in this study for a high performance grid implementation are freely available under the GNU licence agreement from the EBI's ftp server [ftp://ftp.ebi.ac.uk/pub]. The analysis tools used for the domain search in grid are listed below:
• BlastProDom is a wrapper script on top of the Blast package, used to search against the ProDom families.
• FPrintScan is used to search against the PRINTS collection of protein signatures.
• HMMPIR is a script that searches the Protein Sequence Database of functionally annotated protein sequences in the PIR databases.
• HMMPfam is used to search against the Pfam HMM database, the SMART HMM database and the TIGRFAMs collection of HMMs.
• HMMSmart allows the identification and annotation of genetically mobile domains and the analysis of domain architectures against the SMART database.
• TIGRfam implements the full alignment of all family members and the cutoff scores for inclusion in each of the TIGRFAMs.
• ProfileScan is used to search against the PROSITE profiles collection.
• ScanRegExp is used to search against the PROSITE patterns collection and verify the matches with statistically significant CONFIRM patterns.
• Superfamily is used to search against the SUPERFAMILY database of structures.
• SignalPHMM is used for the prediction and location of signal peptide cleavage sites, using HMMs.
• TMHMM is used to predict the transmembrane helices in proteins using HMMs.
• PANTHER queries a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function.
• Seg is used to identify and mask low compositional complexity segments in amino acid sequences.
• Coil is used to predict coiled-coil regions, using the algorithm of Lupas.
3. The Grid Platform
We have implemented our applications on a wide area grid platform for scientific applications [http://www.grid.it and http://grid-it.cnaf.infn.it] composed of about 1000 CPUs. The grid platform is based on the Globus Toolkit, which represents an ideal communication layer between the different grid components [14]. The Globus Toolkit is an open source software toolkit used for building Grid systems and applications, developed by the Globus Alliance and many others all over the world.
The Globus Toolkit is composed of a series of different services: the security services (GSI), the information services (GIS), the resource management services (GRAM), the data access to storage elements (GASS) and the migration services [15]. The use of this communication middleware to maintain the computational grid is therefore of fundamental importance for sharing the computing resources of different research laboratories, optimizing both the elaboration times and the data management.
Figure 2. Data Management system of the grid used, in which the Replica Manager coordinates the data stored in each storage element through a file catalog integrated with a set of services.
The computational resources are connected to a Resource Broker that routes each job to a specific Computing Element, taking into account the directives of the submission script (called JDL because it is written in the Job Description Language) and implementing a load balancing policy. Each Computing Element runs a Gatekeeper service that submits incoming jobs to a batch system queue in front of a farm of Worker Nodes. Once a job has finished, its output can be retrieved through the personal User Interface. Besides the computing resources, the grid provides a powerful data management system based on flat files. A set of tools coordinates the Storage Elements into an effective distributed file system (Fig. 2). These tools allow data to be allocated to different storage elements, maintaining coherence between them through a Replica Manager, and to be used in an efficient way. The Resource Broker, in fact, is able to redirect the execution of an application to a Computing Element as near as possible to the data location, minimizing the communication time.
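For illustration, a JDL submission script for one piece of the domain search might look as follows; the executable and file names are hypothetical, chosen only to show the structure of the language:

Executable    = "domain_search.sh";
Arguments     = "input.fasta";
StdOutput     = "out.txt";
StdError      = "err.txt";
InputSandbox  = {"domain_search.sh", "input.fasta"};
OutputSandbox = {"out.txt", "err.txt"};

The Resource Broker reads such a description, matches it against the available Computing Elements and applies its load balancing policy before dispatching the job.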
4. System Implementation
The implementation of the protein domain analysis on the grid platform consists of creating an efficient system to coordinate job submission, to check job status and to collect the results. In the submission phase, the input sequences are stored on the User Interface and the scripts to be executed are chosen. For each analysis program to be run, a JDL script is generated with the information about the input sequence, the job requirements and the databases that have to be accessed. The jobs are routed by the Resource Broker to the best Computing Element available at that moment. From the User Interface, the execution of the protein domain analysis is automatically monitored by our software and, in case of failure, the job is resubmitted to the grid infrastructure. When a job completes successfully, its output is retrieved on the User Interface and collected in a result database; when all the jobs of a given task have been performed correctly, the overall output is generated and made accessible to users. The major challenge in porting a bioinformatics application such as protein domain identification to the grid platform is the management of the databases in a distributed way. The protein input sequences, submitted in FASTA format, are analyzed against collections of domains stored in large flat files. To perform the protein domain analysis it is therefore necessary to allocate the databases, in flat file form, on different Storage Elements. In this way a high number of different Computing Elements can be used, on which the elaboration proceeds efficiently in parallel. Consistency between the database replicas is maintained by the Replica Catalog, but when new versions of the databases are released the data have to be stored and replicated manually on the grid infrastructure. In order to make this software rapidly accessible to users, a Web interface was developed. This Web site, called DomainGRID, is used to submit jobs to the grid environment, to visualize the obtained results in a clear form and to hide the complexity of the grid platform. The first step is the user log-in for authorization and authentication: authorization is confirmed by a personal grid certificate released by a CA (Certification Authority), while authentication is controlled by a network of VO (Virtual Organization) servers. To execute a protein domain search, the user chooses the tools to be executed on the grid platform and submits a set of protein sequences in multi-FASTA format, either by pasting a sequence into the textbox or by uploading a file directly.
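The coordination logic described above (submit, poll, resubmit on failure, collect) can be sketched in a few lines of Python wrapping the middleware's command line tools. The command names follow the EDG/LCG-style tools (edg-job-submit, edg-job-status, edg-job-get-output); whether these exact commands were used by the authors is our assumption:

import subprocess, time

def run(cmd):
    """Run a middleware command and return its standard output."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def submit_and_babysit(jdl_file, max_retries=3):
    for attempt in range(max_retries):
        job_id = run(["edg-job-submit", jdl_file]).strip()   # returns the job identifier
        while True:
            status = run(["edg-job-status", job_id])
            if "Done" in status:
                run(["edg-job-get-output", job_id])          # collect the output sandbox
                return job_id
            if "Aborted" in status or "Cancelled" in status:
                break                                        # resubmit this JDL
            time.sleep(60)                                   # poll once a minute
    raise RuntimeError("job failed after %d attempts" % max_retries)

Parsing the real command output is of course more involved; the sketch only shows the control flow of the monitoring loop.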
Figure 3. The protein domain analysis scripts that can be performed on the Italian Grid Platform through the DomainGRID web site.
At this point, users can check the progress of job execution from their personal home page while the protein domain search runs on the grid platform, and can view the output results in real time.
5. Conclusion and Future Work
In this study, a high performance execution environment for protein domain searches has been presented. The software takes protein sequences in multi-FASTA format and analyzes them against different protein domain databases, each containing different protein signatures and recognition methods. At genome scale these tools require very long computation times, and the use of the grid platform considerably reduces the elaboration time. The Web interface is designed to make access to the grid easy and allows users to control the execution of their jobs. Job submission is a simple guided procedure that lets the user insert the input sequence and control the parameters. Finally, through the command history, users can compare previous results with new incoming ones. Protein domain identification on the grid platform can be a good solution for genome scale bioinformatics analyses.
Acknowledgement This work was supported by the MIUR-FIRB projects: Italian Laboratory for Bioinformatics Technology – LITBIO, Enabling Platforms for High-Performance Computational Grids Oriented to Scalable Virtual Organizations – Grid.it, the EU project 026808 BioinfoGRID and ML by the INTAS project Nr.03-51-5218.
References
[1] Islam, S.A.; Luo, J.; Sternberg, M.J. 1995. Identification and analysis of domains in proteins. Protein Engineering 8:513–525.
[2] Marchler-Bauer, A.; Anderson, J.B.; Cherukuri, P.F.; DeWeese-Scott, C.; Geer, L.Y.; Gwadz, M.; He, S.; Hurwitz, D.I.; Jackson, J.D.; Ke, Z.; Lanczycki, C.J.; Liebert, C.A.; Liu, C.; Lu, F.; Marchler, G.H.; Mullokandov, M.; Shoemaker, B.A.; Simonyan, V.; Song, J.S.; Thiessen, P.A.; Yamashita, R.A.; Yin, J.J.; Zhang, D.; Bryant, S.H. 2005. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res., 33:192–196.
[3] Falquet, L.; Pagni, M.; Bucher, P.; Hulo, N.; Sigrist, C.J.; Hofmann, K.; Bairoch, A. 2002. The PROSITE database, its status in 2002. Nucleic Acids Res., 30:235–238.
[4] Attwood, T.K. 2002. The PRINTS protein fingerprint database: functional and evolutionary applications. Briefings in Bioinformatics, 3(3):252–263.
[5] Bateman, A.; Coin, L.; Durbin, R.; Finn, R.D.; Hollich, V.; Griffiths-Jones, S.; Khanna, A.; Marshall, M.; Moxon, S.; Sonnhammer, E.L.; Studholme, D.J.; Yeats, C.; Eddy, S.R. 2004. The Pfam protein families database. Nucleic Acids Res., 32:138–141.
[6] Corpet, F.; Servant, F.; Gouzy, J.; Kahn, D. 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28:267–269.
[7] Letunic, I.; Copley, R.R.; Schmidt, S.; Ciccarelli, F.D.; Doerks, T.; Schultz, J.; Ponting, C.P.; Bork, P. 2004. SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32(1):142–144.
[8] Haft, D.H.; Selengut, J.D.; White, O. 2003. The TIGRFAMs database of protein families. Nucleic Acids Res., 31(1):371–373.
[9] Wu, C.H.; Nikolskaya, A.; Huang, H.; Yeh, L.S.; Natale, D.A.; Vinayaka, C.R.; Hu, Z.Z.; Mazumder, R.; Kumar, S.; Kourtesis, P.; Ledley, R.S.; Suzek, B.E.; Arminski, L.; Chen, Y.; Zhang, J.; Cardenas, J.L.; Chung, S.; Castro-Alvear, J.; Dinkov, G.; Barker, W.C. 2004. PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Research, 32:112–114.
[10] Murvai, J.; Gabrielian, A.; Fabian, P.; Hatsagi, Z.; Degtyarenko, K.; Hegyi, H.; Pongor, S. 1996. The SBASE protein domain library, Release 4.0: a collection of annotated protein sequence segments. Nucleic Acids Res., 24(1):210–213.
[11] Mulder, N.J.; Apweiler, R.; Attwood, T.K.; Bairoch, A.; Barrell, D.; Bateman, A.; Binns, D.; Biswas, M.; Bradley, P.; Bork, P.; Bucher, P.; Copley, R.R.; Courcelle, E.; Das, U.; Durbin, R.; Falquet, L.; Fleischmann, W.; Griffiths-Jones, S.; Haft, D.; Harte, N.; Hulo, N.; Kahn, D.; Kanapin, A.; Krestyaninova, M.; Lopez, R.; Letunic, I.; Lonsdale, D.; Silventoinen, V.; Orchard, S.E.; Pagni, M.; Peyruc, D.; Ponting, C.P.; Selengut, J.D.; Servant, F.; Sigrist, C.J.A.; Vaughan, R.; Zdobnov, E.M. 2003. The InterPro database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31:315–318.
[12] Marcotte, E.M.; Pellegrini, M.; Thompson, M.J.; Yeates, T.O.; Eisenberg, D. 1999. A combined algorithm for genome-wide prediction of protein function. Nature 402:83–86.
[13] Squyres, J.M.; Lumsdaine, A. 2003. A Component Architecture for LAM/MPI. Proceedings, 10th European PVM/MPI Users' Group Meeting. Springer-Verlag Lecture Notes in Computer Science.
[14] Foster, I.; Kesselman, C.; Tuecke, S. 2001. The anatomy of the grid: Enabling scalable virtual organizations. International J. Supercomputer Applications 15.
[15] Foster, I.; Kesselman, C.; Nick, J.; Tuecke, S. 2002. The physiology of the grid: An open grid services architecture for distributed systems integration, open grid service infrastructure. Global Grid Forum.
Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.
TRENCADIS – A WSRF Grid Middleware for Managing DICOM Structured Reporting Objects
Ignacio Blanquer, Vicente Hernandez and Damià Segrelles
Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas, Universidad Politécnica de Valencia, Camino de Vera S/N, 46022 Valencia, Spain (phone: +34 96 387 7007 ext. 88254; fax: +34 96 387 7274; e-mails: iblanque@dsic.upv.es; vhernand@dsic.upv.es; dquilis@itaca.upv.es)
Abstract. The adoption of digital processing of medical data, especially in radiology, has led to the availability of millions of records (images and reports). However, this information is mainly used at the patient level and is organised according to administrative criteria, which makes the extraction of knowledge difficult. Moreover, legal constraints make the direct integration of information systems complex or even impossible. On the other hand, the widespread adoption of the DICOM format has led to the inclusion of information other than radiological images. The possibility of coding radiology reports in a structured form, adding semantic information about the data contained in the DICOM objects, eases the process of structuring images according to content. DICOM Structured Reporting (DICOM-SR) is a specification of tags and sections to code and integrate radiology reports, with seamless references to findings and regions of interest in the associated images, movies, waveforms, signals, etc. The work presented in this paper aims at developing a framework to efficiently and securely share medical images and radiology reports, as well as to provide high throughput processing services. This system is based on an architecture previously developed in the framework of the TRENCADIS project, and uses other components, such as the security system and the Grid processing service, developed in previous activities. The work presented here introduces a semantic structuring and an ontology framework to organise medical images considering standard terminology and disease coding formats (SNOMED, ICD9, LOINC, etc.).1
1. Introduction and Objectives
The work in this paper is related to the TRENCADIS (Towards a Grid Environment for Processing and Sharing DICOM [1] Objects) project, which proposes the definition of a generic and secure Service Oriented Architecture (SOA) for sharing, searching and processing DICOM objects using different ontologies.
1 This work is partially funded by the Spanish Ministry of Science and Technology in the frame of the project “Investigación y Desarrollo de Servicios GRID: Aplicación a Modelos Cliente-Servidor, Colaborativos y de Alta Productividad”, with reference TIC2003-01318.
The main objectives of the work described in this paper are the design and implementation of an architecture and its services, able to support the management of structured reports using Grid technologies. The TRENCADIS project aims at the development of a high-level object-oriented interface and a Grid Middleware to provide the management of DICOM objects, abstracting users from the particularities of Grids and other information technologies. More precisely, the objectives of this work are the following:
• To create and manage DICOM Structured Reports [2] (DICOM-SR) using high-level interfaces. We want to develop a component, integrated in the Grid Middleware, that is able to create structured reports in DICOM format.
• To create and manage DICOM-SR storages for searching, uploading and downloading DICOM-SR documents and their associated images, using high-level objects. These storages can be created depending on a given ontology.
• To use the Internet as the communication network, dealing with the problems of security and latencies.
Section 2 describes the state of the art of Grid technologies and DICOM Structured Reporting; Section 3 summarises the architecture used; Section 4 presents the Grid services dealing with the management of reports; and Section 5 describes the high-level components defined for managing, downloading and uploading DICOM-SRs. Finally, the conclusions and future work are presented in the last section.
2. State of the Art
Sharing structured reports would be a large advantage for research and clinical decision support. Medical communities working on specific areas and diseases would benefit from access to a larger set of examples, relevant cases and diagnoses. The approach for sharing radiology reports in this work requires standard coding of terminologies and diseases, the coding of medical imaging and related information, Grid and Web Service technologies, and ontology specifications. This section describes the status of these basic technologies.
2.1. Grid Technologies
Much effort is being put into the application of Grid technologies to medical imaging storage. Projects such as the European DataGrid (EDG) [3] developed a very efficient framework for the distributed management of data in different areas of science. The application of this approach to medical imaging is being pursued in projects such as BIRN [4] and the MEDIGRID ACI project [5]. These projects have developed their own technology to share computing and storage, extending the functionality of widespread middlewares such as the Globus Toolkit. At the same time, advances in general purpose middlewares have been produced. The Open Grid Services Architecture (OGSA) [6] [7], defined by the Global Grid Forum (GGF) [8], defines a standard and open architecture for the development of applications based on Grid technologies. The Web Service Resource Framework (WSRF) [9] is a specification developed by OASIS [10]. It is a simpler implementation
of OGSA that solves the problem of the stateless nature of Web Services and provides the basic features that Grid Services require. WSRF uses Web Services as the natural interface and provides easier integration into the Web environment. The architecture proposed in this work uses standard components and is fully based on OGSA and WSRF, fostering interoperability with the future Grid infrastructures to be deployed in the different eScience programmes.
2.2. Structured Reporting
Digital Imaging and Communications in Medicine (DICOM) [1] is the standard traditionally adopted by most primary care centres and hospitals in the radiology and cardiology areas for managing digital images. As DICOM spread to new areas of medicine, it was extended to cover the new needs these areas generated (signals, videos, waveforms, structured reports, etc.). One of the most recent extensions of DICOM defines the way to introduce structured reports, adding semantic information about the data contained in the DICOM objects; information such as radiology reports and treatments is being coded into DICOM objects. DICOM Structured Reporting (DICOM-SR) [2] is a specification of tags and sections to code and integrate radiology reports, with seamless references to findings and regions of interest in the associated images, movies, waveforms, signals, etc. The final objective of this extension is to share DICOM headers and to enable structured reports to be handled as DICOM objects, like any other object. It is important to note that, given the different image modalities and medical specialities that use the DICOM standard today, the specification of general purpose structured reports managing the information in a homogeneous and standard way is not possible. Structuring radiology reports is a very important task and constitutes a step beyond coding, since it offers a homogeneous way to structure reports, enhancing the capability of tools to extract knowledge and to search and compare reports. The data to be inserted in a structured report must follow a given structure in order to enable efficient exploration of the information. Currently, plain text is normally used to introduce additional information into DICOM objects; in practice such data are not exploited for research at large scale, since they lack the logical structure that would make simple searching operations feasible. Structured reporting is the framework in which the models and ontologies are defined according to the information relevant to a specific pathology and modality. Measurements, references to images, labelled findings and other records are the basic components of structured reports. One of the most important points in making concepts interoperable is the use of coding, a fundamental and basic issue for structured reports. Many different codings exist, depending on the objective (diagnosis, treatments, etc.). Widely used codifications in structured reporting are the International Coding of Diseases (ICD9) [11], the Systematized Nomenclature of Medicine (SNOMED) [12] and the Logical Observation Identifiers Names and Codes (LOINC) [13], among others. In particular, DICOM-SR permits combining different standard codifications, and even custom codification schemas, to organise the structured reports, which contain medical information in a hierarchical structure. Depending on the procedure, different analyses and sections are included.
In other words, there is a template for each structured report, depending on the procedures applied; these templates have been defined by the American College of Radiology - National Electrical Manufacturers Association (ACR-NEMA). On the other hand, TRENCADIS allows defining ontologies on top of DICOM objects in order to create virtual storages. These ontologies define the fields and the structure in which the information must be specified in the DICOM objects; in the case of Structured Reporting, this follows the sections defined in a given template. This structure enables the creation of index tables that reference subsets of the information according to communities, experiments and search results over the DICOM-SRs. Other standards also incorporate structured reporting in a native form: HL7 provides the Clinical Document Architecture (CDA) [14], incorporated in 1997 for structuring the semantic content of clinical documents, in which the information is coded using the Extensible Markup Language (XML) standard [15]. However, HL7 does not yet impose a rigorous structure on CDA, which is why we use DICOM-SR for managing structured reports. We want to apply Grid technologies to DICOM objects in general and to their associated information coded in the DICOM standard (DICOM-SR).
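To give a flavour of what such template-driven, coded content looks like, the following XML fragment sketches a report item combining a coded concept, a measurement and an image reference. It is an illustrative rendering of DICOM-SR concepts, not actual DICOM-SR encoding, and all codes and identifiers are hypothetical placeholders:

<report template="hypothetical-template">
  <item type="CODE">
    <concept scheme="SNOMED" value="S-000000" meaning="Finding"/>
  </item>
  <item type="NUM">
    <concept scheme="LOINC" value="XXXX-X" meaning="Lesion diameter"/>
    <value unit="mm">8.2</value>
  </item>
  <item type="IMAGE">
    <reference sopInstanceUID="1.2.3.4.5" region="roi-1"/>
  </item>
</report>

An index table for a virtual storage could then be built over the (scheme, value) pairs, which is what makes content-based search across sites possible.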
3. TRENCADIS Architecture
The architecture defined in the TRENCADIS project is horizontal and can deal with different types of objects of the DICOM standard (medical images, structured reports, waveforms, etc.).
Figure 1. General Scheme of TRENCADIS Architecture.
In particular, this paper focuses on the management of structured reports, although the architecture definition is the same as for the management of other kinds of DICOM objects in TRENCADIS. This section briefly explains the different layers defined in the TRENCADIS architecture and their associated features. TRENCADIS is a Service-Oriented Architecture (SOA) in which the usage of resources is represented as Grid services. As described in Figure 1, TRENCADIS comprises five layers:
• Core Middleware Layer. This layer provides the basic resources of the environment (databases, high performance computers, etc.), which are offered as services using well-defined standard interfaces (Web Services Definition Language, WSDL [16]), protocols (Simple Object Access Protocol, SOAP [18]; GridFTP [17]) and data formats (XML Schemas). It provides the upper layers with a unique interface to all resources of the same type. For example, a DICOM storage of structured reports can be implemented using relational databases or a plain directory on a hard disk, but the interaction interface will be the same.
• Server Services Layer. This layer defines the services that implement server tasks. The services can be generic or specific to a given task; for example, a generic service could be one for locating the resources implemented in the Core Middleware and Server Services Layers. This layer interacts directly with the services of the Core Middleware and Server Services Layers and with the upper-level components.
• Communication Layer. It defines the protocols used by the services implemented in the lowest layers (Core Middleware and Server Services). The GridFTP protocol is used for transferring large amounts of data, and SOAP over HTTPS is used for interacting with the services.
• Components Middleware Layer. This layer contains the high-level components that interact with the services of the Core Middleware and Server Services Layers. These components provide applications with an object-oriented interface for developing applications that manage, process and share DICOM objects.
• Applications Layer. This layer comprises the applications for managing, processing and sharing DICOM objects. Section 6 describes an application implemented for creating and searching DICOM structured reports.
Finally, security is another key issue in the architecture. Three main issues concern security: access to services, communication, and privacy of data. For the first two, the system uses duly signed X.509 certificates and secure communication protocols based on the Secure Sockets Layer (SSL) [20]; in this context, SOAP on top of HTTPS is used for interacting with the services, and secure GridFTP is used for large blocks of data. Data privacy, on the other hand, must be preserved even when the data are stored at a remote location. TRENCADIS is being designed to interface with an encrypted storage system [19] that ensures that even users with administrative privileges at local sites cannot access the information, even if they can access the data files. Only users with the right privileges can access the key servers, build the decryption key and therefore decrypt the object. Finally, the authentication of users is solved using SSL and X.509 certificates.
4. Grid Services
This section briefly describes the Grid Services implemented in the Core Middleware and Server Services Layers that are needed to build components for sharing, searching, downloading and uploading DICOM-SRs. These components can be used to develop different applications, such as knowledge databases, training environments or clinical decision support tools. Input to and output from the components of the different layers are coded as XML documents.
4.1. Core Middleware Layer
This layer defines the basic Grid Services that interact directly with the resources. Only the components concerning the management and sharing of DICOM-SRs are described here; further information on the rest of the components can be found in [21][22][23].
4.1.1. Storage Structured Report Service
It offers the services needed for sharing DICOM-SRs and DICOM objects in general. Although only a plain directory storage resource is currently implemented, the interfaces are very general and open to many other possible implementations, such as relational databases. The interfaces offered by this service are the following:
• ReportInsert: inserts a structured report in the DICOM storage and updates the storage broker.
• xmlSearch: searches for structured reports in the DICOM storage and returns the information.
• xmlSRDownloadInit: prepares a structured report for downloading. The information is returned in a previously defined XML structure.
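Seen from a client, these three operations suggest a simple interface; the following Python sketch mirrors them with an in-memory stand-in (the classes, signatures and notification hook are our own illustration, not the TRENCADIS API):

# Hypothetical stand-in for the Storage Structured Report Service interface.
class ToyBroker:
    def update(self, report_id):
        print("indexed", report_id)       # mirrors the broker update on insertion

class ToySRStorage:
    def __init__(self, broker):
        self.reports = {}                 # report_id -> DICOM-SR document (XML string)
        self.broker = broker

    def report_insert(self, report_id, sr_document):
        self.reports[report_id] = sr_document
        self.broker.update(report_id)

    def xml_search(self, predicate):
        return [rid for rid, doc in self.reports.items() if predicate(doc)]

    def xml_sr_download_init(self, report_id):
        # a real service would stage the document and return transfer metadata
        return {"report_id": report_id, "size": len(self.reports[report_id])}

The real service is exposed through WSDL/SOAP, so a client would invoke these operations as Web Service calls rather than local methods.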
4.2. Server Services Layer
This layer defines four Grid Services that are necessary to achieve the objective of creating DICOM storages. Several services have a general functionality in the TRENCADIS middleware (such as the Index Information Service and the Storage Broker Service), while others are specific to DICOM-SR management (such as the Codes Service and the Ontology Server Service).
4.2.1. Index Information Service
It locates the services in the Grid infrastructure and provides the most relevant information about them. There is only one interaction interface:
• xmlSearchResource: searches for a resource previously registered in this service and returns its location in XML format.
4.2.2. Storage Broker Service
This service stores information related to the DICOM objects held by the Storage Structured Report Services, so that the information can be indexed in an optimal way when a search is required. The interfaces are the following:
• xmlGetDICOMStorage: searches for the DICOM storages that match the filters previously defined by the user.
• iUpdateDICOMStorage: updates the information of a DICOM object in the indexing system.
4.2.3. Codes Service
This service is in charge of the coding schemas and values used in the structured reports, providing the codes and their meanings when required. The interfaces are the following:
• xmlGetCodes: gets all the codes of a given schema in an XML document.
• xmlGetCodeSchema: gets the complete information about a given code and schema in an XML document.
• xmlGetAllCodeSchema: gets all the codes, with their complete information, from a given schema.
5. Middleware Components
The highest-level components are described in this section. These are the components of the TRENCADIS architecture that interact directly with the applications. Furthermore, an application for managing, sharing and searching structured reports has been implemented on top of the architecture using these components.
Core Middleware Layer
C_GRID_Session NEURO VO C_GRID_DICOM_SR_Upload C_GRID_DICOM_SR Code Server Service
iRegistre
ad _S R
xmlSearch
xmlSearchResource
xmlGetDICOMStorage (Storage SR )
Storage Broker Service
Up lo
C_GRID_SR_Storage R1 231 Virtual Repository R1 342 MRI_Neuro_Stroke R2 332
iReport Insert
Index Service
x mlGetCodes x mlGet CodeSchema
xmlSRDownloadInit
LOINC SNOMED IDC9
iUpdateDICOMStorage
Structured Report Storage Service Services Server Layer
ReportSearch
231 234 342 356
MRI CT MRI MRI
ReportInsert
Neuro Neuro Neuro Neuro
St rok e St rok e St rok e Tumour
Structured Report Storage Service ReportSearch
Female Male Male Female
201 214 332 656
MRI CT MRI CT
ReportInsert
Neuro Cardio Neuro Cardio
Tumour Strok e Strok e Strok e
Male Male Female Female
Figure 2.General Interaction Scheme between Components of Different Layers.
Figure 2 describes an example of the interaction of all the components and services. It shows two independent local repositories containing four studies each: repository 1 holds four neuroimaging studies, and repository 2 holds two neuroimaging and two cardioimaging studies. These two repositories are virtualized through the Storage Structured Report Service. Of all the images available, users of a given Virtual Organization (VO) will only be able to access a subset comprising those studies that match the filter criteria of the VO (the "Neuro" body part in the example of the figure). From this subset, a virtual storage is created for all the images of the VO in which the structured report indicates the presence of a stroke. The index service creates a table for each repository and virtual storage indicating whether matching images exist in each repository. More refined queries (e.g. female patients) are performed in parallel on the repositories that have studies matching the virtual storage criterion. The next subsections describe the SR components; for the sake of readability, their behaviour is explained using this example.

5.1. Security Package

The first step in interacting with the Grid Services implemented in the lower layers is to open a user session in the Grid environment. For this task, users must create a proxy with their user certificate and private key. This task is performed by the C_GRID_Session component. The instances of C_GRID_Session are used by the other high-level components for transferring data and interacting with the different Grid Services. In Figure 2, the C_GRID_Session created refers to a Virtual Organisation called "NEURO", which is related only to neuroimaging.

5.2. Structured Report Package

This package is responsible for managing and creating new structured reports in DICOM format. It defines the C_GRID_DICOM_SR component, which consults and retrieves the data from a DICOM-SR that has been previously downloaded. DICOM-SR objects can be managed as a group through the C_GRID_Set_DICOM_SR component. Figure 2 shows an instance of a C_GRID_DICOM_SR component. This instance interacts with the resource implemented as a Code Server Grid Service, which provides different codifications (LOINC, SNOMED, etc.) for the concepts of the DICOM-SR document. The most important methods used in the Code Server Grid Service are xmlGetCodes and xmlGetCodeSchema, which retrieve the codes and the code schema.

5.3. Structured Report Storage Package

This package contains the components for accessing DICOM storages in a transparent way. The component C_GRID_SR_Storage enables creating a virtual storage that contains DICOM structured reports from different sites; each site is implemented as a Storage Structured Report Grid Service in the lower layers. This component also interacts with other Grid Services. Its most important function is the creation of the instances needed to locate the sites where the structured reports are saved. Figure 2 shows the interaction between the C_GRID_SR_Storage component and the Storage Broker Grid Service when an instance of this component is created. The component uses the xmlGetDICOMStorage method of the Grid Service to retrieve the location of the Storage Structured Report Grid Services, and it knows the location of the Storage Broker Grid Service by consulting the Index Grid Service through the xmlSearchResource method. Another important operation of the C_GRID_SR_Storage component is the search for DICOM objects and the retrieval of information from the Storage Structured Report Grid Services. The xmlSearch method implemented in the Grid Services retrieves the list of DICOM objects that match a search criterion. The search in each Storage Structured Report Grid Service is independent and is executed in parallel in the different Grid Services involved. The inclusion of new reports (iReportInsert) will potentially lead to an update of the index service, if a repository that previously contained no relevant object is the target storage for a new, relevant report.

5.4. Upload/Download Structured Report Package

This package provides the components for uploading structured reports created with the C_GRID_DICOM_SR component from client sites to the DICOM storage Grid Services, as well as the components for downloading structured reports from the DICOM storage Grid Services to the user system. The components for uploading and downloading structured reports are C_GRID_DICOM_SR_Upload and C_GRID_DICOM_SR_Download; both use an instance of C_GRID_DICOM_SR. Figure 2 shows the interaction with the Storage Structured Report Grid Service through the xmlSRDownloadInit method, which prepares the structured report for being downloaded. Another component in this package, C_GRID_DICOM_SR_Set_Download, manages the downloading of a set of DICOM structured reports. These components also interact with the C_GRID_Download_Order component, which defines the order in which C_GRID_DICOM_SR_Set_Download downloads the structured reports.

6. Application

To demonstrate the usage of the components defined in the Middleware Components layer, an application for clinical decision support has been implemented. The main objective is the creation of virtual storages of digital images for a given ontology, and the creation of DICOM-SRs that reference them. The main functionalities that this application offers are the following:
• To initiate a Grid session for using the different Grid Services and resources deployed in the same Virtual Organisation, using X.509 user certificates.
• To create a virtual storage, choosing a given ontology of digital images.
• To create a DICOM-SR for a set of digital images. Each DICOM-SR follows a given ontology and can be uploaded to the same site where the related images reside.
• To search for DICOM-SRs using the fields defined in the ontology chosen for the DICOM-SR storage.
An example of the usage of this application can be given with the template for Computer-Aided Detection (CAD) Mammography Structured Reporting (SR) published in Supplement 50 of ACR-NEMA. It considers four basic sections (TID4015 Detections Performed, TID4016 Analysis Performed, TID4001 Overall Impression and TID4020 Image Library). In the "Overall Impression" section, each important finding is registered (TID4009 Individual Calcification, TID4010 Cluster of Calcifications, TID4011 Density, etc.). So, given a virtual repository of mammograms in which the tags TID4009, TID4010 and TID4011 are included in the ontology criteria, it will be possible to create a virtual storage and search using them, for example, for "all the images with calcifications of a density above a defined threshold", regardless of whether they refer to a cluster or a single calcification. Images from the SRs matching the search can be brought up, outlining the regions of interest denoted by the selected tags.
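The kind of query described above can be pictured with a small, self-contained sketch. The dictionary layout, image identifiers and density values below are illustrative assumptions, not the DICOM-SR encoding itself:

```python
# Hypothetical flattened view of SR findings indexed by the virtual storage.
reports = [
    {"image": "mg_001", "tid": "TID4009", "density": 0.82},  # single calcification
    {"image": "mg_002", "tid": "TID4010", "density": 0.67},  # cluster of calcifications
    {"image": "mg_003", "tid": "TID4011", "density": 0.91},  # density finding
]

def calcifications_above(reports, threshold):
    """All images with calcifications above a density threshold, regardless
    of whether they refer to a cluster or a single calcification."""
    return [r["image"] for r in reports
            if r["tid"] in ("TID4009", "TID4010") and r["density"] > threshold]

print(calcifications_above(reports, 0.5))  # -> ['mg_001', 'mg_002']
```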
7. Performance and Results

Some results have been published in [21]. These results cover the creation of a C_GRID_SR_Storage and the search for DICOM objects in instances of these components. The main conclusion of that publication regarding the search process is that the search time depends only on the database back-ends, the size of the results and the bandwidth of the network. Queries are performed in parallel. Obviously, as the number of images grows, the search time increases, but the architecture does not introduce a significant additional overhead, since global management is very lightweight. The system is scalable, since the centralised information is minimal. If a new Structured Report Storage is deployed in the system, the Index Grid Service and the Storage Broker Grid Services update their catalogues for each VO, requesting the new repository to inform the centralised services about the availability of studies matching the active virtual storages. The search time is bounded by the slowest Storage Broker Grid Service. The inclusion of a new study in a repository that already contains studies matching the criteria does not produce any change in the central services. Data transfer is performed directly from the distributed repositories to the clients, thus avoiding bottlenecks.
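The fan-out pattern behind these results can be sketched as follows; the repository names and timings are placeholders, and query_repository stands in for the remote xmlSearch call:

```python
from concurrent.futures import ThreadPoolExecutor
import random
import time

def query_repository(name):
    """Stand-in for one independent xmlSearch call to a remote repository."""
    time.sleep(random.uniform(0.1, 0.5))   # simulated back-end + network time
    return (name, ["study-1", "study-2"])  # simulated XML result

repositories = ["hospital-A", "hospital-B", "hospital-C"]
start = time.time()
with ThreadPoolExecutor(max_workers=len(repositories)) as pool:
    results = dict(pool.map(query_repository, repositories))
# The elapsed time tracks the slowest repository, not the sum over all of them.
print(f"search completed in {time.time() - start:.2f}s over {len(results)} sites")
```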
8. Conclusions and Future Work

This paper describes the work performed on the ontology-based indexing of a virtual database of DICOM objects using the contents of structured reports. The work has been integrated in the frame of the TRENCADIS architecture, which is being developed for the deployment of a cyber-infrastructure focusing on oncology and medical imaging. This cyber-infrastructure will link seven public and private hospitals of the Valencia region. The middleware developed constitutes an efficient framework for managing DICOM-SR objects using ontologies. Since it offers a high-level object-oriented interface, it increases the productivity of developers building applications for managing DICOM-SRs. Application developers can rely on the virtual components offered by TRENCADIS independently of the final implementations. New services and components are being developed for image post-processing.
The components have been developed in Java, but migration to other languages (C++, etc.) is feasible, since the protocols used in the architecture are standard and supported by most language systems. The work is being extended in three main areas. First, work is being carried out with medical users on the implementation of a DICOM-SR template that will fit the medical cases to be used on the cyber-infrastructure. Secondly, security and privacy are being improved through the integration of the encryption and policy management environment already developed in other activities. Finally, post-processing services for co-registration, volume rendering and fMRI will be included.
References

[1] Digital Imaging and Communications in Medicine (DICOM) Part 10: Media Storage and File Format for Media Interchange. National Electrical Manufacturers Association, 1300 N. 17th Street, Rosslyn, Virginia 22209, USA.
[2] D.A. Clunie, "DICOM Structured Reporting". ISBN 0-9701369-0-0.
[3] "European DataGrid Project". www.eu-datagrid.org.
[4] "BIRN: Biomedical Informatics Research Network". www.nbirn.net.
[5] "ACI project MEDIGRID: Medical Data Storage and Processing on the GRID". http://www.creatis.insa-lyon.fr/MEDIGRID.
[6] "Open Grid Services Architecture (OGSA)". http://www.globus.org/ogsa.
[7] I. Foster, C. Kesselman and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations". International Journal of Supercomputer Applications, 15(3), 2001.
[8] "Global Grid Forum (GGF)". www.gridforum.org.
[9] "The WS-Resource Framework". www.globus.org/wsrf.
[10] "Organization for the Advancement of Structured Information Standards (OASIS)". http://www.oasis-open.org/home/index.php.
[11] U.S. Department of Health and Human Services, Public Health Service, Health Care Financing Administration. "ICD-9-CM: International Classification of Diseases", 9th Revision, Clinical Modification, Fourth Edition, Volumes 1, 2 and 3: Official Authorized Addendum, Effective October 1, 1992. "Coding Clinic for ICD-9-CM", no. Special Edition (1992): 1-63.
[12] "SNOMED International". http://www.snomed.org.
[13] C.J. McDonald, S.M. Huff, J.G. Suico, G. Hill, D. Leavelle, R. Aller, A. Forrey, K. Mercer, G. DeMoor, J. Hook, W. Williams, J. Case and P. Maloney (2003). "LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-year Update". Clin Chem, 49(4), 624-633.
[14] R.H. Dolin, L. Alschuler, S. Boyer, C. Beebe, F.M. Behlen, P.V. Biron and A. Shabo, "HL7 Clinical Document Architecture", Release 2. J Am Med Inform Assoc. 2005 Oct 12.
[15] R. Allen Wyke and A. Watt, "XML Schema Essentials". Wiley Computer Pub. ISBN 0-471-41259-7.
[16] "Web Services Description Language (WSDL) 1.1". W3C Note, 15 March 2001. http://www.w3.org/TR/2001/NOTE-wsdl-20010315.
[17] "The GridFTP Protocol and Software". http://www-fp.globus.org/datagrid/gridftp.html.
[18] "SOAP Version 1.2". http://www.w3.org/TR/soap.
[19] E. Torres, "Protección de la Privacidad en Entornos Grid en el Ámbito de la Salud: Tratamiento Seguro de Datos Distribuidos Mediante Encriptación y Gestión Compartida de las Claves". Research Report, DSIC, Universidad Politécnica de Valencia, 2005.
[20] R. Oppliger, "Security Technologies for the World Wide Web". Second edition. Computer Security Series. Artech House. ISBN 1-58053-348-5.
[21] I. Blanquer, V. Hernández and D. Segrelles, "An OGSA Middleware for Managing Medical Images using Ontologies". Journal of Clinical Monitoring and Computing. ISSN 1387-1307.
[22] C. de Alfonso, I. Blanquer, V. Hernández and D. Segrelles, "Web-Based Application Service Provision Architecture for Enabling High-Performance Image Processing". Lecture Notes in Computer Science. ISBN 3-540-25424-2, ISSN 0302-9743.
[23] I. Blanquer, V. Hernández, D. Segrelles and F. Mas, "Creating Virtual Storages and Searching DICOM Medical Images through a Grid Middleware based in OGSA". BioGrid. ISBN 0-7803-9075-X.
GATE Simulation for Medical Physics with Genius Web Portal

C.O. Thiam a, L. Maigne a, V. Breton a, D. Donnarieix a,b, R. Barbera c, A. Falzone c

a Laboratoire de Physique Corpusculaire, 24 avenue des Landais, 63177 Aubière cedex, France
b Unité de Physique Médicale, Département de Radiothérapie-Curiethérapie, Centre Jean Perrin, BP 392, 63011 Clermont-Ferrand cedex, France
c Dipart. di Fisica e Astronomia, INFN, and ALICE Collaboration, via S. Sofia 64, I-95123 Catania, Italy

1. Introduction

The PCSV team of the LPC laboratory in Clermont-Ferrand is involved in the deployment of biomedical applications on the grid architecture. One of these applications deals with the deployment of GATE (Geant4 Application for Tomographic Emission) for medical physics. The aim of the developments currently under way is to enable the use of the GATE platform in clinical routine. However, this is only possible if the computing time is sharply reduced, since GATE Monte Carlo simulations applied to dose calculation in radiotherapy, for example, require several hours of computation to achieve a precise result. The new grid architecture developed within the framework of the European project Enabling Grids for E-sciencE (EGEE) [1] answers this requirement. The use of grid resources must be transparent, easy and rapid for medical physicists. For this purpose, we adapted the GENIUS web portal to facilitate the planning of GATE simulations on the grid. The GENIUS development project, started in 2002, was initiated by the Italian INFN Grid Project [4]. Its goal was to create a web portal that would overcome all the difficulties related to the complex command line interfaces (CLI) and the Job Description Language (JDL) that a user encounters when submitting an application in a grid environment. We first describe the architecture and implementation of the portal and the services offered, and finally the work carried out since April 2004 to customize the portal in order to allow transparent GATE job submission and management on the EGEE-LCG2 infrastructure. We present some results on computing time, demonstrating the impact of grid resources on the optimization of GATE.

2. GATE: Monte Carlo simulation

GATE is a generic Monte Carlo platform based on GEANT4 (current version 4.7.0.p01) [2]. Specific modules are added on top of GEANT4 to meet SPECT (Single Photon Emission Computed Tomography) and PET (Positron Emission Tomography) requirements and to facilitate the usage of the code. GATE can also be applied to model brachytherapy and radiotherapy applications. In this paper the GATE platform is used to calculate the relevant dosimetric quantities for treatment planning in brachytherapy and radiotherapy.
3. The parallelization method

Each Monte Carlo simulation uses a sequence of random numbers to generate the physical interactions in matter. The more numerous the interactions in a medium, the longer the sequence of random numbers generated for the simulation. A simple way to reduce the execution time of a simulation is to sub-divide a long or very long simulation into smaller ones, assigning to each simulation a sub-sequence of random numbers obtained by partitioning a long sequence. The sub-sequences have to be independent; this method is valid only because the particles emitted in the simulations are independent. An obvious way to obtain parallel random number streams is to partition the sequence of a given generator into suitable independent sub-sequences, as in the Sequence Splitting Method [3].
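A minimal Python sketch of the idea follows. It materialises the full sequence for clarity, whereas a production implementation would instead fast-forward the generator to the start of each block; the job count and sub-sequence length are arbitrary illustration values.

```python
import random

def split_sequence(seed, n_jobs, numbers_per_job):
    """Sequence Splitting: carve one long pseudo-random sequence into
    contiguous, non-overlapping sub-sequences, one per simulation job."""
    rng = random.Random(seed)
    full = [rng.random() for _ in range(n_jobs * numbers_per_job)]
    return [full[j * numbers_per_job:(j + 1) * numbers_per_job]
            for j in range(n_jobs)]

# Each of the 4 jobs receives its own independent block of 1000 numbers.
sub_sequences = split_sequence(seed=42, n_jobs=4, numbers_per_job=1000)
```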
4. GENIUS and GENIUSphere

4.1. Description

The GENIUS [5] web portal is built on top of the middleware services of the EGEE infrastructure. The layout of the portal can be described by a three-tier architecture model (Figure 1). On the user workstation, a web browser such as Internet Explorer or Mozilla is used to access the EGEE grid infrastructure through the User Interface. The EGEE User Interface (UI) machine runs the Apache web server, the Java/XML framework EngineFrame developed by NICE, and GENIUS itself. From the User Interface, via the portal, the user has access to the remote grid resources.
Figure 1. Three-tier architecture of the GENIUS portal.
Using those services, the user can interact with files on the UI, launch jobs to the grid and manage the data of the Virtual Organisation to which he belongs. In order to guarantee secure access to the grid, GENIUS has been implemented with a multi-layered security infrastructure:

• All web transactions are executed under the Secure Socket Layer (SSL) via HTTPS.
• The user has to obtain an account on a UI machine of the grid and be part of a VO.
• Each time a user wants to interact with files on the UI machine, he is prompted for the username and password he obtained on that machine.
To increase security, usernames and passwords typed in the browser are not saved anywhere; they are streamed under the HTTPS protocol and then destroyed. The GENIUS portal logs out automatically if no action has been taken in the web browser for more than 30 minutes, to avoid undesired accesses (Figure 2). Once the user is authenticated and authorized to use the portal, he has access to more than 100 functionalities. All the GENIUS services are described in the User's Manual, which can be browsed from the official site.
Figure 2. Screenshot of the web portal Genius.
Figure 3. GENIUS new version and GENIUSphere.
A new version of GENIUS (Figure 3) is under validation to bring greater flexibility to the use of grid resources through the web portal. This version, which benefits from newer programming technologies, greatly improves the way code is written for grid integration. Grid machines will be able to communicate easily, and file browsing for the GATE application will work transparently, with easy file downloading.

4.2. Encoded functionalities for GATE applications on the GENIUS web portal

In order to enable a transparent and interactive use of GATE applications on the grid, we have developed, in collaboration with the Department of Physics and Astronomy of the Catania University, all the functionalities needed to run GATE simulations on distributed resources (Figure 4). These developments are intended for researchers of the GATE collaboration who need access to large computing resources to run long simulations, but they have also been designed to answer the needs of physicians and medical physicists in a clinical structure (typically the Centre Jean Perrin) who compute high-precision Monte Carlo dosimetric studies for specific applications, for instance ocular brachytherapy treatments using ophthalmic applicators with radioactive sources. In the following we explain the current developments on the GENIUS portal and GENIUSphere to enable transparent and convenient access to the grid for GATE applications. Several languages are used to implement functionalities on the portal:

• Bash scripts define and call all the command line interfaces needed by the application, create files such as scripts and JDLs, and handle the interactions with the grid (submission, monitoring).
• XML files define the buttons and functionalities appearing on the portal screen.
• Java scripts allow other applications to be interfaced automatically.
Figure 4. Screenshot of the web portal Genius related to the creation of GATE files.
The GATE service on the web portal essentially consists of two parts: "Jobs Services", for the submission and monitoring of jobs, and "Data Services", for the registration and management of medical data files. In the "Jobs Services" part, the Job Settings folder provides the user with a web page enabling the creation of the GATE files (Figure 4), their removal from the UI, and the creation of the JDL (Job Description Language) files needed to launch them on the grid. From this web page, the user also uploads all the information related to the GATE macro and fixes the number of partitions for the job. Once the jobs are submitted, the user can monitor them by checking their status. At the end of the execution, when all jobs are done, a spooler directory is created containing the outputs of the jobs.
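To give a flavour of what the Job Settings folder automates, the sketch below writes one JDL file per partition of a GATE simulation. The JDL attribute names (Executable, Arguments, InputSandbox, OutputSandbox) are standard, but the wrapper script, macro and file names are hypothetical, not the portal's actual internal names.

```python
def write_gate_jdl(job_id, macro="gate.mac"):
    """Write a minimal JDL file for one sub-job of a partitioned simulation."""
    jdl = (
        f'Executable = "gate_wrapper.sh";\n'
        f'Arguments = "{macro} {job_id}";\n'
        f'InputSandbox = {{"gate_wrapper.sh", "{macro}", "seeds_{job_id}.txt"}};\n'
        f'OutputSandbox = {{"std.out", "std.err", "result_{job_id}.root"}};\n'
    )
    with open(f"gate_{job_id}.jdl", "w") as f:
        f.write(jdl)

for job in range(10):  # e.g. a simulation split into ten sub-jobs
    write_gate_jdl(job)
```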
5. Computing time tests

GATE being available on several EGEE sites, with a combined capacity of approximately 3000 CPUs, we carried out tests in order to estimate the gains in computing time. The tests used a radiotherapy treatment simulation generating several million events (20,000,000 particles). This simulation was divided successively into series of ten, twenty, fifty and one hundred jobs, which were then deployed on the grid. We evaluated the time spent in the different states, from the moment a job is submitted to the retrieval of its output.

Figure 5. Times comparison in relation to job states. [Bar chart; y-axis: time (minutes); series: local execution (Intel Xeon CPU 3.06 GHz) and splits into 10, 20, 50 and 100 jobs.]

Figure 6. Comparison between local time and grid computing time. [Same series plotted against the number of jobs.]
The values represented in Figure 5 describe the states a job passes through during its life on the grid. Once the request is addressed to the Resource Broker (RB), the job state is "Submitted". The state is "Waiting" while the RB queries the information system to find a Computing Element (CE) suited to accommodate the job; "Ready" when the request is prepared for submission to the Computing Element; "Scheduled" while the job sits in a batch queue on the Computing Element; "Running" while it is executed; and "Done" when the execution has finished successfully, at which point the result can be retrieved by the user. The time spent in the Scheduled state can be significant (Figure 5) and depends directly on the load of the Computing Element and the computing performance of its nodes. The other times are almost negligible compared to the total time a job spends on the grid. Figure 6 compares the computing time on a local machine with the times obtained for each series in the test. Within the framework of our application, splitting a GATE simulation on the grid architecture yields a very significant gain, with a crunching factor that can reach 20 (Figure 6). The computing time is not necessarily proportional to the number of jobs running in parallel; this is due in part to parameters such as the launching time, the number of free CPUs, and the load and performance of the Computing Elements, which it is important to take into account. Submission and retrieval times can also become significant with large sequential or multithreaded submissions.
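These observations suggest a simple timing model; the decomposition below is our own hedged summary of the measurements, not a formula from the paper:

```latex
% T_grid(n): total time for a simulation split into n jobs;
% the slowest job dominates, plus fixed submission/retrieval overheads.
\[
T_{\mathrm{grid}}(n) \approx T_{\mathrm{submit}}(n)
  + \max_{1 \le i \le n}\Bigl(T_{\mathrm{sched},i} + \tfrac{T_{\mathrm{local}}}{n}\Bigr)
  + T_{\mathrm{retrieve}}(n),
\qquad
C(n) = \frac{T_{\mathrm{local}}}{T_{\mathrm{grid}}(n)}
\]
```

Here C(n) is the crunching factor, which saturates (around 20 in these tests) once scheduling and retrieval overheads dominate the per-job computation.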
6. Conclusion

In the long term, the GENIUS web portal will allow the GATE application to be used in brachytherapy and radiotherapy treatment planning with medical data (medical images, DICOM, binary data for dose calculation in heterogeneous media) and the results obtained to be analysed in visual form. Other functionalities are under development that will make it possible and easy to register medical data on grid storage elements and to manage them. The work carried out to split the simulations reduces the computing time considerably. The figures obtained are very good, bearing in mind that the clusters used for these tests are significantly loaded. The computing grid gives promising results and meets a definite need: to reach acceptable computing times for a future use of Monte Carlo simulations in treatment planning for brachytherapy and radiotherapy. Our GATE activities for dosimetry applications have entered a direct evaluation phase at the cancer treatment centre of Clermont-Ferrand (Centre Jean Perrin). A workstation is currently available in this centre to test the use of the GATE application on the grid through GENIUS and GENIUSphere.
References
[1] EGEE Project: http://public.eu-egee.org/.
[2] S. Jan, G. Santin, D. Strul, S. Staelens et al., "GATE: a simulation toolkit for PET and SPECT", Phys. Med. Biol., 49 (2004) 4543–4561.
[3] L. Maigne, D. Hill, P. Calvat, V. Breton, R. Reuillon, D. Lazaro, Y. Legré, D. Donnarieix, "Parallelization of Monte Carlo Simulations and Submission to a Grid Environment", Parallel Processing Letters, Vol. 14, No. 2 (June 2004), pp. 177–196.
[4] A. Andronico, R. Barbera, A. Falzone, P. Kunszt, G. Lo Re, A. Pulvirenti, A. Rodolico, "GENIUS: a simple and easy way to access computational and data grids", Future Generation Computer Systems, Volume 19, Issue 6, pages 805–813, 2003.
[5] GENIUS web site: http://genius.ct.infn.it.
Biomedical Applications in EELA 1

Miguel Cardenas a, Vicente Hernández b, Rafael Mayo b, Ignacio Blanquer b, Javier Perez-Griffo a, Raul Isea c, Luis Nuñez c, Henry Ricardo Mora d and Manuel Fernández d

a Extremadura Advanced Research Center (CIEMAT)
b Universidad Politécnica de Valencia, ITACA - GRyCAP
c Universidad de los Andes
d Cubaenergía

1 E-Infrastructure Shared Between Europe and Latin America: http://www.eu-eela.org/

Abstract. The current demand for Grid infrastructures to bring together collaborating groups in Latin America and Europe has created the EELA project. This e-infrastructure is used by biomedical groups in Latin America and Europe for studies of oncological analysis, neglected diseases, sequence alignments and computational phylogenetics.
1. Introduction

Funded by the European Commission, the EELA project ("E-Infrastructure shared between Europe and Latin America") builds a digital bridge between the existing e-infrastructure initiatives that are in the process of consolidation in Europe (in the framework of the European EGEE project) and those that are emerging in Latin America, through the creation of a collaborative network sharing an interoperable Grid infrastructure to support the development and testing of advanced applications. EELA aims to position the Latin American countries at the same level as the European developments in terms of e-infrastructures. Now that the network infrastructure in Latin America is stable, the EELA focus will be on the Grid infrastructure and on some related e-Science applications. The project's participant institutions have therefore identified two fundamental scopes: the creation of a human network in e-Science, assessing its needs and providing training, and the conduction of the technological developments that will allow Grid deployment and operation in the region, establishing a collaborative network of research scientists. This document describes the biomedical applications currently deployed on the pilot EELA infrastructure for both production and dissemination purposes.

2. GATE (Geant4 Application for Tomographic Emission)

GATE is a C++ platform based on the Monte Carlo Geant4 software. It has typically been used to model nuclear medicine applications, such as PET and SPECT, within the OpenGATE collaboration [5]. Its functionalities, combined with its ease of use, make this platform also adequate for radiotherapy and brachytherapy treatment planning. However, Monte Carlo simulations are computationally intensive, preventing hospitals and clinical centres from using them in daily practice. As a result, the objective of GATE is to use the Grid environment to reduce the computing time of Monte Carlo simulations in order to provide higher accuracy in a reasonable period of time. Nine Cuban centres are currently testing, as users, the results of the simulation of radiotherapy treatments using the realistic models that GATE provides. The interest of this community centres on two main oncological problems:

• Thyroid cancer is endemic in many areas of Cuba [1], thyroid disease being one of the five main causes of endocrinology treatments.
• Treatment of metastases with P-32 [2]. Brachytherapy using P-32 isotopes is a procedure that is showing very good results in Cuba. Improving knowledge of the doses received by the different tissues through accurate simulation is a key issue.

The main benefit of using the Grid is that it has enabled medical users to access realistic Monte Carlo simulation for their research in radiotherapy planning. Without the EELA Grid, medical users would not have enough computational resources to deal with the large requirements of this processing. The Grid will increase the performance of the application but, more importantly in this case, it is an enabling technology opening the door to a new range of applications. The Cuban centres currently testing this application bring around 90 cases per month to the EELA community.
3. Wide In Silico Docking On Malaria (WISDOM)

The objective of WISDOM is to propose new inhibitors for a family of proteins produced by Plasmodium falciparum. This protozoan parasite causes malaria, which affects around 300 million people; more than 4,000 people die from it every day worldwide. Drug resistance has emerged for all classes of antimalarials except artemisinins. The main reason is that the available drugs focus on a limited number of biological targets, producing cross-resistance between antimalarials. There is a consensus that substantial scientific effort is needed to identify new targets for antimalarials. The main problem is that the development of new drugs with new targets is a costly and lengthy process, and the economic profit is not clear to drug manufacturers.

This application consists of the deployment of a high-throughput virtual screening platform in the perspective of in silico drug discovery for neglected diseases. The WISDOM platform 2 performs high-throughput virtual docking of the millions of chemical compounds available in ligand databases against several targets of plasmepsin [6]. The interest of the EELA partners is basically centred on three actions:

• The study of new targets for other parasitic diseases, such as dengue, an endemic disease that affects various countries in Latin America, where the fight against it has been very strong. Notwithstanding that several regions in Latin America are free of dengue, many others suffer from periodic epidemics. These targets will be added to the ones selected in the Data Challenge to maximise the exploitation of the resources.
• The selection of new targets for malaria. This includes new lines of research on drug identification different from the ones selected in the WISDOM studies and more interesting for the Latin American communities, such as the DHFR protein and its chlorate derivatives.
• The contribution of resources to the WISDOM Data Challenge.

2 This platform is being jointly developed by SIMDAT, SwissBioGrid, the Swiss Institute of Bioinformatics, the INSTRUIRE regional grid in Auvergne and the CampusGRID.

4. Basic Local Alignment Search Tool (BLAST)

One of the most important efforts in the analysis of the genome is the study of the functionality of the different genes and regions. Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes. Both functional and evolutionary information can be inferred from well-designed queries and alignments. The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences with sequence databases and calculates the statistical significance of the matches. This process of finding homologues of sequences is computationally intensive. The size of the databases currently available (non-redundant GenBank, SwissProt, etc.) increases daily, already exceeding a gigabyte. Aligning a single sequence is not a costly task, but normally thousands of sequences are searched at the same time. The biocomputing community usually relies either on local installations or on public servers, such as NCBI or gPS@, but the limitations on the number of simultaneous queries make this environment inefficient for large tests. Moreover, since the databases are periodically updated, it is convenient to periodically update the results of previous studies. Thus, the availability of an independent Grid-enabled version integrated into the Bioinformatics Portal of the Universidad de los Andes will provide registered users with results in a shorter time.

A Grid service for the execution of BLAST on large sets of sequences has been developed at the UPV [3]. BLAST is being used to search for similar sequences and infer their function in parasitic diseases present in Venezuela, such as leishmaniasis (mainly Leishmania mexicana), Chagas disease (mainly Trypanosoma cruzi) and malaria (mainly Plasmodium vivax). The use of Grids will enable increasing the number of fragments to be analysed and the periodic updating of this information. Moreover, the availability of larger-scale computation will enable researchers to perform evolutionary studies. The increase in the performance capabilities of the bioinformatics portal of the Universidad de los Andes through the use of the EELA Grid has brought more mature users, who access this portal to use more powerful computational resources than those available at their centres.
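To give a flavour of how such a service parallelises the work, the sketch below batches a multi-FASTA query file so that each grid job receives a manageable group of sequences. The batching granularity and file layout are illustrative assumptions, not the actual implementation of the UPV service.

```python
def split_fasta(path, seqs_per_job):
    """Group the sequences of a multi-FASTA file into batches of
    seqs_per_job sequences, one batch per grid job."""
    batches, batch, count = [], [], 0
    with open(path) as f:
        for line in f:
            if line.startswith(">"):
                if count and count % seqs_per_job == 0:
                    batches.append("".join(batch))
                    batch = []
                count += 1
            batch.append(line)
    if batch:
        batches.append("".join(batch))
    return batches

# Hypothetical usage: 10,000 query sequences become 20 jobs of 500 each.
# batches = split_fasta("queries.fasta", seqs_per_job=500)
```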
5. Phylogeny (MrBayes)

A phylogeny is a reconstruction of the evolutionary history of a group of organisms. Phylogenies are used throughout the life sciences, as they offer a structure around which to organize the knowledge and data accumulated by researchers. Computational phylogenetics has been a rich area for algorithm design over the last 15 years. The inference of phylogenies with computational methods is widely used in medical and biological research and has many important applications, such as gene function prediction, drug discovery and conservation biology [4]. The most commonly used methods to infer phylogenies include cladistics, phenetics, maximum likelihood and MCMC-based Bayesian inference. The last two depend on a mathematical model describing the evolution of the characters observed in the species included, and are usually used for molecular phylogeny, where the characters are aligned nucleotide or amino acid sequences. The complexity of large-scale phylogeny studies, such as the "Tree of Life" project, which aims to cover all organisms on Earth, represents a true computational grand challenge. Due to the nature of Bayesian inference, the simulation can be prone to entrapment in local optima. To overcome local optima and achieve a better estimation, the MrBayes program has to run for millions of iterations (generations), which requires a large amount of computation time. For multiple sessions with different models or parameters, it can take a very long time before the results can be analyzed and summarized. Phylogenetic tools are widely demanded by the Latin American bioinformatics community. A Grid service for the parallelised version of the MrBayes application is currently being developed, and a simple interface will be deployed on the bioinformatics portal of the Universidad de los Andes. This Grid-enabled service will make use of EELA resources to run phylogenetic studies at high performance.

6. Conclusion

The EELA e-infrastructure provides various collaborative groups in Latin America with more powerful computational resources than those available at their own centres. This makes the lines of investigation stated in this document feasible, as their computational requirements can be met, while also creating a network in which Latin American researchers can participate in European initiatives and vice versa, reducing the digital gap. The EELA project currently has four biomedical pilot applications running, two coming from European initiatives (EGEE) and the others from Latin American research groups. Even though it is in its early phases, the project intends to bring more research lines into its e-infrastructure, as well as to create a bigger network of collaborating partners.

References
[1] D. Navarro, "Epidemiología de las enfermedades del tiroides en Cuba", Rev Cubana Endocrinol 2004;15.
[2] J. Alert, J. Jiménez, "Tendencias del tratamiento radiante en los tumores del sistema nervioso central", Rev Cubana Med 2004;43(2-3).
[3] G. Aparicio, S. Götz, A. Conesa, J.D. Segrelles Quilis, I. Blanquer, J.M. García, V. Hernández, "Blast2GO goes Grid: Developing a Grid-Enabled Prototype for Functional Genomics Analysis", Proceedings of the HealthGrid 2006 Conference.
[4] K. Lesheng, "Phylogenetic Inference Using Parallel Version of MrBayes".
[5] S. Jan, G. Santin, D. Strul, S. Staelens, "GATE: a simulation toolkit for PET and SPECT", submitted to Phys. Med. Biol.
[6] V. Breton, "Grid added value to fight neglected diseases", WISDOM Open Day.
Outlook for Grid Service technologies within the @neurIST eHealth environment

A. ARBONA a, S. BENKNER b, J. FINGBERG c, A.F. FRANGI d, M. HOFMANN e, D.R. HOSE f, G. LONSDALE c,1, D. RUEFENACHT g and M. VICECONTI h

a Grid Systems S.A., Palma de Mallorca, Spain
b Institute of Scientific Computing, University of Vienna, Austria
c C&C Research Laboratories, NEC Europe Ltd., St. Augustin, Germany
d Computational Imaging Lab., Dept. of Technology, Pompeu Fabra University, Barcelona, Spain
e Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), St. Augustin, Germany
f Dept. Medical Physics and Clinical Engineering, U. Sheffield, UK
g Neuroradiology - HUG, University Hospital of Geneva, Switzerland
h Bio Computing Competence Centre, Supercomputing Solution s.r.l., Milan, Italy

Abstract. The aim of the @neurIST project is to create an IT infrastructure for the management of all processes linked to research, diagnosis and treatment development for complex and multi-factorial diseases. The IT infrastructure will be developed for one such disease, cerebral aneurysm and subarachnoid haemorrhage, but its core technologies will be transferable to meet the needs of other medical areas. Since the IT infrastructure for @neurIST will need to encompass data repositories, computational analysis services and information systems handling multi-scale, multi-modal information at distributed sites, the natural basis for the IT infrastructure is a Grid Service middleware. The project will adopt a service-oriented architecture because it aims to provide a system addressing the needs of medical researchers, clinicians and health care specialists (and their IT providers/systems) and medical supplier/consulting industries.

Keywords. cerebral aneurysm, integrated medical IT infrastructure, Grid Service middleware
Introduction

The IST Framework 6 project @neurIST (www.aneurist.org; included within the activities of the strategic objective "ICT for Health") commenced in January 2006. The aim of the project is to create an IT infrastructure for the management of all processes linked to research, diagnosis and treatment development for complex and multi-factorial diseases. The IT infrastructure will be developed for one such disease, cerebral aneurysm and subarachnoid haemorrhage, but its core technologies will be transferable to meet the needs of other medical areas. Indeed, the view is that @neurIST will provide a template for such multi-factorial eHealth systems. Since the IT infrastructure for @neurIST will need to encompass data repositories, computational analysis services and information systems handling multi-scale, multi-modal information at distributed sites, the natural basis for the IT infrastructure is a Grid Service middleware. The project will adopt a service-oriented architecture because it aims to provide a system addressing the needs of medical researchers, clinicians and health care specialists (and their IT providers/systems) and medical supplier/consulting industries. That general architectural approach will ensure that key issues such as access control, security, quality-of-service guarantees and integration into commercially operated environments can be fully addressed. The Grid Service middleware will be created in line with developing Grid and Web Service standards and will benefit from the existing GEMSS (www.gemss.de), SIMDAT (www.simdat.org) and InnerGrid developments. This paper introduces the background to the specific requirements of the medical processes linked to the context of cerebral aneurysms and explains the project objectives, linked to the development of the @neurIST application suites (illustrated in Figure 1).

1 Corresponding author: C&C Research Laboratories, NEC Europe Ltd., D-53757 St. Augustin, Germany; E-mail: lonsdale@ccrl-nece.de
Figure 1. The impact of the @neurIST Application Suites.
1. Project Background

The @neurIST Project (whose full title is "Integration of Biomedical Information for the Management of Cerebral Aneurysms: @neurIST") is an Integrated Project funded within the European Commission's Framework 6 programme (contract IST027703) as a contribution to the Strategic Objective for ICT for Health (also referred to as "eHealth"). The project commenced in January 2006 and has a planned duration of four years.
@neurIST will develop an IT infrastructure for the management and processing of heterogeneous data associated with the diagnosis and treatment of cerebral aneurysm and subarachnoid haemorrhage. The data span all length scales, from molecular, through cellular, to tissue, organ and patient representations. Such data are increasingly heterogeneous in form, including textual, image and other symbolic structures, and are also diverse in context, from global guidelines based on the broadest epidemiological studies, through knowledge gained from disease-specific scientific studies, to patient-specific data from electronic health records. New methods are required to manage, integrate and interrogate the breadth of data and to present it in a form that is accessible to the end user.
2. @neurIST Goals

@neurIST seeks to provide channels for the integration of all data sources on cerebral aneurysm. It has three work packages dedicated respectively to the collection, processing and integration of these data sources, and one work package dedicated to the development of four integrated exploitation suites of software for clinical and industrial use. Two platforms will be developed that will directly exploit the IT infrastructure and will provide immediate application to other disease processes. Work packages dedicated to management, dissemination and exploitation planning complete the project structure. The primary theme of @neurIST is to develop vertical integration across data structures and across length scales, but horizontal integration at every level of abstraction, from access to information sources to complex information processing, knowledge representation, structuring and fusion, will cement the collaboration between the disciplines. @neurIST will transform the management of cerebral aneurysm by providing new insight, personalized risk assessment and methods for the design of improved medical devices and treatment protocols. It is our belief that this level of integration is extraordinarily ambitious and that only a focused effort on a single disease process can credibly address the challenge. Nevertheless, the economic and societal impact in Europe and beyond will be significant. Personalized risk assessment alone could reduce unnecessary treatment by 50% or more, with concomitant savings estimated in the order of thousands of millions of euros per annum. The personal effect of aneurysm rupture is devastating: morbidity and mortality are high, affecting two thirds of afflicted patients. Furthermore, the approach will be extendable to other disease processes and scalable to federate a large number of clinical centres and public databases.
3. @neurIST's Scientific and Technological Objectives

The @neurIST project will:

• Develop a new procedure and IT-support system for the understanding and management of cerebral aneurysm through the integration of heterogeneous data and computing resources.
• Identify and collect all publicly available, relevant and strategically important data from scientific studies (multidisciplinary and on multiple length scales, genomic through organ to patient), from international epidemiological studies and from local clinical databases. This data will be supplemented by the collection of additional specific and necessary information. The Consortium will respect all data protection legislation and will obtain local ethical approval for the use of any and all data.
• Deliver a multi-scale complex information processing chain (image processing – mechano-biological analysis – haemodynamics analysis/prediction – coupling to molecular, cellular, biological and systemic levels) that will provide new diagnostic indexes and insight into the process of aneurysm development and rupture. That tool chain will generate new data to complement the observational data.
• Develop a set of scalable and reusable integrative suites and demonstrate their value for revolutionizing the understanding and management of cerebral aneurysm.
• Provide an ICT system for developing, integrating and sharing biomedical knowledge related to cerebral aneurysm as required by the four integrative suites. The @neurIST Service Oriented Architecture will build on state-of-the-art stable Grid and Web Services technology. A design focus will be placed on interoperability through the use of standards, leveraging existing technologies where possible while adding new functionality where necessary (i.e. security, semantic data mediation). The @neurIST infrastructure will not only support computationally demanding tasks such as complex modelling and simulation but will also enable access to health data held in public as well as protected databases distributed all over the world.
• Inspire and promote the development of corresponding systems for other disease processes by demonstrating the personal and economic impact of IT-enabled information integration in the context of cerebral aneurysm. This will be carried out primarily through the stimulating vision and influence of @neurIST in several of the participating partners, including SMEs and industrial partners, whose field of research or market sector reaches beyond that specifically addressed in this project and who will seek other application areas or markets in which to exploit the concepts and platforms developed within @neurIST.
The @neurIST project will thus create a comprehensive IT-system to provide a new way of understanding and managing cerebral aneurysm, based on the integration of information and tools across scientific and organizational boundaries. @neurIST seeks to improve facilities for both research development and individual patient care by providing complex information extraction, processing and inferential deduction tools integrated into the patient database.
Author Index Ainsworth, J. Ajayi, O. Analyti, A. Aparicio, G. Arbona, A. Bagnasco, S. Barbera, R. Barbounakis, I. Barillot, C. Bath, P. Belleman, R.G. Bellet, F. Beltrame, F. Benali, H. Benkner, S. Benoit-Cattin, H. Blanchet, C. Blanquer, I. Boniface, M. Boucelma, O. Breton, V. Buchan, I. Bucur, A. Burger, A. Canesi, B. Carbonell, J. Cardenas, M. Castiglioni, I. Celda, B. Cerbioni, K. Cerello, P. Chen, Y. Cheran, S.C. Chouvarda, I. Clarke, W. Colonna, F.-M. Comaniciu, D. Combet, C. Conesa, A. Date, S. de Alfonso, C.
348 117 247 194 401 69 392 205 3 336 43 34 69 3 401 34 142, 187 82, 131, 194, 319, 381, 397 225 368 155, 319, 392 348 55 167 69 82 397 69 82 205 69 336 69 236 336 368 259 187 194 358 131
del Frate, C. Deléage, G. Di Bona, S. Dieng-Kuntz, R. Dogac, A. Dojat, M. Donnarieix, D. Eichelberg, M. Falzone, A. Fernández Sánchez, C. Fernández, M. Fingberg, J. Fletcher, J. Fontanelli, R. Fox, N. Frangi, A.F. Freund, J. Frohner, Á. Fujikawa, K. Gaignard, A. Galvez, J. García, J.M. Geddes, J. Germain-Renaud, C. Gibaud, B. Gilardi, M.C. Glatard, T. Goh, C. Gómez Rodríguez, F. Gómez, A. González Castaño, D. González Castaño, F.J. Götz, S. Graschew, G. Guerri, D. Habermann, B. Hajnal, J.V. Hamadicharef, B. Harper, R. Harrison, A. Hartswood, M. Hasegawa, I.
305 142, 187 205 167 225 3 392 225 392 330 397 401 336 205 336 401 259 14 158 3 305 194 336 25 3 69 93 205 330 330 330 330 194 295 205 167 336 205 348 283 336 358
Hauer, T. Heinzlreiter, P. Hernández, V. Herveg, J. Hill, D. Ho, K. Hofmann, M. Holeček, T. Hose, D.R. Hu, P. Hung, S.-H. Hung, T.-C. Ifeachor, E. Ikebe, M. Ioannis, Y. Isea, R. Jacq, F. Jacq, N. Jirotka, M. Job, D. Jouvenot, D. Juang, J.-N. Juma, I. Kafetzopoulos, D. Katzarova, M. Kelley, I. Kinkingnéhun, S. Koblitz, B. Kootstra, R. Kostkova, P. Koutkias, V. Krajíček, O. Kranzlmüller, D. Kravchenko, A. Kuba, M. Kunszt, P. La Manna, S. Lawrie, S. Lefort, V. Legré, Y. Lesný, P. Liu, P. Lloyd, S. Lonsdale, G. Loomis, Cal Loomis, Charles Lopez Torres, E.
305 295 82, 131, 194, 319, 381, 397 107 336 336 155, 401 273 401 205 217 217 205 158 259 397 155 155 336 336 14 217 348 247 336 283 3 14 55 167 236 273 295 179 273 14 205 336 187 155, 319 273 259 336 401 14 25 69
Maaß, A. Mackay, C. Maglaveras, N. Maigne, L. Malousi, A. Manganas, A. Manset, D. Marias, K. Martí-Bonmatí, L. Masuda, S. Matsumoto, J.-P. Mayo, R. McClatchey, R. McIntosh, A. McLeish, K. Merelli, I. Mieilica, E. Milanesi, L. Molinari, E. Mollon, R. Monleón, D. Montagnat, J. Mora, H.R. Moratal, D. Morley-Fletcher, E. Mouriño Gallego, J.C. Nederveen, A.J. Nistoreanu, I. Nuñez, L. Nurminen, N. Obbink, H. Odeh, M. Olabarriaga, S.D. Osorio, A. Palanca, E. Palmer, J. Pélégrini-Issac, M. Pena García, J. Pennec, X. Pera, C. Perez-Griffo, J. Perry, D. Pombar Cameán, M. Pongiglione, G. Potamias, G. Power, D. Procter, R. Rakowsky, S.
155 336 236 392 236 247 305 247 82 158 3 397 259, 305 336 336 374 283 374 69 142 82 14, 93 397 82 259 330 43 34 397 205 55 305 43 25 205 336 3 330 93, 259 14, 34 397 336 330 259 247 336 336 295
Reichstadt, M. Riposan, A. Robles, M. Rodríguez-Silva, D. Roelofs, T.A. Rogulin, D. Rossor, M. Ruefenacht, D. Russell, D. Saleh, A. Salzemann, J. Sam, Y. Sandercock, P. Santos, N. Schenone, A. Schlag, P.M. Schroeder, M. Schwichtenberg, H. Segrelles, D. Shimojo, S. Simon, E. Simpson, A. Sinnott, R. Slack, R. Snel, J.G. Solomonides, T. Sridhar, M. Starita, A. Stell, A. Stevens, R.
155 283 82, 194 330 295 305 336 401 336 225 155 368 336 14 69 295 167 155 194, 381 358 3 336 117 336 43 305, 319 155 205 117 167
Sun, L. Sunahara, H. Takeda, S. Talon, M. Tashiro, T. Taylor, I. Temal, L. Texier, R. Thiam, C.O. Torres, E. Torterolo, L. Tsiknakis, M. Tverdokhlebov, N. Ure, J. van Leeuwen, J. Varri, A. Vejvalka, J. Viceconti, M. Vinod-Kusam, K. Volkert, J. Voss, A. Wardlaw, J. Warren, R. Watkins, E.R. Watson, G. Zervakis, M. Zhou, X. Zhuchkov, A. Zimmermann, M.
205 158 358 194 358 283 3 25 392 131 69 247 179 336 55 205 273 401 155 295 336 336 305 225 336 205 259 179 155