FROM GRID TO HEALTHGRID
Studies in Health Technology and Informatics

This book series was started in 1990 to promote research conducted under the auspices of the EC programmes Advanced Informatics in Medicine (AIM) and Biomedical and Health Research (BHR), bioengineering branch. A driving aspect of international health informatics is that telecommunication technology, rehabilitative technology, intelligent home technology and many other components are moving together and form one integrated world of information and communication media. The complete series has been accepted in Medline. In the future, the SHTI series will be available online.

Series Editors: Dr. J.P. Christensen, Prof. G. de Moor, Prof. A. Hasman, Prof. L. Hunter, Dr. I. Iakovidis, Dr. Z. Kolitsi, Dr. Olivier Le Dour, Dr. Andreas Lymberis, Dr. Peter Niederer, Prof. A. Pedotti, Prof. O. Rienhoff, Prof. F.H. Roger France, Dr. N. Rossing, Prof. N. Saranummi, Dr. E.R. Siegel and Dr. Petra Wilson
Volume 112

Recently published in this series:

Vol. 111. J.D. Westwood, R.S. Haluck, H.M. Hoffman, G.T. Mogel, R. Phillips, R.A. Robb, K.G. Vosburgh (Eds.), Medicine Meets Virtual Reality 13
Vol. 110. F.H. Roger France, E. De Clercq, G. De Moor and J. van der Lei (Eds.), Health Continuum and Data Exchange in Belgium and in the Netherlands – Proceedings of Medical Informatics Congress (MIC 2004) & 5th Belgian e-Health Conference
Vol. 109. E.J.S. Hovenga and J. Mantas (Eds.), Global Health Informatics Education
Vol. 108. A. Lymberis and D. de Rossi (Eds.), Wearable eHealth Systems for Personalised Health Management – State of the Art and Future Challenges
Vol. 107. M. Fieschi, E. Coiera and Y.-C.J. Li (Eds.), MEDINFO 2004 – Proceedings of the 11th World Congress on Medical Informatics
Vol. 106. G. Demiris (Ed.), e-Health: Current Status and Future Trends
Vol. 105. M. Duplaga, K. Zieliński and D. Ingram (Eds.), Transformation of Healthcare with Information Technologies
Vol. 104. R. Latifi (Ed.), Establishing Telemedicine in Developing Countries: From Inception to Implementation
Vol. 103. L. Bos, S. Laxminarayan and A. Marsh (Eds.), Medical and Care Compunetics 1
Vol. 102. D.M. Pisanelli (Ed.), Ontologies in Medicine
Vol. 101. K. Kaiser, S. Miksch and S.W. Tu (Eds.), Computer-based Support for Clinical Guidelines and Protocols – Proceedings of the Symposium on Computerized Guidelines and Protocols (CGP 2004)
Vol. 100. I. Iakovidis, P. Wilson and J.C. Healy (Eds.), E-Health – Current Situation and Examples of Implemented and Beneficial E-Health Applications

ISSN 0926-9630
From Grid to Healthgrid
Proceedings of Healthgrid 2005
Edited by
Tony Solomonides and
Richard McClatchey with Vincent Breton, Yannick Legré and Sofie Nørager
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC
© 2005 The authors. All rights reserved.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 1-58603-510-X
Library of Congress Control Number: 2005923341

Publisher
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the UK and Ireland
IOS Press/Lavis Marketing
73 Lime Walk
Headington
Oxford OX3 7AD
England
fax: +44 1865 750079

Distributor in the USA and Canada
IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS
Introduction

Healthgrid 2005 was the third annual conference of the Healthgrid association. The first conference, held in January 2003 in Lyon, reflected the need to involve all actors – physicians, scientists and technologists – who might play a part in the application of grid technology to health, whether in health care or in bio-medical research. The second conference, held in Clermont-Ferrand in January 2004, reported research and work in progress from a large number of projects. It was thus the intention of the Programme Committee that the third conference would report on results of current grid projects in health care, would indicate the outcome of field tests and would identify deployment strategies for prototype applications in the field. In the event, all this has been achieved, but other outstanding problem areas and technological challenges have also been identified and new solutions to these issues proposed.

It has been the Programme Committee's aspiration for Healthgrid 2005 to provide a forum for grid projects in the medical, biological and biomedical domains as well as for projects that seek to integrate these through grid computing. With an emphasis on results, it was our hope that functionality being developed within such projects would be demonstrated. The overall objective has been to reinforce and promote the awareness of the deployment of grid technology in health. The response to the call for papers has been gratifying, generating papers across the spectrum of bio/medical informatics from technology developers and researchers through research network representatives, health authorities to healthcare clinicians and administrators. The papers have been divided into four themes: Knowledge and Data Management; Deployment of Grids in Health; Current Projects; and Ethical, Legal, Social and Security Issues. It will be seen in the papers themselves that healthgrid has matured beyond its original projects and is now tackling some difficult problems that, even two years ago, seemed intractable.

In Knowledge and Data Management several authors deal with the important question of heterogeneous sources of information. Ownership and 'liveness' of the information are two aspects of the question tackled by the IBHIS project. Budgen et al. consider the problem of semantic interoperability between a broker and its information sources, leading to proposals for the provision of an equivalent grid service. Among domains of interest, proteomics provides the major challenge in bioinformatics grids. In the PROTEUS and ROKKY projects we have two contrasting applications. Cannataro et al. describe the use of twin ontologies to bring data mining methods to bear on the bioinformatics domain in the design of in silico experiments. In their sub-project of the major Japanese BioGrid project, Fujikawa et al. focus on workflow design for high throughput computing to predict protein structure, a notoriously difficult problem. Finally, two ontology-oriented papers from Russia: Irina Strizh and the Moscow State University group (Joutchkov et al.) report on several examples of what they describe as 'onto-technologies' in, among other domains, immunology and the development of new medicines. Smirnov et al. describe an application of St Petersburg's 'Knowledge Logistics' technology (KSNet) to provide a knowledge repository in healthgrid with an example battlefield application.
Among reports on Deployment of Grids in Health, the MammoGrid project (Amendolia et al.) reports on its now fully developed grid architecture and migration to EGEE. In the first of two papers from the United States, Binns et al. report on an Access Grid visualization prototype, reflecting both on the system itself and the process of constructing it. The Valencia group (Blanquer et al.) consider how clinical decision support systems may be integrated in a healthgrid; while medical users access the grid through web protocols, they may also search classification engines devoted to the analysis of different diseases. Chouvarda et al. tackle the problem of establishing a biosensor network based on a grid, interfacing to wearable platforms through 'body area networks' to establish an infrastructure for 'pervasive healthcare'. The second US contribution (Grethe et al.) provides insight into the development of the Biomedical Informatics Research Network (BIRN), a model large-scale collaboration with many scientific targets in sight.

Current Projects reporting progress include ProGenGrid (Aloisio et al.), an e-laboratory for the simulation of biomedical experiments. Exhibiting a service-oriented architecture, it provides workflow and application services, semantics, and generic interfaces to genome, protein, disease and clinical databases. A virtual laboratory for e-science (VL-e) is also the goal of a Dutch project from Philips and the University of Amsterdam (Bucur et al.); based on the concept of decomposition patterns, it is exemplified by a Grid Architecture for Medical Applications. The approach is applied to the very challenging 'fibre tracking' problem in imaging. Juan Fritschy (Fritschy et al.) reports on the use of a Condor architecture to address the problems of signal processing in clinical neurophysiology; in its present state of development, the 25-fold improvement in performance will allow researchers to apply their approach to nonlinear 'electrical impedance tomography' reconstructions. Cardiac MRI image segmentation is the subject of a paper from a Spanish-Dutch collaboration (Ordas et al.). They provide a careful argument for the necessity of grid in tackling a top-down, model-based approach to image segmentation. Ivan De Mitri then reports on the further development of the GP-CALMA project into 'MAGIC-5'. Asif Jan Muhammad and Henry Markram report on NEOBASE, an approach to mapping the neocortical microcircuit.

It need hardly be said that, despite the formidable technical problems grids present, Ethical, Legal, Social and Security Issues give rise to some of the most intractable obstacles to effective healthgrid deployment. The corresponding attention being paid to these problems is, of course, considerable. In this section, we have reports from several projects at different levels of maturity. ARTEMIS (Boniface and Wilken) describes a semantic web service to mediate the security and privacy policies mandated by legal and local regulatory requirements. Two diverse exemplars have been deployed in Belfast, Northern Ireland, and Ankara, Turkey. The GEMSS project is one of the earliest and most widely known multidisciplinary medical projects in Europe. Fenner et al. report on an application in radiosurgery planning and explore the implications of healthcare sector requirements on such aspects as the security and time-criticality of systems. The paper goes on to compare the prototype with industry standard production systems. Hartswood et al.
describe in detail the travails of requirements capture and ethical approval in the eDiaMoND project, a UK e-Science project. Requirements capture is too often glossed over in computing projects and it is instructive to see it taken so seriously in this case; interestingly, its EU-funded sister project MammoGrid also dealt with requirements in depth, although very differently (cf. Mike Brady’s talk at DiDaMIC 2004, http://www.irisa.fr/visages/demo/Neurobase/Didamic/index.html).
ECIT (Lawford-Davies and Jenkins) explores the implications of the Human Tissue Directive and the need for systematic and yet highly confidential collection of data from cases of assisted reproduction. The ontological work has only just begun in this area, but a broad collaboration which extends beyond European borders promises to deliver a suitable grid solution, taking into account the full spectrum from classification of infertility, through the genetics of ‘pre-implantation genetic diagnosis’ to the potential complications for mother and child. The μgrid project reports on an authentication and authorization prototype – a problem treated rather lightly in grid literature, despite its sine qua non criticality in healthcare applications. A particular concern for μgrid is medical imaging and archiving. Zhang et al. finally propose a set of confidentiality and privacy procedures through one-to-many pseudonymization which may nevertheless be reversed by a trusted party.
Acknowledgements

The editors would like to express their gratitude to the Programme Committee and the reviewers in particular: Roberto Amendolia, Brecht Claerhout, Ignacio Blanquer, Vicente Hernandez, Peter Highnam, Chun-Hsi Huang, Ayman Issa, Sharon Lloyd, Johan Montagnat, Mohammed Odeh, and Eugenio Zanon. Each paper was read by at least three reviewers, including the editors.

Having followed two markedly European conferences, Healthgrid 2005 was fortunate enough to attract contributions from Japan, Russia and the United States, as well as from European members of the Healthgrid association who have once again taken the opportunity to disseminate the results of EU-funded projects. The editors wish to thank all contributors for their papers and their responsiveness to reviews. Neither the conference nor this volume of proceedings would have been possible without the indefatigable endeavours of Yannick Legré, Sharon Lloyd and Hélène Ruelle. Yannick in particular set exceptionally high standards following his successful organization of Healthgrid 2004 at Clermont-Ferrand.

The photograph on the cover of this volume is reproduced by permission of the Public Relations department of the University of Oxford; this is gratefully acknowledged. Opinions expressed in these proceedings are those of individual authors and editors, and not necessarily those of their institutions.
Tony Solomonides
Bristol, February 2005
Healthgrid 2005 Programme Committee

Nicholas Ayache (INRIA, France)
Robert Baud (University Hospital of Geneva, Switzerland)
Francesco Beltrame (University of Genova, Italy)
Mike Brady (University of Oxford, UK; Honorary Chair)
Jose-Maria Carazo (Centro Nacional de Biotecnologia-CSIC, Spain)
Piet De Groen (Mayo Clinic and Foundation, USA)
Georges De Moor (University of Gent, Belgium)
Roland Eils (DKFZ, Germany)
Mark Ellisman (University of California, San Diego, USA)
Abbas Farazdel (IBM Life Sciences; Co-chair, GGF Life Sciences Grid)
Peter Highnam (Dept of Health & Human Services, NIH, Maryland, USA)
Chun-Hsi Huang (University of Connecticut, USA)
Hing Yan Lee (National Grid Office, Singapore)
Yu-Chuan Li (Taipei Medical University, Taiwan)
Simon C. Lin (Academia Sinica Computing Centre, Taiwan)
Isabelle Magnin (CNRS/STIC, France)
Evangelos Markatos (FORTH, Greece)
Fernando Martin-Sanchez (Institute of Health Carlos III, Spain)
Carlos Martinez-Riera (Generalitat Valenciana, Spain)
George-Ioan Mihalas (University of Timisoara; Secretary of EFMI, Romania)
Marc Nyssan (Vrije Universiteit Brussel, Belgium)
Mohammed Odeh (University of the West of England, UK)
Francesco Sicurello (Univ. of Milano-Bicocca, Health Dir. Lombardia, Italy)
Peter Sloot (University of Amsterdam, The Netherlands)
Tony Solomonides (University of the West of England, UK; Chair)
Piyawut Srichaikul (NECTEC, Thailand)
Tin Wee Tan (National University of Singapore, Singapore)
Manolis Tsiknakis (FORTH, Greece)
Francis Wray (University of Edinburgh, United Kingdom)
Eugenio Zanon (Ospedale Evangelico, Turin, Italy)
Contents

Introduction v
Acknowledgements vii
Healthgrid 2005 Programme Committee viii
Part 1. Knowledge and Data Management

Managing Healthcare Information: The Role of the Broker 3
David Budgen, Mark Turner, Ioannis Kotsiopoulos, Fujun Zhu, Michelle Russell, Michael Rigby, Keith Bennett, Pearl Brereton, John Keane and Paul Layzell

Using Ontologies in PROTEUS for Modeling Proteomics Data Mining Applications 17
Mario Cannataro, Pietro Hiram Guzzi, Tommaso Mazza, Giuseppe Tradigo and Pierangelo Veltri

Applying a Grid Technology to Protein Structure Predictor "ROKKY" 27
Kazutoshi Fujikawa, Wenzhen Jin, Sung-Joon Park, Tadaomi Furuta, Shoji Takada, Hiroshi Arikawa, Susumu Date and Shinji Shimojo

Grid-Based Onto-Technologies Provide an Effective Instrument for Biomedical Research 37
Alexei Joutchkov, Nikolay Tverdokhlebov, Irina Strizh, Sergey Arnautov and Sergey Golitsyn

Ontology-Based Knowledge Repository Support for Healthgrids 47
Alexander Smirnov, Mikhail Pashkin, Nikolai Chilov and Tatiana Levashova
Part 2. Deployment of Grids in Health

Deployment of a Grid-Based Medical Imaging Application 59
S. Roberto Amendolia, Florida Estrella, Chiara del Frate, Jose Galvez, Wassem Hassan, Tamas Hauer, David Manset, Richard McClatchey, Mohammed Odeh, Dmitry Rogulin, Tony Solomonides and Ruth Warren

Developing a Distributed Collaborative Radiological Visualization Application 70
Justin Binns, Fred Dech, Matthew McCrory, Michael E. Papka, Jonathan C. Silverstein and Rick Stevens

Clinical Decision Support Systems (CDSS) in GRID Environments 80
Ignacio Blanquer, Vicente Hernández, Damià Segrelles, Montserrat Robles, Juan Miguel García and Javier Vicente Robledo

Grid-Enabled Biosensor Networks for Pervasive Healthcare 90
I. Chouvarda, V. Koutkias, A. Malousi and N. Maglaveras

Biomedical Informatics Research Network: Building a National Collaboratory to Hasten the Derivation of New Understanding and Treatment of Disease 100
Jeffrey S. Grethe, Chaitan Baru, Amarnath Gupta, Mark James, Bertram Ludaescher, Maryann E. Martone, Philip M. Papadopoulos, Steven T. Peltier, Arcot Rajasekar, Simone Santini, Ilya N. Zaslavsky and Mark H. Ellisman
Part 3. Current Projects

ProGenGrid: A Grid-Enabled Platform for Bioinformatics 113
Giovanni Aloisio, Massimo Cafaro, Sandro Fiore and Maria Mirto

A Grid Architecture for Medical Applications 127
Anca Bucur, René Kootstra and Robert G. Belleman

Applications of GRID in Clinical Neurophysiology and Electrical Impedance Tomography of Brain Function 138
J. Fritschy, L. Horesh, D. Holder and R. Bayford

Parametric Optimization of a Model-Based Segmentation Algorithm for Cardiac MR Image Analysis: A Grid-Computing Approach 146
S. Ordas, H.C. van Assen, J. Puente, B.P.F. Lelieveldt and A.F. Frangi

The MAGIC-5 Project: Medical Applications on a Grid Infrastructure Connection 157
Ivan De Mitri

NEOBASE: Databasing the Neocortical Microcircuit 167
Asif Jan Muhammad and Henry Markram
Part 4. Ethical, Legal, Social and Security Issues

ARTEMIS: Towards a Secure Interoperability Infrastructure for Healthcare Information Systems 181
Mike Boniface and Paul Wilken

Radiosurgery Planning Supported by the GEMSS Grid 190
J.W. Fenner, R.A. Mehrem, V. Ganesan, S. Riley, S.E. Middleton, K. Potter and L. Walton

Working IT out in e-Science: Experiences of Requirements Capture in a HealthGrid Project 198
Mark Hartswood, Marina Jirotka, Rob Procter, Roger Slack, Alex Voss and Sharon Lloyd

Legal Issues to Address when Managing Clinical Information across Europe: The ECIT Case Study (www.ECIT.info) 210
James Lawford Davies and Julian Jenkins

Authentication and Authorisation Prototype on the µgrid for Medical Data Management 222
Ludwig Seitz, Johan Montagnat, Jean-Marc Pierson, Didier Oriol and Diane Lingrand

A Linkable Identity Privacy Algorithm for HealthGrid 234
Ning Zhang, Alan Rector, Iain Buchan, Qi Shi, Dipak Kalra, Jeremy Rogers, Carole Goble, Steve Walker, David Ingram and Peter Singleton
Part 5. The Healthgrid White Paper

1. From Grid to Healthgrid: Prospects and Requirements 253
2. A Compelling Business Case for Healthgrid 265
3. Medical Imaging and Medical Image Processing 270
4. Computational Models of the Human Body 277
5. Grid-Enabled Pharmaceutical R&D: Pharmagrids 283
6. Grids for Epidemiological Studies 287
7. Genomic Medicine and Grid Computing 296
8. Healthgrid Confidentiality and Ethical Issues 306
9. Healthgrid from a Legal Point of View 312

White Paper Contributors 319
Author Index 323
Part 1 Knowledge and Data Management
Managing Healthcare Information: The Role of the Broker

David BUDGEN a,1, Mark TURNER a, Ioannis KOTSIOPOULOS b, Fujun ZHU c, Michelle RUSSELL d, Michael RIGBY d, Keith BENNETT e, Pearl BRERETON a, John KEANE b and Paul LAYZELL b

a School of Computing & Mathematics, Keele University, Staffordshire, ST5 5BG
b School of Informatics, University of Manchester
c Department of Computer Science, University of Durham
d Centre for Health Planning & Management, Keele University
e School of Engineering, University of Durham
Abstract. We describe a prototype information broker that has been developed to address typical healthcare information needs, using web services to obtain data from autonomous, heterogeneous sources. Some key features are reviewed: how data sources are turned into data services; how we enforce a distributed access control policy; and how semantic interoperability is achieved between the broker and its data services. Finally, we discuss the role that such a broker might have in a Grid context, as well as the limitations this reveals in current Grid provision. Keywords. Access control, broker, data services, semantic interoperability
1. Introduction

Decisions in healthcare often need to be based upon information that has been drawn from a range of different sources. The care team may need information from hospital departments, primary care doctors and social workers in order to formulate a care plan for an elderly patient; a doctor suspecting physical abuse of a child might wish to check whether there are records of previous visits held by other hospitals or primary care surgeries; paramedics attending to a local household might need to know whether there is relevant information that they should have before proceeding; to give but a few examples. Such decisions will also occur in a context where:

• the information acquired needs to be fully up to date;
• the relevant information may be 'owned' by independent autonomous agencies, that will impose their own conditions upon access and use;
• the set of relevant sources of information may not be known at the outset;
• there is no one 'key' that can be employed to identify the individual patient across all of the relevant agencies.
Unfortunately, with the current system of individual stand-alone data sources, such decisions are in fact made upon incomplete information.

1 Correspondence to: David Budgen, School of Computing & Mathematics, Keele University, Staffordshire ST5 5BG; E-mail: [email protected].
The IBHIS project (Integration Broker for Heterogeneous Information Sources) has been exploring one approach to resolving these issues, which is that of the broker, a trusted intermediary that acts to collect, protect and assemble information from electronic records held across many distributed agencies. Funded by EPSRC's Distributed Information Management programme, this has been a collaborative project involving the Universities of Durham, Keele and Manchester, exploring both the broker concept and also the use of service-based technologies for its realisation [8,22,25]. The research has addressed many relevant themes, supported through the development of two 'proof of concept' prototypes, with the second of these providing a genuinely distributed service-based platform with which to explore our ideas.

As originally conceived, the main challenges for the project lay in addressing the variety and heterogeneity of the data sources, as well as the need to ensure security of access to the data. In this paper, we explain how we addressed these challenges; identify some of the new challenges posed by the form of the second prototype; and consider how it could fit into the wider context presented by Grid technology. The next sections explain more about the broker role and the requirements that this poses; describe the architectural form and technological implementation of our second prototype; and provide a discussion of the experiences from this. We then consider how these might translate into the Grid context, and what issues need to be addressed to make this translation.

This may not fit fully with the existing perception of Grid technologies, in that it does not employ distributed processing in the sense of incorporating remote processing functions, but it does address the core Grid objective of harnessing and integrating remote data in real time, while acceding to local systems' autonomy and operational methods. It does this in a new and novel way, and thus opens up a new paradigm of end-user access to the richness of current and historic data existing, but currently inaccessible, in the wider health and social care domains.
2. The Information Broker

2.1. The Broker Concept

The concept of the broker as some agency, human or otherwise, that gathers information and melds this into something composite is quite a familiar one. In travel, both human brokers and electronic ones (such as Travelocity, www.travelocity.com) enable the customer to make an enquiry about flights on specific dates, with personalised preferences, and where appropriate will assemble a package of travel elements and accommodation. A key aspect is that the broker can employ its (his/her) knowledge of suitable elements and then use pre-assigned permissions to 'drill down' into the details of airline bookings etc. to discover capabilities and prices. In so doing, the broker therefore acts on behalf of the client, combining the needs and preferences of the client with their own knowledge of the 'solution domain'.

An important aspect of this is that the broker performs these roles without needing to 'own' the information concerned. Unlike data mining approaches, which depend upon the availability of copies of information; or federated systems that require knowledge of how information is structured within the supplier; a broker typically operates
by using fully current information obtained directly from the sources and requires no knowledge of how this might be stored. Indeed, it is this characteristic that poses the major challenge for the IBHIS system—when mapping the concept on to the use of electronic healthcare records, how is the broker to find the relevant sources; to express the user's needs within the context of that source; and to maintain the confidentiality required by the sources? (In turn, these characteristics also mean that data mining or federated approaches are much more difficult to employ in such a context, since the sources are likely to be autonomous as well as distributed and may only be willing to provide restricted access to the data that they own.)

Within the English healthcare system, the high profile National Programme for Information Technology (NPfIT) is developing a framework for electronic patient record (EPR) systems that is essentially monolithic in nature. However, even in this context, there will be a need to exchange with, and obtain information from, other health providers (including those in Wales and Scotland) and external bodies such as social services. There are other good reasons for taking a more distributed route, including the lack of evidence for the effectiveness of such a monolithic framework in information systems, and also the 'locality' factor, whereby healthcare is delivered in local communities, and the effective maintenance of such information is more likely to occur when the provider of information is also a user [17].

2.2. The Service-Based Form

A key decision in the development of the IBHIS broker was to use a service architecture [3,21]. Software services can be characterised as:

• being used rather than owned, with no significant amount of local processing needing to be performed by a user of the service;
• conforming to a document-style interface, so making a service independent of programming language constructs for interconnection and data exchange;
• being stateless, in the sense of not preserving end-user knowledge across different episodes of use.
A range of relevant technologies for service forms has been built around the foundations of XML and SOAP, although with gaps in their provision with regard to some of the needs of the broker, an issue that we will return to in our discussion. IBHIS has also adopted two interpretations of the service concept:

• a service that performs functional tasks and that, within this context, is essentially static in nature—the broker itself is composed from a set of such services that will authenticate the user, help them formulate a query, and perform further post-query checking to avoid problems of inference;
• a service that provides information (we term this as Data as a Service or DaaS)—where the set of relevant sources is usually determined dynamically in order to address the needs and form of a particular query.
The latter has led to the concept of the Data Access Service (DAS), which forms one of the key underpinnings of the broker system. In essence, a DAS forms an interface to an autonomous agency’s record store, and is used to translate queries on to the concepts and forms used within this, as well as to help enforce any access policy rules that the agency may have. Figure 1 illustrates these concepts schematically.
Figure 1. A schematic of the IBHIS structure.
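To make the 'Data as a Service' idea concrete, the following minimal Java sketch shows the kind of document-style interface a DAS might expose to the broker; the interface name and method signature are illustrative assumptions on our part rather than the actual IBHIS code.

import org.w3c.dom.Document;

/**
 * Minimal sketch of a document-style Data as a Service (DaaS) interface.
 * The name and signature are illustrative assumptions, not the IBHIS API.
 */
public interface DataAccessService {

    /**
     * Accepts a query expressed in global (ontology) terms as an XML document,
     * translates it to the local record store's own query form, applies the
     * agency's local access policy, and returns the authorised results as XML.
     */
    Document executeQuery(Document globalQuery, Document credentials);
}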
In the next section we review some of the issues that were identified in the development of IBHIS, and describe the ways in which they have been addressed.
3. The IBHIS Architecture and Implementation

3.1. Evolution of the IBHIS Architecture

As indicated above, the IBHIS prototype has been developed as a 'proof of concept', with the aim of enabling us to explore a number of research questions relating to the usefulness of the broker approach as well as to investigate how well service architectural forms can support such an approach.

The first prototype, as described in [22], was primarily concerned with investigating the capability of service forms, as well as refining our understanding of the form and needs of a broker. This employed service forms within the broker itself, but as an interim measure, it used a federated database schema to provide data access, enabling us to investigate how a DAS could be structured and the functionality that it would need to provide. The second prototype was an almost complete re-engineering of the overall architecture and its implementation, with the major characteristics being:
• the development of full structures for the Data Access Services, including any associated control information used to describe the data provided, the form of interface, and local access policies;
• decentralisation of the access control model, so that decisions about access to items of information could be made at the appropriate points within a data request cycle, and by the most appropriate element, according to the local autonomous policies;
• the development of a user interface that enables the user to formulate their queries in terms of domain-related concepts (necessitating the achievement of semantic interoperability between the broker and its information sources).

Figure 2. The architecture of the second IBHIS prototype.
While these can be summarised fairly concisely, a full technical explanation is beyond the scope of this particular paper. For this paper, we concentrate on those aspects that are most relevant to Grid issues and confine our description of these characteristics to these aspects. Figure 2 illustrates the architecture of the second IBHIS prototype.

3.2. The Data Access Service

The Data Access Service (DAS) is a key element of the IBHIS architecture, as it provides a way of transparently accessing one or more distributed, heterogeneous, autonomous data sources through a service-oriented interface. A DAS is composed of four major elements as shown in Figure 3, and these are described below.

Figure 3. Elements of a Data Access Service.

DAS description file. This effectively provides the 'service interface' for the DAS and provides a semantic description of the elements of the DAS in terms of an appropriate ontology. The information provided in this includes: descriptions of the data (input and output) and its format; the domain and functionality that the data represents; the access policies for using the data; and any other non-functional characteristics such as cost etc. The information from this file is published into the broker's service registry so that the query service may discover them as appropriate. The description file uses a mix of WSDL, XML and OWL to describe the characteristics of a DAS.

Metadatabase. The role of this is to simplify evolution and change. The service provider maps the local data items from (possibly multiple) local data sources to the instances of the domain ontology. The advantage of this is that changes to local records only involve a change to the metadatabase, not to the description file.

DAS engine. This handles queries, analyses and authenticates them, and if valid, translates them locally into the appropriate form of query (such as SQL), accessing the metadatabase as necessary to identify the local data sources that can provide the requested information.

Data sources. These are the local data storage mechanisms, which may be heterogeneous and distributed. However, because of the DAS description file, the detailed storage formats actually used should not be visible to the end-user.

3.3. Distributed Access Control

In the first prototype, we employed a 'classical' Role-Based Access Control (RBAC) model whereby a central policy was implemented by the broker itself [18]. However, to reflect the much more distributed nature of the second prototype, it proved necessary to develop a new hybrid access control model that we have termed Service-enabled Data Access Control (S-DAC).
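As a rough illustration of how the three stages described in the remainder of this subsection fit together, the following Java sketch composes them into a single request cycle. All class and method names here (AuthenticationService, AccessControlService, activateCredentials and so on) are assumptions made for this sketch rather than the IBHIS implementation; the DataAccessService interface is the one sketched in Section 2.2.

import org.w3c.dom.Document;

// Minimal collaborator interfaces, assumed for this sketch only.
interface AuthenticationService {
    Document activateCredentials(String userId, String password);
}

interface AccessControlService {
    Document authoriseQuery(Document query, Document credentials);
    Document filterInferences(Document results, Document credentials);
}

/** Illustrative composition of the three S-DAC stages described below. */
public class SDacRequestCycle {

    private final AuthenticationService authentication;  // stage 1: roles, teams, over-rides
    private final AccessControlService accessControl;    // stages 2 and 3: query and inference checks
    private final DataAccessService das;                  // executes the authorised query

    public SDacRequestCycle(AuthenticationService authentication,
                            AccessControlService accessControl,
                            DataAccessService das) {
        this.authentication = authentication;
        this.accessControl = accessControl;
        this.das = das;
    }

    /** Runs one query through authentication, data access control and inference checking. */
    public Document submit(String userId, String password, Document query) {
        // Stage 1: authenticate the user and activate their roles/teams for this session.
        Document credentials = authentication.activateCredentials(userId, password);

        // Stage 2: check the query against the access rules of the target DAS,
        // reducing its scope if particular attributes must be excluded.
        Document authorisedQuery = accessControl.authoriseQuery(query, credentials);

        // Execute the authorised query at the Data Access Service.
        Document results = das.executeQuery(authorisedQuery, credentials);

        // Stage 3: apply the inference policy to the returned items before release.
        return accessControl.filterInferences(results, credentials);
    }
}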
In effect, this extends the RBAC concepts, adding some features that are appropriate to the healthcare domain—dynamic team-based permissions; identity-level permissions; and the ability to over-ride permission requirements in an emergency. Users may also grant permission to activate roles and teams to other users, modelling the transfer of authority that arises as patients move through the healthcare processes. Access control in IBHIS is therefore a three-stage process, involving:

Authentication. While the broker uses a conventional log-in process, S-DAC also determines which users are authorised for particular roles and teams during a particular session (which may be time-dependent). The user may then choose to activate any number and combination of available roles, teams, or over-rides (credentials) for which they are authorized, and a further contextual check is performed at the time of activation. In this way, S-DAC ensures a dynamic separation of duties.

Data Access Control. Once the user has formulated a query, this is passed to the Access Control Service (ACS) as a SOAP message. In addition to the details of the query and user credentials, this will also include details of the intended recipient DAS. The ACS checks that the query is consistent with the access control rules for that particular DAS and where necessary the query is reduced in scope (if particular attributes need to be excluded). The authorized query is sent to the DAS for execution, which authorizes the content of the query results before they are returned to the IBHIS system.

Inference checking. The ACS also checks the returned items and applies an inference policy to remove any information that would allow the user to infer knowledge that they would otherwise not be authorised to access.

3.4. Semantic Interoperability

This represents a major change, by which the user can effectively formulate a query using terms from a domain ontology. The advantage of using a domain ontology as the global schema is that the user can construct the query by navigating an ontology which is essentially a conceptualization of the world where he or she is operating. For example, a doctor will navigate a medical ontology which comprises a set of concepts and relationships and is structured in a way that has already been approved by the medical community (i.e. SNOMED [19]).

From the integration point of view, IBHIS is a second generation mediator system based on the existence of accurate and ever-evolving domain ontologies that can be used as global views of the application domain. The assumption of pre-existing ontologies is based on the extensive effort devoted by the scientific communities to developing domain ontologies, with the results so far proving to be very promising, especially in the medical world (i.e. SNOMED [19], GALEN [5], Gene Ontology [6], etc.).

The integration approach whereby we assume that a pre-existing stable global schema exists (in our case, the global schema is the domain ontology) is called local-as-view [9]. The choice to adopt the local-as-view approach in the second prototype of the IBHIS system was based on the fact that in large scale systems such as the NHS, the local data sources are autonomous and their local schemata are not under control of the mediator. Therefore, a global view cannot be generated. Even if this were not the case, it would be very difficult to integrate such a large number of database schemas
when it is common knowledge that schema integration is an error-prone procedure that can never be entirely automated. Finally, a global view would be hard to maintain while the structure of the underlying data sources is constantly changing. On the contrary, domain ontologies evolve through constant interaction with the user communities.

The problem of semantic interoperability is addressed through the following three key elements of our architecture:

Data Description: Our objective is to describe semantically the database schemas by extending the concepts that can be described by XML, using the capabilities offered by description logic. OWL [15] is an emerging standard recommended by W3C for describing information content based on description logic. In our second prototype we used OWL files to describe the local database schemata and the mappings to the global ontology. OWL was preferred over OWL-S [13] (another standard for semantically describing web services) because OWL-S restricts the description to inputs and outputs of operations that are initially described in WSDL.

Semantic Registry: This holds the domain ontology and the data description files. The registry is developed as a native XML database and is exposed as a Web Service. It offers an interface to the database providers in order to register or update their data description files but it also provides information to the query engine for query decomposition. Finally, the end users access the semantic registry as they navigate the global ontology during the query formulation process.

Query Engine: Responsible for planning and executing the queries. The novel feature of the query engine is that we use a reasoner to determine the mappings between local data descriptions and the global ontology.

3.5. Implementation

The IBHIS broker is composed of an interface, which is implemented using Java JSP, Java servlets, and HTML pages, together with several internal services which perform the broker functionality. The interface and services within the broker all run within IBM WebSphere Application Server v5.1 at Keele University. The broker runs on the Windows, Linux and Solaris operating systems, and porting it between these has proved a relatively straightforward exercise. As an illustration of this, the broker was originally developed on the Windows XP operating system, and the port to Linux took less than one day—with much of this time being employed in reconstructing an internal data store to use MySQL rather than Access. Subsequent porting to the Solaris operating system took less than half a day.

The interface currently implemented has been primarily designed for the purpose of demonstrating the concepts. Hence much that would be done automatically in an operational system is performed manually in order to illustrate the key aspects. For operational purposes the broker will require the creation of a set of interfaces tailored to specific users and applications. (For example: supporting the Single Assessment Process would need an emphasis upon team access; Child Protection would need easy choice of relevant agencies; and some clinical processes would require emphasis upon "drill down" investigations; etc.)

Authentication within this version of the prototype is achieved through encrypted passwords which are checked against a MySQL database. (However, experiments with other, more secure forms of authentication, such as X.509 certificates, have been successful and so could be incorporated in any field trial versions of the broker.)
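As a small illustration of this style of check, the sketch below hashes a supplied password and compares it with a digest stored in a MySQL table via JDBC. The table and column names, and the choice of SHA-1 as the digest, are assumptions made for the example rather than details taken from the prototype.

import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PasswordCheck {

    /**
     * Hashes the supplied password and compares it with the digest stored for
     * the user. The table and column names (broker_user, password_digest) and
     * the SHA-1 digest are assumptions made for this sketch.
     */
    public static boolean authenticate(Connection db, String userId, String password)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(password.getBytes("UTF-8"));

        // Encode the digest as lower-case hex, matching the assumed stored format.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }

        String sql = "SELECT 1 FROM broker_user WHERE user_id = ? AND password_digest = ?";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setString(1, userId);
            stmt.setString(2, hex.toString());
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        }
    }
}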
After initial authentication, a list of roles and teams is presented to the user (as explained in Section 3.3) to allow them to choose their credentials for a particular session. Figure 4 shows the role/team activation page. When using this, a user can select the combination of roles and team membership that will represent a specific period of duty. As Figure 4 illustrates, the system can be accessed remotely through any standard Web browser.

Figure 4. IBHIS Role/Team Activation Screen—user specific example.

Once the user has chosen a combination of roles and teams, an XML document is created to store these activated credentials. The user is then presented with the query formulation screens, which allow the user to choose what details they wish to see in a 'global' format, independent of any data source. The global terms are expressed through use of an ontology of healthcare terms, which is implemented using the OWL language [15]. The initial query formulation page is shown in Figure 5. The majority of the ontological interface was developed using Jena v2, including the inbuilt query language RDQL, which is used during the formulation process to manipulate and search the ontology [10].

Figure 5. Ontological Query Formulation Screen.

Once formulated, the query is added to the XML document with the credentials and sent as a SOAP document-style message to the Query Service. The query service is responsible for locating the relevant DAS/data sources that can provide information relevant to the query, and the subsequent decomposition and translation of the query from the global to local terms. The matchmaking process for decomposition uses the Semantic Registry developed as a native XML database using Xindice [23]. The broker performs this by locating suitable description files in the registry, and hence does not know in advance which data sources will be used for a particular query—or even which sources will be available. The decomposed queries are transformed into SOAP messages using XSL transformations [24]. The translated sub-query is then sent, again in SOAP document style, to the Access Control Service, which is responsible for authorising the query. The majority of the access control uses an extended version of the XACL enforcement mechanism and policy language [11]. This was chosen due to the compatibility with Java and the ability to create views of XML documents, but required some extensions to include all of the required credentials and to enable the dynamic authorisation of SOAP documents. The ACS then sends the query as a SOAP document to the Data Access Services, which translate the XML query into the required format for the underlying data sources (SQL in the prototype). The results are sent back as an XML document over SOAP, for integration by the query service and inference authorisation by the Access Control Service.

The internal services—such as the Access Control Service and Query Service—are implemented as Java classes and exposed as J2EE Web services using IBM's WebSphere Application Developer. This development environment aids the creation of WSDL description files and SOAP documents from Java classes, and also helped the deployment of the services in the main application servers. A combination of the Java technologies JAX-RPC and SAAJ [20] are used, the former for the internal messaging and the latter for the XML document-style messaging required for the authorisation and data access services.

The Data Access Services are distributed across the three university sites, each running in copies of WebSphere Application Server, as are the backend data sources. The heterogeneity is simulated through the use of different database management systems, specifically MySQL, IBM DB2, and Microsoft Access, running on different operating systems (Microsoft Windows 2000 and Linux).3 Figure 6 shows the current organization of the prototype in a diagrammatical form.

3 Note that a DAS is not limited to using a DBMS, and it could employ other forms of record store such as sequential files, folders etc.
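To give a flavour of the translation step performed inside a DAS when the underlying store is a relational database, the following sketch maps query attributes expressed as global ontology terms onto local column names using a metadatabase-style lookup and assembles the corresponding SQL. The term names, table layout and generated SQL are all assumptions made for the example; a production DAS would, among other things, use parameterised queries rather than string concatenation.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the global-to-local translation a DAS engine might perform.
 * The ontology terms, table and column names are illustrative assumptions.
 */
public class GlobalToSqlTranslator {

    // Metadatabase-style mapping from global ontology terms to local columns.
    private final Map<String, String> termToColumn = new LinkedHashMap<>();

    public GlobalToSqlTranslator() {
        termToColumn.put("Patient.familyName", "patient.surname");
        termToColumn.put("Patient.dateOfBirth", "patient.dob");
        termToColumn.put("Episode.diagnosis", "episode.diag_code");
    }

    /** Builds a SELECT statement for the requested global terms, skipping unmapped ones. */
    public String translate(List<String> requestedTerms, String filterTerm, String filterValue) {
        List<String> columns = new ArrayList<>();
        for (String term : requestedTerms) {
            String column = termToColumn.get(term);
            if (column != null) {
                columns.add(column);
            }
        }
        String where = termToColumn.get(filterTerm);
        return "SELECT " + String.join(", ", columns)
                + " FROM patient JOIN episode ON episode.patient_id = patient.id"
                + (where == null ? "" : " WHERE " + where + " = '" + filterValue + "'");
    }
}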
Figure 6. The distributed broker environment—prototype 2.
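Finally, the document-style SOAP messaging that ties these services together can be illustrated with the SAAJ API mentioned in Section 3.5: the sketch below wraps an already-built XML document (for example an authorised sub-query plus credentials) in a SOAP envelope and posts it synchronously to an endpoint. The class name and calling pattern are our own illustration rather than code taken from the prototype.

import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPConnection;
import javax.xml.soap.SOAPConnectionFactory;
import javax.xml.soap.SOAPException;
import javax.xml.soap.SOAPMessage;
import org.w3c.dom.Document;

public class DocumentStyleDispatcher {

    /**
     * Wraps an XML document in a SOAP envelope and sends it to the given
     * endpoint, returning the reply message.
     */
    public static SOAPMessage send(Document payload, String endpointUrl) throws SOAPException {
        SOAPMessage request = MessageFactory.newInstance().createMessage();
        SOAPBody body = request.getSOAPPart().getEnvelope().getBody();
        body.addDocument(payload);        // document-style: the whole XML payload goes in the body
        request.saveChanges();

        SOAPConnection connection = SOAPConnectionFactory.newInstance().createConnection();
        try {
            return connection.call(request, endpointUrl);   // synchronous request/response
        } finally {
            connection.close();
        }
    }
}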
4. Discussion

Overall, our experience of using Web services as the main technology with which to develop the IBHIS prototype systems has been successful. The final prototype uses Java along with extended versions of relevant technologies from the semantic Web and Web services environments (see Section 3.5) to dynamically locate, bind to, and retrieve information from heterogeneous, distributed, autonomous data sources. However, we also believe that the ideas of the IBHIS project are compatible with the aims and technologies of the Grid (interoperability and access to distributed, heterogeneous systems) and that the majority of the system could be ported to a Grid-based platform, in the process gaining all of the benefits of the two technologies.

4.1. Why Web Services?

Web services were chosen as the primary technology because, at the project's inception in January 2002, it appeared that the available Grid-based technologies were not able to provide all of the functionality that the project required. Globus, the predominant toolkit for building Grid applications, was at the time still based around low-level C APIs and was not compatible with databases [7]. IBHIS is a data-centric project, where the broker must not only be able to access any data source (including databases), but must also seamlessly retrieve and integrate data regardless of semantic or syntax related heterogeneity. Globus, and the Grid infrastructure in general, concentrated on the retrieval of large volumes of files via FTP, which was inadequate for the complex data needs of the project. Also, the Grid offered benefits in other areas that were unnecessary for IBHIS, particularly in the areas of supercomputing and high volume data transfer. When the project was initially conceived, it was viewed that, whilst IBHIS was dealing with data, this was mainly text-based electronic patient records, and that it was unlikely that the volume or size of that data would be a serious issue.
One of the aims of the IBHIS project is to ensure that the broker is scalable and evolvable, in terms of seamless access to any number of data sources, and to minimise the changes required when new data sources become available. A federated database system, for example, requires the federated schema to be updated quite significantly every time a new data source becomes available. The Pennine Group's Software as a Service research is looking into service-oriented dynamic discovery and binding as a partial solution to such problems of evolution [1,2,21]. In terms of IBHIS, a service-oriented solution enabled the data sources to be exposed as services, and for the broker dynamically to discover new data sources from the registry at run time. Also, services enable the broker to provide a consistent interface to the user, regardless of the type of databases they will be accessing. Therefore, a service-based solution seemed to offer, and indeed did offer, a viable solution to the project.

The Grid infrastructure, and consequently the Globus toolkit, has developed a great deal throughout the IBHIS project's lifetime. The Open Grid Services Architecture (OGSA) was an important development as it introduced Web services to the Grid [14]. Other projects such as OGSA-DAI (OGSA Data Access and Integration) build on top of the OGSA and allow the Grid to access databases. However, OGSA does not address dynamic discovery and binding, and in order to access data sources via a Grid infrastructure, the user must know details of the schemas and query languages of those sources.

Security within the Grid, particularly the Grid Security Infrastructure (GSI), is relatively low-level, concentrating on network security and authentication. In contrast, security in the IBHIS system is chiefly concerned with access control and with representation of policies. The Community Authorization Service (CAS) is the component of the Globus Toolkit dealing with access control within Grid communities and virtual organizations [16]. However, it appears to be relatively coarse-grained and deals with access to resources, while IBHIS is concerned with fine-grained access to individual data items. CAS also offers no direct support for dynamic authorization of SOAP/XML documents, nor for documents that emanate from autonomous organizations. However, the OGSA is rapidly changing and new versions of the Globus toolkit appear to address many of these issues. Therefore, porting the IBHIS prototype to a Grid infrastructure would enable the project to integrate the benefits of dynamic service-oriented data access with the scale and volume capabilities of the Grid. The following sub-section examines the benefits of porting IBHIS onto Grid structures in more detail and describes what would be involved and some of the preliminary work we have performed in this area.

4.2. A Grid Infrastructure for IBHIS

The major benefit to using IBHIS within a Grid structure is that it would open up the system to other, larger data types and sources. Rather than querying largely text-based patient data, it could retrieve large graphic images (for example, X-ray data) from many disparate sources. It could also deal with a larger volume of data than has previously been used in the IBHIS project test cases, for example tens of thousands, or more, of patient records, all with complex image and text related data.
It also allows the project to utilise the ongoing standardisation efforts within the Grid community, such as the Web Services Resource Framework (WSRF) [4]. WSRF is designed to allow Grid applications built using Web services to describe and manage ‘state’, so that data can persist across Web services interactions. Whilst stateful Web services go beyond
the scope of the IBHIS project, WSRF is designed to describe resources and data and hence could be used as part of the DAS descriptions if IBHIS were ported to the Grid. Likewise, the Grid could benefit from the major IBHIS developments in servicebased data discovery, distributed access control, and semantic interoperability, through merging of the IBHIS work in XML (for example, DAS descriptions and access policies) with the Grid standards. Work is currently under way to port the prototype to a Grid infrastructure, using the existing data sources and the Globus toolkit v3.2 to build the Grid services. The work so far, particularly in exposing the Data Access Services within a Grid framework, has been positive. This is in part because DAS design already employs Web service technologies.
5. Conclusions

As indicated in the preceding sections, the information broker concept, as realized in the IBHIS project, raises many technical challenges while also having much to offer if it were to be mapped into a Grid context. Our aim has been to highlight the challenges that would need to be addressed in such a context, drawing upon our experiences with creating the broker. Our main conclusion from this is that bringing the two developments together would have much to offer to users, both by extending the Grid-based provisions available to them, and also by extending the scope of IBHIS in terms of giving access to a wider variety of information source forms.
Acknowledgements

Our thanks are due to all who helped with assembling our case studies, especially the team at the Meadows Centre, Solihull, to Philip Woodall and the anonymous reviewers for suggestions and comments, and to EPSRC for funding the IBHIS project.
References

[1] Bennett, K.H., Layzell, P.J., Budgen, D., Brereton, P., Macaulay, L., and Munro, M. (2000). Service-Based Software: The Future for Flexible Software, in Proceedings of 7th Asia-Pacific Software Engineering Conference, IEEE Computer Society Press, pp. 214-221.
[2] Bennett, K.H., Munro, M., Gold, N.E., Layzell, P.J., Budgen, D., and Brereton, O.P. (2001). An Architectural Model for Service-Based Software with Ultra-Rapid Evolution, in Proceedings of ICSM'01, IEEE Computer Society Press, pp. 292-300.
[3] Budgen, D., Brereton, P. and Turner, M. (2004). Codifying a Service Architectural Style, in Proceedings of COMPSAC 2004, IEEE Computer Society Press, 16-22.
[4] Czajkowski, K., Ferguson, D., Foster, I., Frey, J., Graham, S., Maguire, T., Snelling, D., and Tuecke, S. (2004). From Open Grid Services Infrastructure to WS-Resource Framework: Refactoring and Evolution. Version 1.1, White Paper. Available: http://www.globus.org/wsrf/specs/ogsi_to_wsrf_1.0.pdf.
[5] GALEN, http://www.opengalen.org/.
[6] Gene Ontology Consortium, http://www.geneontology.org/.
[7] Globus Toolkit. Details available: http://www.globus.org.
[8] Kotsiopoulos, I., Keane, J., Turner, M., Layzell, P.J. and Zhu, F. (2003). IBHIS: Integration Broker for Heterogeneous Information Sources, in Proceedings of COMPSAC 2003, IEEE Computer Society Press, 378-384.
[9] Halevy, A.Y. (2001). Answering Queries Using Views: A Survey, VLDB Journal, 10(4), 270-294.
[10] Jena—A Semantic Web Framework for Java: http://jena.sourceforge.net.
[11] Kudo, M. and Hada, S. (2001). Access Control Model with Provisional Actions, IEICE Trans. Fundamentals, E84-A, 72-88.
[12] McGuinness, D.L. and van Harmelen, F. (2003). OWL Web Ontology Language Overview, World Wide Web Consortium (W3C) Candidate Recommendation, http://www.w3.org/TR/owl-features.
[13] McIlraith, S. and Martin, D. (2003). Bringing Semantics to Web Services, IEEE Intelligent Systems, 18(1), 90-93.
[14] Open Grid Services Architecture: http://www.gridforum.org/L_WG/News/OGSA%20Flyer_v31.pdf.
[15] Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F. and Stein, L.A., OWL Web Ontology Language Reference, W3C Recommendation, http://www.w3.org/TR/owl-ref/.
[16] Pearlman, L., Welch, V., Foster, I., Kesselman, C. and Tuecke, S. (2002). A Community Authorization Service for Group Collaboration, in Proceedings of 3rd International Workshop on Policies for Distributed Systems and Networks, IEEE Computer Society Press, 50-59.
[17] Rigby, M.J., Budgen, D., Brereton, O.P., Bennett, K.H., Layzell, P.J., Keane, J.A., Russell, M.J., Kotsiopoulos, I., Turner, M. and Zhu, F. (2005). A Dynamic Data-Gatherer as an Emergent Alternative to Supra-Enterprise EPR Systems, accepted for HealthCare Computing 2005.
[18] Sandhu, R.S., Coyne, E., Feinstein, H., and Youman, C. (1996). Role-Based Access Control Models, IEEE Computer, 29(2), 38-47.
[19] SNOMED International, http://www.snomed.org/.
[20] SOAP with Attachments API for Java (SAAJ): http://java.sun.com/xml/saaj/index.jsp.
[21] Turner, M., Budgen, D. and Brereton, O.P. (2003). Turning Software into a Service, IEEE Computer, 36(10), 38-44.
[22] Turner, M., Zhu, F., Kotsiopoulos, I., Russell, M., Budgen, D., Bennett, K., Brereton, P., Keane, J., Layzell, P., and Rigby, M. (2004). Using Web Service Technologies to create an Information Broker: An Experience Report, in Proceedings of ICSE 2004, IEEE Computer Society Press, 552-561.
[23] Apache Xindice: http://xml.apache.org/xindice.
[24] XSL Transformations (XSLT): http://www.w3.org/TR/xslt, W3C Recommendation, November 1999.
[25] Zhu, F., Turner, M., Kotsiopoulos, I., Bennett, K., Russell, M., Budgen, D., Brereton, P., Keane, J., Layzell, P., Rigby, M. and Xu, J. (2004). Dynamic Data Integration Using Web Services, in Proceedings of 2nd International Conference on Web Services (ICWS 2004), IEEE Computer Society Press, 262-269.
Using Ontologies in PROTEUS for Modeling Proteomics Data Mining Applications Mario CANNATARO 1 , Pietro Hiram GUZZI, Tommaso MAZZA, Giuseppe TRADIGO and Pierangelo VELTRI Magna Græcia University, 88100 Catanzaro, Italy
Abstract. Bioinformatics applications are often characterized by a combination of (pre)processing of raw data representing biological elements (e.g. sequence alignment, structure prediction) and high-level data mining analysis. Developing such applications requires knowledge of both the data mining and the bioinformatics domains, which can be effectively provided by combining an ontology of the application domain with an ontology of the approaches and processes used to solve the given problem. In this paper we discuss the use of ontologies to model proteomics in silico experiments; in particular, data mining of mass spectrometry proteomics data is considered.
Keywords. Proteomics, Data mining, Ontologies, PROTEUS
1. Introduction
Bioinformatics is an emerging research field that aims to support biological research with informatics tools. In this sense bioinformatics is a multidisciplinary area that involves performing critical experiments (e.g. sequence alignment, structure prediction), organizing and storing the collected data (e.g. protein data banks), and extracting knowledge from data and sharing it (e.g. verifying hypotheses about diseases, sharing commonly agreed biomedical practices and protocols). From a broader perspective, biomedical experiments whose data are analyzed through bioinformatics platforms involve different technologies such as mass spectrometry, bio-molecular profiling, nanotechnology, computational chemistry, drug design, and so on. The way in which data are produced, the possible errors affecting them, and the assumptions and approaches used to analyze them, all known to biomedical domain experts, should be taken into account when choosing a particular data mining approach or algorithm. The problem addressed here is how to enhance the design of complex "in silico" experiments by combining, in a single bioinformatics platform, both data mining and biomedical knowledge. The basic technologies used are Data Mining (DM) to analyze data, Ontologies to model knowledge, and Workflows to design experiments comprising several steps, involving different informatics tools and spanning different domains.1
1 Correspondence to: Mario Cannataro, Magna Græcia University, Via T. Campanella 115, 88100 Catanzaro, Italy. Tel.: +39 0961 369 4001; Fax: +39 0961 369 4075; E-mail:
[email protected].
In bioinformatics, data mining [16] is useful both in extracting knowledge from articles using text mining and in extracting rules and models from databases and experimental data. Ontologies have a broad range of applicability in bioinformatics [13], such as the classification of medical concepts and data, database integration [12] and collaboration between different groups. A workflow is a partial or total automation of a process in which a collection of activities must be executed according to certain procedural rules. Workflow Management Systems (WfMSs) support the design of workflows and their enactment by scheduling the different activities on the available entities [17]. PROTEUS [3] is a Grid-based Problem Solving Environment that uses ontologies to model the application domain and workflow techniques to compose distributed in silico applications, and is built on Grid middleware. Mass Spectrometry (MS) is a widely used technique for the mass spectral identification of the thousands of proteins that populate complex biosystems such as serum and tissue. The combined use of MS and DM is a novel approach to proteomic pattern analysis and is emerging as an effective method for the early diagnosis of diseases [5]. The main goal of this paper is to show how ontologies are used in PROTEUS to model and compose "in silico" proteomics experiments in which data mining techniques are used to find patterns in, and classify, mass spectrometry proteomic data. The rest of the paper is organized as follows. Section 2 describes the workflow of a representative proteomic data mining application. Section 3 describes the ontologies in PROTEUS and their use in the design of proteomic experiments. Finally, Section 4 concludes the paper and outlines future work.
2. Workflow of a Representative Proteomic Experiment
The biomedical research group of our university is studying breast cancer and the overexpression of HSP90 (heat shock proteins) in chemo-resistant patients [4]. The study aims to discover where this overexpression occurs, in order to block the excess production of HSP90 and, consequently, to test whether this hypothesis is valid and useful for making chemotherapy effective. Mass spectrometry is currently a very active research area, and this approach is quickly becoming a powerful technique for identifying different molecular targets in different pathological conditions. The proteomics experiment comprises two main phases:
1. Mass Spectrometry analysis. This phase receives as input a set of biological samples (e.g. cells, tissues, serum) and produces as output a set of raw data (spectra). It comprises the following sub-phases: Sample Preparation (Cell Culture, Tissue, Serum), Proteins Extraction, ICAT protocol, Mass Spectrometry.
2. Data Mining analysis. This phase comprises three main sub-phases: Data Preprocessing, Data Clustering and Data Classification.
Mass Spectrometry analysis. Sample Preparation, Proteins Extraction, and ICAT Protocol refer, respectively, to the choice of the samples to be analyzed (in our experiments we consider serum, tissue, and cell culture samples), the selection of proteins from the samples, and their treatment before mass spectrometry.
Figure 1. Example of a MS spectrum, Low Mw window: 1000-12000 m/z.
Mass spectrometry is a powerful methodology for determining the masses of biomolecules and biomolecular fragments present in a complex sample mixture. Mass spectrometry data is represented, at a first stage, as a (large) sequence of value pairs, where each pair contains a measured intensity, which depends on the quantity of the detected biomolecule, and a mass to charge ratio (m/Z), which depends on the molecular mass of the detected biomolecule. Due to the large number of m/Z values contained in a mass spectrum obtained from a real sample, analysis by manual inspection is not feasible. Usually mass spectra are represented in graphical form as in Figure 1.
Data Mining analysis. Biological data mining is an emerging research area. The high volume of mass spectrometry data is a natural application field for data mining techniques: large sequences of m/Z data contain a lot of information in an implicit way. Manual inspection of experimental data is difficult, even though the shape of the peak list is biologically relevant. For this reason, computational methods and soft-computing techniques can enable automatic clustering, classification, and pattern discovery. Our aim is to build clusters (e.g. diseased, healthy) into which each newly collected spectrum can be classified. This process requires identifying the distinctive characteristics of each group and then finding them in new spectra. Early detection of cancer can benefit from the high throughput of mass spectrometry and computational methods; the works in [5], [11] and [15] describe the application of these methods to ovarian, prostate and bladder cancer. The Data Mining analysis comprises a preprocessing phase, which is particularly important and complex for biological data, and the application of machine learning techniques such as classification and clustering.
Data Preprocessing. For MS data, preprocessing is the process of cleaning up spectrum noise and contaminants. This phase can also be used to reduce the dimensional complexity of the problem, but it is important to use efficient and biologically consistent algorithms; this is currently an open problem. Each point of a spectrum is the result of two measurements (intensity and m/Z) and is corrupted by noise. Preprocessing aims to correct intensity and m/Z values in order: (i) to reduce noise, (ii) to reduce the amount of data, and (iii) to make spectra comparable. The noise reduction and normalization activities, sketched in the code example below, comprise:
• Identification of Base Line. This step identifies the base intensity level (baseline) of each mass spectrum, which varies from sample to sample, and subtracts it. The underlying hypothesis is that the baseline is a form of variable noise.
• Normalization of Intensities. This step enables the comparison of different samples, since the absolute peak values of different fractions of a spectrum may be incomparable.
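A minimal sketch of these two steps is given below. The baseline estimator (running minimum plus moving average), the window size, the total-ion-current normalisation and the file name are illustrative assumptions, not the algorithms actually used in our experiments.

```python
import numpy as np

def subtract_baseline(intensities: np.ndarray, window: int = 101) -> np.ndarray:
    """Estimate a slowly varying baseline (running minimum smoothed by a
    moving average) and subtract it; an illustrative choice of estimator."""
    n = len(intensities)
    half = window // 2
    # running minimum over a sliding window
    run_min = np.array([intensities[max(0, i - half):min(n, i + half + 1)].min()
                        for i in range(n)])
    # smooth the running minimum with a moving average
    kernel = np.ones(window) / window
    baseline = np.convolve(run_min, kernel, mode="same")
    return np.clip(intensities - baseline, 0.0, None)

def normalize_tic(intensities: np.ndarray) -> np.ndarray:
    """Normalize by total ion current so that different spectra are comparable."""
    total = intensities.sum()
    return intensities / total if total > 0 else intensities

# usage: a spectrum stored as (m/Z, intensity) columns in a hypothetical text file
mz, intensity = np.loadtxt("spectrum.txt", unpack=True)
clean = normalize_tic(subtract_baseline(intensity))
```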
Figure 2. Workflow of a proteomic data mining application.
The methods for data reduction are:
• Binning. A linear and iterative function calculates, for each interval of m/Z values, an aggregate intensity (e.g. the sum) of the peak intensities detected in that interval, and substitutes all peaks present in the interval with a single peak having the aggregate intensity and a representative (e.g. mean) m/Z value.
• Identification and extraction of peaks. This consists of separating real peaks (e.g. corresponding to peptides) from peaks representing noise. It can be performed using the data processing embedded in the mass spectrometer or, in the case of raw data, by a custom identification method, which we are studying, that satisfies both informatics and biological considerations.
Finally, to allow an easy comparison of different spectra, the following method is used:
• Alignment of corresponding peaks. This finds a common set of peak locations (m/Z values) in a set of spectra, so that all spectra have common m/Z values.
Data Clustering and Classification. In our system we use the Q5 classification algorithm [9]. The version of Q5 used in our experiment, which works on the whole spectrum, is implemented as a Matlab script and is freely downloadable (see http://www.cs.dartmouth.edu). Q5 is a solution to the problem of classifying complete mass spectra of a complex protein mixture. Q5 employs a probabilistic classification algorithm built upon a dimension-reduced linear discriminant analysis: it uses Principal Component Analysis (PCA) to reduce the dimensionality of a data set without great information loss, and Linear Discriminant Analysis (LDA) to discriminate between classes (a simplified sketch of this PCA plus LDA scheme is given at the end of this section). From a geometrical point of view, each spectrum can be represented as a vector in an n-dimensional space; PCA takes the cloud of data points and rotates it so that the maximum variability becomes visible, i.e. it identifies the most important gradients. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques and can provide clues as to the molecular identities of differentially expressed proteins and peptides. Figure 2 shows the main steps of a proteomic data mining application, where the raw mass spectra produced by the spectrometer are first preprocessed using a combination of the techniques discussed above, and then classified using supervised classification algorithms.
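As a simplified stand-in for the Q5 approach described above (not the authors' Matlab implementation), the following sketch reduces preprocessed, aligned spectra with PCA and classifies them with LDA using scikit-learn; the number of components and the input file names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# X: one row per preprocessed, aligned spectrum; one column per common m/Z value
# y: class labels, e.g. 0 = healthy, 1 = diseased (hypothetical files)
X = np.load("aligned_spectra.npy")
y = np.load("labels.npy")

# PCA reduces the dimensionality of the spectra, LDA separates the classes
model = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```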
3. Using Domain Ontologies in PROTEUS to Enhance Application Design
PROTEUS [3] is a Grid-based Problem Solving Environment for bioinformatics applications. It allows modelling, building and executing bioinformatics applications on the Grid.
Figure 3. PROTEUS architecture.
It combines existing software tools and data sources by (i) adding metadata to software tools and data sources, (ii) modelling biological tasks and classifying bioinformatics resources through ontologies, (iii) representing applications through workflows, and (iv) offering pre-packaged bioinformatics applications. Figure 3 shows the main components of the PROTEUS architecture. The Component and Application Library contains software tools, databases, data sources, and user-defined bioinformatics applications, whose metadata are contained in the Metadata Repository. The Ontology Repository contains ontologies describing, respectively, biological concepts, bioinformatics tasks, and user-defined bioinformatics applications represented as workflows. The Ontology-based Workflow Designer allows the design of a bioinformatics application as a workflow of software and data components selected by searching the PROTEUS ontologies. It comprises the Ontology-based Assistant, which suggests available tools for a given bioinformatics problem, and the Workflow User Interface, used to produce the workflow schema stored in the Workflow Metadata Repository. Finally, the WF-model Wrapper maps an abstract workflow schema into a schedulable workflow, which in turn is scheduled (i.e. enacted) on the Grid by the Workflow Engine. The rest of the section describes the role of domain ontologies in PROTEUS. The DAMON ontology represents the set of tools and methodologies for conducting data mining analysis on data, whereas the PROTON ontology describes specific features and details of the domain under investigation, i.e. proteomics and mass spectrometry experiments. The former is used to choose the kind of analysis to be conducted, whereas the latter describes how specific data are represented and should be processed.
3.1. DAMON: an Ontology for the Data Mining Domain
DAMON represents the features of the main data mining software tools, classifying their main components and highlighting the relationships and constraints among them [2]. The categorization of the data mining software has been made on the basis of the following classification parameters:
• the data mining task performed by the software;
• the type of methodologies that the software uses in the data mining process;
• the kind of data sources the software works on;
• the degree of required interaction with the user.
Figure 4. Modelling of protein concept.
In DAMON we have several small local taxonomies derived from the specialization of the basic classes. By browsing DAMON a researcher can choose the optimal data mining techniques to use: given a specific kind of data, a user can browse the ontology taxonomy and find the preprocessing phase, the algorithms and their software implementations. In summary, the DAMON ontology allows the semantic (concept-based) search of data mining software and other data mining resources, and suggests to the user the methods and software to use on the basis of the stored knowledge and the user's requirements and needs.
3.2. PROTON: an Ontology for the Proteomics Domain
The PROTON ontology models concepts, methods, algorithms, tools and databases relevant to the proteomics domain. Using a top-down approach, we first defined classes for the general proteomics concepts and then specialized them. The rationale is that a fundamental distinction exists between biological concepts (such as proteins) and non-biological concepts (such as software).
3.2.1. Classification of Biological Concepts
In developing PROTON we assumed that a protein is a tangible thing that has no temporal or spatial dynamics, and we summarize post-translational modifications as an attribute. The justification for this choice is that all the algorithms operate on static representations of proteins in their native forms. In our conceptualization we introduce three primary entities: (i) Aminoacid, (ii) Protein, (iii) Structure. The Structure concept can be specialized into Primary, Secondary, Tertiary and Quaternary, respecting the biological classification. Figure 4 shows a UML model of these concepts. It is important to note that each protein has a unique identifier and a unique primary structure; these determine the spatial conformation, that is, the secondary and tertiary structures, whereas the quaternary structure identifies globular proteins. In this way, we can assign to each protein its own structures.
3.2.2. Classification of Non-Biological Concepts
In the following we describe the main non-biological concepts and their specializations.
Figure 5. Modelling of protein analysis concept.
Analysis. With this concept we model the theoretical study of proteins. Introducing this concept is important since, in medical research, the literature citations of a method are a criterion of evaluation. This class can be specialized into the following subclasses:
• Mass-Spectra Analysis
• Interaction
• Primary Structure Analysis
• Secondary Structure Analysis
• Tertiary Structure Analysis
• Quaternary Structure Analysis
It is evident that simple relations exist between the different kinds of analysis, as depicted in Figure 5.
Task. A Task is a concrete problem that the researcher has to solve. The specialization of this concept into sub-classes follows the principal areas of proteomic study and comprises the following sub-classes (see Figure 6): (i) Interpretation of MS data has several different aspects, detailed in its sub-classes, and can be used to identify a protein [14] or to recognize a disease; (ii) Alignment is a classical task in proteomics; it can be specialized into Sequence Alignment [10], when primary structures are compared, and Structural Alignment [8], when spatial conformations are compared; (iii) Prediction is a classical problem in proteomics: a researcher is often interested in predicting the secondary or tertiary structure of a protein starting from its primary sequence [7].
Method. A Method is a way to perform a Task (see Figure 7).
Software. A Software is an implementation of a Method as a web server, a stand-alone application or a grid node.
Figure 6. Task classification.
Figure 7. Method classification.
3.2.3. Relations Between Domain Concepts
Having classified all the domain concepts, we can define the relations between concepts of different classes. The main relations are discussed in the following. is Chain of expresses that a protein is a sequence of aminoacids. Has A links a protein to its own structures. Studies links a particular analysis to the proteic structure it examines. Implements links a software tool to the method it implements.
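To make these concepts and relations concrete, the following is a minimal sketch of how the PROTON classes and the relations above might be expressed as plain Python classes; it is an illustration of the model, not the ontology's actual encoding, and all field names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Aminoacid:
    code: str                      # one-letter amino acid code

@dataclass
class Structure:
    level: str                     # "primary", "secondary", "tertiary" or "quaternary"

@dataclass
class Protein:
    identifier: str                # unique protein identifier
    chain: List[Aminoacid]         # "is Chain of": a protein is a sequence of aminoacids
    structures: List[Structure]    # "Has A": a protein is linked to its own structures

@dataclass
class Method:
    name: str                      # e.g. "PCA + LDA classification"

@dataclass
class Software:
    name: str
    implements: Method             # "Implements": a software tool realizes a method

@dataclass
class Analysis:
    kind: str                      # e.g. "Mass-Spectra Analysis"
    studies: Structure             # "Studies": an analysis is about a proteic structure
```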
Figure 8. Fragment of DAMON ontology showing Q5.
3.3. Ontology-Based Design of Proteomics Applications on PROTEUS
Performing a complete in silico experiment requires cooperation between the biology and bioinformatics groups, and ontologies can enable such cooperation by linking experimental research and bioinformatics tools. If a researcher needs to cluster spectral data in order to recognize healthy/diseased conditions in serum, he/she can browse PROTON and find, in the Analysis taxonomy, the related literature and thus the methods that have already been used. He/she can then explore the Method taxonomy and find tried-and-tested methods, such as soft computing techniques, probabilistic techniques, data preprocessing techniques, and so on. Finally, links to Algorithms can be found (e.g., the Q5 algorithm [6]) and, through the relations between PROTON and DAMON, the user can find in the DAMON ontology the software implementing the algorithm (e.g., the software implementing Q5) and all related information. This integration allows the collaboration between two heterogeneous groups of researchers: biologists and bioinformaticians. The design and execution of an application on PROTEUS comprises the following steps:
Ontology-based component selection. Browsing PROTON, a user first finds the literature related to the problem under investigation in order to select a method that has already been tested. The user then selects the software resources connected to the chosen method (i.e. the software tools implementing the method and the servers where they are offered). In our experiment, the analysis of mass spectrometry data, a user can expand and visit the individuals of Data Interpretation Tools and find Data Preprocessing, Data Clustering, and Data Classification. When a method for Classification (e.g. Q5) is selected, the user can browse the DAMON ontology and find all the related information and resources, as shown in Figure 8.
Workflow design. The selected components are combined to produce a workflow schema (a purely illustrative sketch of such a schema is given below) that can be translated into a standard language, such as those specified by the Workflow Management Coalition (WfMC) [17].
Application execution on the Grid. The workflow is scheduled by a workflow engine on the Grid. In particular, PROTEUS processes take care of the communication between workflow engines and of data movement. In turn, such functions leverage Grid middleware services, e.g. Globus [1].
Results visualization and storing. After application execution and result collection, the user can enrich and extend the PROTEUS ontologies.
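A possible, purely illustrative shape for the abstract workflow schema produced by the ontology-based designer is sketched below; the step names, fields and tool identifiers are assumptions and do not reflect the concrete format used by PROTEUS or by WfMC languages.

```python
# Hypothetical representation of the proteomics workflow assembled from
# components selected through PROTON and DAMON.
proteomics_workflow = {
    "name": "proteomics-data-mining",
    "steps": [
        {"id": "preprocess", "tool": "spectrum-preprocessor",
         "params": {"baseline": True, "normalize": "tic", "binning": 0.4}},
        {"id": "classify", "tool": "Q5",
         "params": {"pca_components": 20}, "depends_on": ["preprocess"]},
        {"id": "store", "tool": "result-archiver", "depends_on": ["classify"]},
    ],
}

def execution_order(workflow):
    """Return step ids in an order that respects the declared dependencies."""
    done, order = set(), []
    steps = {s["id"]: s for s in workflow["steps"]}
    while len(order) < len(steps):
        progressed = False
        for sid, step in steps.items():
            if sid not in done and all(d in done for d in step.get("depends_on", [])):
                order.append(sid)
                done.add(sid)
                progressed = True
        if not progressed:
            raise ValueError("cyclic or unsatisfiable dependencies")
    return order

print(execution_order(proteomics_workflow))   # ['preprocess', 'classify', 'store']
```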
4. Conclusion and Future Work
We have presented the use of different ontologies to model a proteomics experiment. We focused on the use of a proteomics ontology to retrieve methods useful for solving bioinformatics problems, and of a data mining ontology to select computational methods. Browsing both ontologies can bridge the gap between the biological and informatics contexts and allows a biologist to drive the computational methods. This idea has been implemented in PROTEUS, a Grid-based Problem Solving Environment. Our aim is to continue the integration between PROTON and DAMON, i.e. between the domain
of a problem (proteomics analysis) and the domain of candidate solutions (data mining), in order to develop a bioinformatics ontology that covers the phases of in silico proteomics experiments.
References [1] Globus Alliance, Globus toolkit, http://www.globus.org/. [2] M. Cannataro and C. Comito, A DataMining Ontology for Grid Programming, Workshop on Semantics in Peer-to-Peer and Grid Computing (in conj. with WWW2003) (Budapest, Hungary), 2003. [3] M. Cannataro, C. Comito, F. Lo Schiavo, and P. Veltri, Proteus, a Grid based Problem Solving Environment for Bioinformatics: Architecture and Experiments, IEEE Computational Intelligence Bulletin 3 (2004), no. 1, 7–18. [4] G. Cuda, M. Cannataro, B. Quaresima, F. Baudi, R. Casadonte, M.C. Faniello, P. Tagliaferri, P. Veltri, F. Costanzo, and S. Venuta, Proteomic Profiling of Inherited Breast Cancer: Identification of Molecular Targets for Early Detection, Prognosis and Treatment, and Related Bioinformatics Tools, WIRN 2003, LNCS (Vietri sul Mare), Neural Nets, vol. 2859, Springer Verlag, 2003. [5] Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC and Liotta LA, Use of proteomic patterns in serum to identify ovarian cancer, Lancet 359 (2002), 572–577. [6] R.H. Lilien, H. Farid and B.R. Donald, Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum, Journal of Computational Biology (2002). [7] C. Guerra and S. Istrail (Eds.), Mathematical methods for protein structure analysis and design, LNBI 2666, Springer, 2000. [8] R. Lathrop, R. Jr, J. Bienkowska, B. Bryant, L. Butorovic, C. Gaitatzes, R. Nambudripad, J. White, and T. Smith, Analysis and algorithms for protein sequence-structure alignment, Comp. Methods in Molecular Biology 12 (1998), 227–283. [9] Ryan H. Lilien, Hany Farid, and Bruce R. Donald, Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum, Journal of Computational Biology 10 (2003), no. 6, 925–946. [10] H. Carrillo and D. Lipman, The multiple sequence alignment problem in biology, J. Appl. Math. 48 (1988), 1073–1082. [11] Adam BL, Vlahou A, Semmes OJ and Wright GL Jr, Proteomic approaches to biomarker discovery in prostate and bladder cancers, Proteomics 1 (2001), 1264–1270. [12] Koler J and Steffen Schulze-Kremer, The semantic metadatabase (SEMEDA): ontology based integration of federated molecular biological data sources, In Silico Biology (2002), 0021. [13] Steffen Schulze-Kremer, Ontologies for molecular biology and bioinformatics, In Silico Biology (2002), 2. [14] V. Dancik, T. Addona, K.R. Clauser, J.E. Vath and P.A. Pevzner, De novo peptide sequencing via tandem mass spectrometry, Journal of Computational Biology 6 (1999), 327–342. [15] Mendrinos S, Patel K, Kondylis FI, Gong L, Nasim S, Vlahou A, Schellhammer PF and Wright GL Jr, Development of a novel proteomic approach for the detection of transitional cell carcinoma of the bladder in urine, Am J Pathol 158 (2001), 1491–1502. [16] Jason Tsong-Li Wang, Qicheng Ma, Dennis Shasha, and Cathy H. Wu, Application of neural networks to biological data mining: a case study in protein sequence classification, Knowledge Discovery and Data Mining, 2000, pp. 305–309. [17] WfMC, Workflow management coalition reference model, http://www.wfmc.org.
Applying a Grid Technology to Protein Structure Predictor "ROKKY" Kazutoshi FUJIKAWA a,1 , Wenzhen JIN b , Sung-Joon PARK b , Tadaomi FURUTA b , Shoji TAKADA b , Hiroshi ARIKAWA c , Susumu DATE d and Shinji SHIMOJO e a Information Technology Center, Nara Institute of Science and Technology b Department of Chemistry, Kobe University c Graduate School of Information Science, Nara Institute of Science and Technology d Graduate School of Information Science and Technology, Osaka University e Cybermedia Center, Osaka University
Abstract. This paper describes a sub-project of the BioGrid project called the "HTC (High Throughput Computing) group." Protein structure prediction, which requires a large amount of computational resources, is generally carried out by trial and error. The HTC group has been developing a high throughput computing system with a flexible workflow handling mechanism for protein structure prediction. In this paper, we show how to apply our high throughput computing system to the protein structure predictor "ROKKY."
Keywords. High Throughput Computing, Workflow, Protein Structure Prediction, GUI
1. Introduction
As an infrastructure, IT (information technology) opens up great possibilities for science. Conventional fields such as genome information science and financial engineering can be advanced further by applying IT. In the life sciences in particular, collaboration among industry, academia and government establishes a global competitive edge through the multiplier effects of actively exploiting computer technology and network environments; in addition, we can expect the establishment of venture businesses based on the developed technology. Construction of a Supercomputer Network, otherwise known as the BioGrid project [8], is one of the national R&D projects in the IT-Program granted by the Ministry of Education, Culture, Sports, Science and Technology since 2002. The project promotes applied IT research specialized in medical and bio science, fields in which Osaka University and other relevant institutions in the KANSAI region of Japan have taken a lead on a global basis. Equipped with advanced computational resources and networks, the Cybermedia Center (CMC) of Osaka University is considered the core of the project, conducting the integration of applied technology in these fields.1
1 Correspondence to: Kazutoshi Fujikawa, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan. Tel.: +81 743 72 5151; Fax: +81 743 72 5149; E-mail:
[email protected].
While maintaining a wide range of collaborative research with national research institutions and private companies, we pursue rapid technical development in the very competitive fields of IT, medical science, and biology. We also create a new R&D environment that promotes business opportunities for the applied technology developed in the project. The BioGrid project has five R&D groups. One of them, the High Throughput Computing (HTC) group, a cooperation between IT researchers and biological and medical scientists, has been developing a job/task management platform for Grid environments for biological and medical scientists. Many biological and medical scientists expect Grid technology to accelerate their research, because Grid environments provide more computational resources and Grid middleware has extended its functionality so that computational resources can be used flexibly and seamlessly. However, since most Grid middleware requires special knowledge or techniques, these scientists cannot easily use Grid environments. Moreover, biological and medical scientists carry out scientific simulations by trial-and-error ("cut-and-try") methods in search of appropriate results: they may modify a simulation by changing the simulation method or the input parameters according to the intermediate results. However, most Grid systems are not suitable for simulations carried out in this way, so scientists cannot make effective use of the limited computational resources. To solve these problems, the HTC group aims to realize a Grid environment that biological and medical scientists can use without any special knowledge or technique. In this paper, we propose a new Grid system with a graphical user interface (GUI) tool that supports such trial-and-error methods. Through the proposed Grid system, users can easily manage their own jobs/tasks by specifying a strategy for job/task execution, and the system can use computational resources efficiently according to the specified strategy. We apply our Grid system to protein structure prediction. This paper is organized as follows: Section 2 describes the BioGrid project and the HTC group. Section 3 gives an overview of protein structure prediction and the prediction server "ROKKY." Section 4 discusses the requirements for applying Grid technology to ROKKY, and Section 5 describes the proposed high throughput computing system, which supports ROKKY, in detail. Finally, we conclude the paper in Section 6.
2. BioGrid Project and High Throughput Computing Group
Dramatic advances in science and technology have made it possible for us to observe not only microscopic cells but also the mechanisms of the brain, using leading-edge and sophisticated devices such as MEG (magnetoencephalography). Additionally, new protein structures and gene information have been clarified and stored in databases, which may hold clues to our inquiry into the essence of the biological life process and contribute to future medical treatment, although these databases are so diverse and distributed that no one could draw a whole picture. Simulation may become possible at the atomic, molecular, cellular and organ levels, although it requires huge computational resources. We consider Grid technology to be the glue that integrates observation devices, databases and computational resources for advanced life science. In the BioGrid project, we tackle this goal with the following five groups, shown in Figure 1.
Figure 1. Project Members of BioGrid Project. Project leader: Shinji Shimojo (Osaka University); management committee; advisory committee; project management: Senri International Information Institute. Core Grid group (leader: Susumu Date, Osaka University): Osaka Univ., NEC System Technologies, Mitsui Knowledge Industry, Senri International Information Institute. Data Grid group (leader: Hideo Matsuda, Osaka University): Osaka Univ., Kyoto Univ., Hitachi Software Engineering, Protein Research Foundation, Hewlett-Packard Japan, Fujitsu Kyushu System Engineering, Mitsubishi Space Software, Aztec System. Computing Grid group (leader: Haruki Nakamura, Osaka University): Osaka Univ., NEC Fundamental Research Laboratories, Hitachi, Biomolecular Engineering Research Institute. HTC group (leader: Kazutoshi Fujikawa, Nara Institute of Science and Technology): Osaka Univ., Kobe Univ., Nara Institute of Science and Technology, Japan Atomic Energy Research Institute. Telescience group (leader: Naoto Yagi, Japan Synchrotron Radiation Research Institute): Osaka Univ., Japan Synchrotron Radiation Research Institute, National Institute of Advanced Industrial Science and Technology.
1) The Core Grid technology group is responsible for producing the Grid technologies with which a research infrastructure for advanced life sciences is established, as well as for supporting the other groups' research and development. To realize this research infrastructure, the group conducts research and development driven by the needs, wants, and demands of researchers and scientists. 2) The Data Grid group is developing several fundamental technologies for making full use of biological information through the seamless federation of multi-scale databases with Grid technology. The group has been conducting R&D on a system that allows data to be retrieved from protein and compound databases seamlessly. 3) The Computing Grid group has the responsibility of gridifying computational programs at various levels, from microscopic-level analysis of electrons and molecules to macroscopic-level studies of cells and organs. This group also tries to integrate the different computational methods on the Grid architecture to promote life science research further. 4) The High Throughput Computing (HTC) group tries to enhance the computational throughput for a variety of bioscience applications through the dynamic aggregation of computers on the Grid. This group has worked on building an analytic workflow for an actual scientific problem, protein 3D structure prediction, and on realizing efficient DB-based administration of the large number of analytic results generated on the Grid. 5) The Telescience group aims at improving the throughput of special devices such as ultra-high voltage electron microscopy by reducing image processing time; it also targets providing telecontrol functions and improving the accuracy of each experiment with a knowledge sharing system, to meet needs coming from the bioinformatics field. Figure 2 shows the relationship of the five groups in the BioGrid project. As shown in this figure, the Data Grid, Computing Grid, HTC, and Telescience groups have been developing application integration technologies with the support of the Core Grid group. As its application integration technology, the HTC group provides a new Grid system that can dynamically manage job/task execution according to the user's strategy, together with a GUI tool on which users can specify the strategy for job/task execution, parameter modification, and simulation method modification. Currently, the HTC group applies our Grid system to a protein structure prediction system called "ROKKY" [12].
Figure 2. Relationship of BioGrid Project: application integration technology (Grid portal) provided by the HTC group (high throughput computing, protein structure prediction), the Computing Grid group (biosimulation platform united on the Grid architecture), the Data Grid group (heterogeneous database federation) and the Telescience group (online analysis platform for observation data), all built on the Core Grid BioGrid infrastructure (secure IPv6-based Grid, Globus, GridFTP, GSI-SFS secure file sharing).
The HTC group is run mainly by Kobe University1 and the Nara Institute of Science and Technology2.
3. Protein Structure Prediction
Here, we briefly describe protein structure prediction and a protein structure prediction system called "ROKKY," originally designed at Kobe University. We then mention a Grid-oriented modification of ROKKY to overcome the problems of protein structure prediction.
3.1. Overview
Proteins are bio-molecular machines that are responsible for virtually all cellular functions. A protein's function is often closely related to its three-dimensional structure, and thus knowledge of the three-dimensional structure is the key step towards understanding cellular biology at the molecular level. Since Anfinsen's experiment, it has long been believed that the three-dimensional structure of a protein is at its free energy minimum and is encoded by its amino acid sequence information; thus, in principle, the structure can be predicted from the sequence through minimization of its free energy. For over 40 years, however, a solution to this protein structure prediction problem was extremely difficult to reach. In the 1990s and later the situation changed drastically, because of the increase in experimentally determined structures and the improvement of computer facilities and algorithms. Now, many people believe that the structure prediction problem can really be solved. Currently, the strategies for structure prediction can be classified into two, depending on the type of the target sequence. One is the case where strong or weak homology can be detected between the target sequence and a sequence in the protein structure database (PDB), whose structure can be used as a template; in this case, we call it a target "with a template protein."
1 http://theory.chem.sci.kobe-u.ac.jp/. 2 http://inet-lab.naist.jp/.
Otherwise, no reliable template is detected, and we call it a target "without a template protein." The "Comparative Modeling (CM)" method is useful for targets that have significant sequence similarity with known structures, i.e., templates. The "Fold Recognition (FR)" method can detect templates that are evolutionarily much more distant from the targets. When the FR method cannot detect a reliable template, the protein structure of the target possibly has a "New Fold (NF)", for which de novo structure prediction by a pseudo-folding simulation is required [2]. Here, we emphasize that de novo prediction is the most difficult part and relies very much on computer-intensive pseudo-folding simulation.
3.2. Protein Structure Prediction Server "ROKKY"
ROKKY is a web-based, fully automated server that can predict a structure given an amino acid sequence. The performance of ROKKY was benchmarked in the recent world-wide blind test CASP6 [7], where ROKKY was the second best prediction server among over 50 servers for targets without a template protein. The distinctive feature of ROKKY is that the system integrates standard bioinformatics tools/servers with a fragment assembly simulator called "SimFold" [4,6]. Although the ranking is relatively good, a fully automated prediction often fails for many targets without a template protein. Human intervention is sometimes useful and could resolve this problem. Thus, besides its own performance, ROKKY is designed to assist structure predictions by integrating useful information about targets given by users. ROKKY then provides automated prediction results.
3.3. Workflow of Protein Prediction
For all targets, ROKKY first performs sequence analysis with "PSI-BLAST" [1], which is one of the CM systems, using a non-redundant sequence database and the PDB. When a structural template with an E-value smaller than 0.001 is found, ROKKY uses the alignments and builds three-dimensional models by inserting loop structures in the alignment gaps. Otherwise, ROKKY submits the target sequence to "3D-Jury" [5], which is an FR system, to seek available structural templates with weak homology. When a 3D-Jury score higher than 30.0 is found, ROKKY uses the templates and builds model structures after loop insertion in the alignment gaps. For the remaining targets that have no template structures, ROKKY performs Fragment Assembly Simulated Annealing (FASA) with SimFold, which is the most promising NF system [3]. This dispatch logic is summarized in the sketch below.
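The sketch below restates the decision procedure just described. The thresholds (PSI-BLAST E-value 0.001, 3D-Jury score 30.0) are taken from the text, while all function names and return values are hypothetical wrappers, not ROKKY's actual interfaces.

```python
def predict_structure(sequence: str) -> str:
    """Sketch of ROKKY's dispatch between the CM, FR and NF prediction routes."""
    e_value, template = psi_blast(sequence)                 # hypothetical wrapper
    if e_value < 0.001:
        return comparative_model(sequence, template)        # CM: template + loop insertion

    score, template = three_d_jury(sequence)                # hypothetical wrapper
    if score >= 30.0:
        return fold_recognition_model(sequence, template)   # FR: weak-homology template

    # NF: no reliable template, run fragment assembly simulated annealing
    domains = parse_domains(sequence)
    models = [fasa_simfold(domain) for domain in domains]
    return select_model(cluster(models))
```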
3.4. Requirement of Human Intervention
De novo prediction is intrinsically difficult. Fully automated servers, no matter how good they are in an assessment ranking, often fail in prediction for many non-obvious reasons. Sometimes a visual inspection of early-stage results by experts can find the reason for the failure and a way to overcome it. Here we show a couple of examples. For some FR targets, an expert knows that FASA can predict a more appropriate structure than template-based modeling does, because of low-resolution templates, large gaps, and poor sequence alignments. The other example concerns multi-domain targets. We have prepared a domain-parsing procedure in ROKKY. This kind of problem is due not only to the quality of the domain boundary prediction, but also to a unique single domain that does not fold alone. Thus, re-parsing the domain boundaries of such a target, or running FASA for a multi-domain target as a single domain, is needed. In the above cases, a human needs to intervene in the job execution of ROKKY. As a result, the more informative results of ROKKY help future structure predictions, and these results become a good training set for learning the knowledge of experts.
4. Requirements of Applying Grid Technology
Here, we explain the requirements for applying the protein structure prediction system to high throughput computing.
4.1. Requirements for System Architecture
A protein structure prediction problem is to search quickly for the target structure of a protein in a very large structure space. Since the calculations of a protein structure prediction can generally be parallelized, a high throughput computing system is suitable for the protein structure prediction problem. In high throughput computing systems, there are two kinds of Grid tools: one is a batch queue management tool such as "Condor" [9] or "OpenPBS" [11]; the other uses the temporarily idle resources of computers, of which "SETI@home" [13] and "Folding@home" [10] are examples. In this paper, we deal with the former kind of high throughput computing system. Some differences exist among Grid tools in terms of their capability for heterogeneous environments, extensibility with additional computational nodes, checkpointing facilities, process migration, and MPI libraries. Moreover, the operation of job submission also varies with the Grid tool: for example, Condor provides the "condor_submit" command for job submission, while OpenPBS uses "qsub." Therefore, to realize a high throughput computing system where users can easily manage their own jobs/tasks, we should provide a unified interface that accommodates the differences among Grid tools; a minimal sketch of such a wrapper is given at the end of this section.
4.2. Requirements for User's Operation
At present, since no complete protein structure prediction method is established and many protein structure prediction methods exist, researchers/users in bioinformatics may use these prediction methods in combination and predict structures by trial-and-error ("cut-and-try") methods. Researchers/users need to change frequently the input parameters of a protein structure prediction method, or even the prediction method itself. However, users still define the execution order of methods and input parameters in the form of a batch script, using a scripting language such as Perl or shell script. Moreover, users cannot easily change the execution order of methods, because the description of a batch script tends to be complicated. The execution of methods by a batch script is not suitable for protein structure prediction, because each prediction job requires an enormous amount of time to obtain a complete result. To solve this problem, an interactive application interface is required on which users can inspect the results of an arbitrary stage and dynamically modify the input parameters or even the prediction methods.
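One way to read the unified-interface requirement above is sketched below: a thin wrapper that hides whether Condor or OpenPBS sits underneath. The wrapper itself is hypothetical; only the underlying commands (condor_submit, qsub) are taken from the text.

```python
import subprocess

def submit_job(script_path: str, backend: str) -> None:
    """Submit a job script through whichever batch system is available.

    This wrapper is illustrative; a real deployment would also translate
    resource requirements into each system's own job description format.
    """
    if backend == "condor":
        subprocess.run(["condor_submit", script_path], check=True)
    elif backend == "openpbs":
        subprocess.run(["qsub", script_path], check=True)
    else:
        raise ValueError("unknown batch backend: %s" % backend)

# usage with a hypothetical job description file
submit_job("fasa_run.sub", backend="condor")
```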
Figure 3. Components of High Throughput Computing System: a workflow management part (workflow definition tool and workflow management server, exchanging workflow definitions, workflow submissions, execution status and reports of workflow execution) and a workflow execution part (meta job dispatcher and computation nodes/clusters).
5. High Throughput Computing System for a Protein Structure Prediction
As mentioned above, we consider that an interactive application interface between users and the high throughput computing system is required. We have been designing and developing a new high throughput computing system in which users define a workflow for a protein structure prediction. Here, we describe our high throughput computing (HTC) system in detail.
5.1. Workflow Design and Control Tool
We have been implementing a workflow design and control tool on the HTC system so that users can easily define the execution order of several methods and the input parameters of each method. From now on, we call a protein structure prediction method a "work" and the execution order of methods a "workflow."
5.2. Differential Execution of Workflow
Our workflow design and control tool provides a differential execution mechanism for workflows. With this mechanism, users can freely modify a workflow and re-execute it from an arbitrary point. We consider that differential execution of a workflow is required in the following cases (a minimal sketch of the idea follows the list):
• Users find that the definition of a workflow that is being executed is wrong and want to change it dynamically.
• Users want to change a work.
• Users want to change the input parameters of a work.
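The sketch below illustrates what "generating the difference of a workflow" could look like. The representation of a workflow as an ordered list of (work, parameters) pairs, and the rule of re-running everything from the first changed step, are assumptions for illustration, not the system's actual mechanism.

```python
def workflow_diff(old_steps, new_steps):
    """Return the suffix of the new workflow that must be (re-)executed.

    Each step is a (work_name, parameters) pair; everything from the first
    step that differs onwards is re-executed, earlier results are reused.
    """
    for i, (old, new) in enumerate(zip(old_steps, new_steps)):
        if old != new:
            return new_steps[i:]
    # no shared step changed: only newly appended steps remain to run
    return new_steps[len(old_steps):]

old = [("PSI-BLAST", {"db": "nr"}), ("FASA", {"trials": 100})]
new = [("PSI-BLAST", {"db": "nr"}), ("FASA", {"trials": 500})]
print(workflow_diff(old, new))   # [('FASA', {'trials': 500})]
```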
Figure 4. Workflow of Protein Structure Prediction: a web request is analysed with PSI-BLAST (E-value < 0.001 leads to loop insertion) or, failing that, 3D-Jury (score >= 30.0 leads to loop insertion); otherwise fragment generation and domain parsing (D1, D2, ...) are followed by FASA with n trials per domain, cluster analysis, model selection, domain docking and an e-mail reply.
5.3. System Architecture
Our HTC system consists of the workflow management unit and the workflow control unit (see Figure 3). Users define the execution order of the methods for the protein structure prediction in the workflow management unit. When a workflow is executed, the workflow design and control tool sends an execution command to the meta job dispatcher through the workflow management server. The meta job dispatcher then selects appropriate nodes for the workflow, and the jobs are executed on those nodes. When users modify the workflow, the difference of the workflow is generated in the workflow management server and sent to the meta job dispatcher, and the modified workflow is executed.
5.4. Applying to the Protein Structure Prediction System
Here, we describe how to apply the HTC system to the protein structure predictor "ROKKY."
5.4.1. Types of Work
Figure 4 shows the process of a protein structure prediction in ROKKY, which consists of the three methods (CM, FR, and NF) mentioned in Section 3. We use "PSI-BLAST" and "3D-Jury" for CM and FR, respectively. For NF, we use "fragment generation," "domain parsing," and "FASA with SimFold." Each job executed in any of the three methods is treated as a "work" in our HTC system. Since the characteristics of each work vary, our HTC system assigns computational nodes to each work as follows: for both a CM work and an FR work, a single node is dispatched; for an NF work, multiple available nodes are dispatched, since SimFold executes many tasks, each with different input parameters. The number of required nodes can be specified in a workflow. If the number of available nodes is less than the required one, our HTC system assigns nodes according to the strategy specified in the workflow.
5.4.2. GUI of Workflow Design and Control Tool
Our workflow design and control tool provides a GUI (graphical user interface) for users to easily define a workflow (see Figure 5). On this GUI, users can perform the following operations:
• defining/modifying a work
• defining/modifying a workflow
• specifying/modifying the input parameters of a work
• submitting a workflow
• terminating a workflow
• verifying the status of a workflow in execution
Figure 5. GUI of the Workflow Design and Control Tool.
Figure 6. Evaluation result of our HTC system: RMSD of the predicted structures plotted against time (h).
5.5. Evaluation
Here, we evaluate our HTC system. We use "Target T0198," which was submitted at CASP6, as the input sequence of a target structure, and compare the predicted structures obtained with and without our HTC system. The execution time is about 164 hours. The result is shown in Figure 6. RMSD in this figure represents the similarity between a predicted structure of a protein and the answer structure (a minimal sketch of this measure is given below); the lower the score, the more similar the protein structures. In this figure, A-1 to A-4 are the predicted structures without our HTC system, and B-1 to B-4 are the ones with our HTC system. In the period marked "C," a user modified the workflow through our HTC system. Thus, using our HTC system can improve the result of the protein structure prediction. We consider that a measurement method is needed to evaluate a Grid system quantitatively as well as qualitatively. In the near future, we will evaluate the usability of our Grid system and, based on that evaluation, improve it.
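For reference, RMSD can be computed as in the short sketch below, assuming the predicted and answer structures are given as matched, already superimposed coordinate arrays (the superposition step itself is omitted); the file names are hypothetical.

```python
import numpy as np

def rmsd(predicted: np.ndarray, answer: np.ndarray) -> float:
    """Root-mean-square deviation between two (N, 3) coordinate arrays."""
    diff = predicted - answer
    return float(np.sqrt((diff * diff).sum(axis=1).mean()))

# usage with hypothetical coordinate files (N atoms x 3 coordinates each)
pred = np.load("model_coords.npy")
ref = np.load("answer_coords.npy")
print("RMSD: %.2f A" % rmsd(pred, ref))
```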
6. Conclusions
The BioGrid project creates a huge virtual laboratory on a Grid connected by a high-speed network. We expect mutual cooperation among researchers who have high technical potential in the IT, biology, and medical science fields; the HTC group is an example of such cooperation between IT researchers and biological and medical scientists within the BioGrid project. Currently, most Grid middleware has extended its functionality; however, these scientists still cannot use Grid systems easily. The goal of the HTC group is to realize a Grid environment that users can use without any special knowledge or technique. In this paper, we have proposed a new Grid system with a GUI tool through which users can easily manage their own jobs/tasks, and we have shown an example of how our Grid system supports a protein structure prediction system through a GUI-based workflow design and control tool. We believe that our Grid system can be applied to other areas in the biological and medical sciences.
Acknowledgements
This research was performed through the IT-Program of the Ministry of Education, Culture, Sports, Science and Technology. This research was also partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Priority Areas A05-13, 2004.
References [1] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman: Gapped BLAST and PSIBLAST: A New Generation of Protein Database Search Programs, Nucleic Acids Research, 25 (1997), 3389–3402. [2] D. Baker and A. Sali: Protein structure Prediction and Structural Genomics, Science, 294 (2001), 93–96. [3] G. Chikenji, Y. Fujitsuka, and S. Takada: A Reversible Fragment Assembly Method for De Novo Protein Structure Prediction, Journal of Chemical Physics, 119 (2003), 6895–6903. [4] Y. Fujitsuka, S. Takada, Z.A. Luthey-Schulten, and P.G. Wolynes: Optimizing Physical Energy Functions for Protein Folding, Proteins: Structure, Function, and Bioinformatics,54 (2004), 88–103. [5] K. Ginalski, A. Elofsson, D. Fischer, and L. Rychlewski: 3d-Jury: A Simple Approach to Improve Protein Structure Predictions, Bioinformatics, 19 (2003), 1015-1018. [6] S. Takada: Protein Folding Simulation With Solvent-Induced Force Field: Folding Pathway Ensemble of Three-Helix-Bundle Proteins, Proteins: Structure, Function, and Genetics, 42 (2001), 85–98. [7] 6th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction, http://predictioncenter.llnl.gov/casp6/Casp6.html. [8] BioGrid Project Web Page, http://www.biogrid.jp/. [9] Condor Project Homepage, http://www.cs.wis.edu/condor/. [10] Folding@Home Distributed Computing, http://folding.stanford.edu/. [11] OpenPBS, http://www.openpbs.org/. [12] ROKKY Protein Structure Suite, http://www.proteinsilico.org/rokky/. [13] SETI@home: Search for Extraterrestrial Intelligence at home, http://setiathome.ssl.berkeley.edu/.
Grid-Based Onto-Technologies Provide an Effective Instrument for Biomedical Research Alexei JOUTCHKOV a, Nikolay TVERDOKHLEBOV b, Irina STRIZH c,1, Sergey ARNAUTOV b and Sergey GOLITSYN a a Telecommunication Centre “Science and Society”, Russia b Institute of Chemical Physics, Russia c Department of Biology, Moscow State University, Russia
Abstract. New experimental technologies have rapidly transformed biomedical research into a data-intensive discipline. Grafted onto a Grid environment, ontologies deliver an effective "onto-technology" for exploring a wide variety of links in heterogeneous distributed data sources, as well as for defining new facts and representing new relationships between data sets. Keywords. Grid, biomedical ontologies, data integration, knowledge management
Introduction
Full-scale practical use of Grid technology in biology and healthcare is still a challenge for the Grid community. We are gradually building a new information environment in which Grid technologies enable flexible and secure sharing of diverse resources, including computers, data and storage, across more than 30 biomedical institutions within the framework of the Federal Scientific Project "New Generation of Vaccines and Medical Diagnostic Systems" (FSP). Calculations for alignment, docking and molecular dynamics are already performed on the Grid [1]. Nowadays every biomedical study requires working with a multitude of heterogeneous data resources, which cover genomic, cellular, structural, phenotypic and other types of biologically relevant information, as well as with clinical reports that are extremely important for healthcare. We suppose that the next milestone in biomedical knowledge management will be the development of ontologies over the huge amount of distributed data sets that are accessible through the Grid infrastructure and services. These ontologies would be an effective tool for defining and representing, in a formal way, different sorts of relations between different sorts of data, and for assisting in the exploration of potential relationships between diverse data and facts.
1 Correspondence to: Irina Strizh, Department of Biology, Moscow State University, Leninskie Gory, Moscow, 119992, Russia; E-mail:
[email protected].
1. Ontology Design is Needed for Biomedicine
Nowadays, owing to advances in high-throughput technologies, almost every biomedical study brings us large sets of data, leading to the accumulation of huge amounts of new data in addition to the existing ones. The problem is that practically all biomedical data sources are distributed, heterogeneous and dynamic, and today we have no adequate tools for navigating these various information sources. It has become obvious that effective elicitation of implicit knowledge is necessary for the extraction of particular data from these sources. The technologies aimed at overcoming this problem by adding meaning (ontologies, annotations) were borrowed from the Semantic Grid. Information Technology adopted the term "ontology" from philosophy, where it means the study of being, and adapted it to the more technical definition "specification of a conceptualization" [2]. In essence, an ontology can be defined as a domain of knowledge, represented by facts and their logical connections, structured through formal rules so that it can be understood, i.e. interpreted and used, by computers [3]. In these terms, ontologies are becoming increasingly important in biomedical studies because they can be linked to the information in databases. Recently, ontologies have been claimed to be the solution to the data integration problem in molecular biology and bioinformatics [4]. Additionally, ontologies can articulate, and make generally accessible in a formal and structured way, the large amounts of hard-won biological knowledge normally stored in textbooks and research papers. Once converted into an ontology, this knowledge will also be linked to databases and can then be used for semantic search. Search through ontologies can be much faster and less subject to ambiguity than string search (a minimal sketch of such a concept-based search is given at the end of this section). Thus, the development of technologies that enable experts to work with distributed sources of information and to formalize their knowledge into "machine-interpretable" ontologies is essential for biomedicine. Creating ontology development tools as Grid services, and converting existing ones into Grid services, makes it possible to involve Grid computing and other Grid technologies in ontology design.
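The contrast between string search and search through an ontology can be illustrated with a small sketch: each record's annotation is expanded along is-a links so that a query for a general concept also finds records annotated with more specific ones. The toy hierarchy and record annotations below are invented purely for illustration.

```python
# toy is-a hierarchy: child concept -> parent concept
IS_A = {
    "HSP90": "heat shock protein",
    "heat shock protein": "chaperone",
    "chaperone": "protein",
}

def expand(concept: str) -> set:
    """Collect a concept together with all of its ancestors in the hierarchy."""
    concepts = {concept}
    while concept in IS_A:
        concept = IS_A[concept]
        concepts.add(concept)
    return concepts

def concept_search(query: str, records: dict) -> list:
    """Return records whose annotations denote the query concept or a more specific one."""
    return [rid for rid, annotations in records.items()
            if any(query in expand(a) for a in annotations)]

records = {"rec1": ["HSP90"], "rec2": ["kinase"], "rec3": ["chaperone"]}
print(concept_search("chaperone", records))   # ['rec1', 'rec3']
```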
2. Systems Biology as the Next Step in Biomedicine Requires Ontologies

Successful physiological analysis requires an understanding of the functional interactions between the key components of cells, organs and systems, as well as of how these interactions change in disease states [5]. Integration and inspection of genome databases alone will not get us very far in understanding healthcare problems. We can find the gene coding for a protein sequence and, probably, we will find its functions, but the interactions between proteins and other cell molecules and organelles that generate function are not represented in most databases. Several databases covering known protein-protein interactions [DIP], DNA-protein interactions [TRANSFAC], metabolic pathways [EcoCyc, KEGG], gene networks [GenNet] and the cell cycle [CycloNet] have been developed intensively in recent years. It is very important to collect and analyze this kind of data, which should help us understand how the whole organism works well enough to undertake truly predictive treatment (in the case of the human body), or engineering of morphological and chemical composition (in the case of single processes, cells and model organisms). Thus, a new scientific approach has recently arisen in biology: systems biology. Systems biology explores biological systems by systematically perturbing them (biologically, genetically, or chemically); monitoring the gene, protein, and informational pathway responses; integrating these data; and ultimately,
formulating mathematical models that describe the structure of the system and its response to individual perturbations [6]. The aim of systems biology is to achieve a more comprehensive view of the functional components of cells and of entire organisms, including their development, by predicting their properties from numerical data that arise from interaction analyses of many system elements. Ontologies satisfy all the requirements of the fundamental framework for systems biology proposed earlier [6]:
• conceptualization of the knowledge domain corresponds to the need to "define all of the components of the system" in systems biology;
• ontology improvement by the addition of new data links is akin to "systematically perturb and monitor components of the system";
• ontology-based prediction and data integration may be compared to "reconcile the experimentally observed responses with those predicted by the model";
• a refined ontology with new concepts and relations could serve as the desired "design and perform new perturbation experiments to distinguish between multiple or competing model hypotheses".
Thus, we propose to use ontologies as an effective mechanism for data integration, modeling and trait prediction in systems biology as well. The most important and characteristic feature of systems biology, and the one that makes the Grid attractive for it, is that its objects are represented by large sets of complicated, distributed data that must be analyzed systematically. For example, discovery approaches are providing the complete sequences of the 24 different human chromosomes [7] and of the 20 distinct mouse chromosomes. These sequences offer a number of powerful opportunities. New technologies and software now allow the gene locations and coding regions embedded in a sequenced genome to be identified, and the involvement of Grid technologies, which provide distributed computational analysis, undoubtedly accelerates this identification. Comparative analysis of these coding regions reveals a lexicon of motifs and functional domains that is essential for solving the protein-folding and structure/function problems. Grid resources could help to solve all these systems biology tasks effectively. Moreover, genomic sequence provides access to the adjacent regulatory sequences, a vital component for deciphering the regulatory code, and opens access to polymorphisms, some of which are responsible for differences in physiology and disease predisposition [6]. The immediate challenge of systems biology is to place all these known components in the context of their informational pathways and networks; ontologies in a Grid environment could help scientists address this task. At first sight it may seem strange, but for better healthcare we must collect and understand not only the processes occurring in the human body and its cells, but also those in plants. This becomes clearer when we recall that plant cell and tissue cultures are currently promising approaches for the production of new medicines. New technologies allow a multiplicity of plant cell metabolites to be obtained for medical purposes. The major challenge, however, remains the integration and understanding of the extensive information on the various levels of cellular and developmental processes in humans as well as in plants. Ontologies, capable of constructing, formalizing and interchanging expert knowledge, are among the most reasonable ways to capture knowledge within distributed sources of information.
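Returning to the distributed analysis mentioned above, the sketch below shows, in the simplest possible form, how a large sequence-comparison workload could be split into independent chunks to be submitted as separate Grid jobs. The sequence identifiers, chunk size and job-description format are invented, and no actual Grid middleware is invoked.

```python
# Minimal sketch: partition a sequence-analysis workload into independent chunks
# that could be submitted as separate Grid jobs. Sequences, chunk size and the
# job-description format are invented; no Grid middleware is actually called.

SEQUENCES = [f"seq_{i:04d}" for i in range(10)]   # placeholder sequence identifiers
CHUNK_SIZE = 4

def make_jobs(sequences, chunk_size):
    jobs = []
    for start in range(0, len(sequences), chunk_size):
        chunk = sequences[start:start + chunk_size]
        jobs.append({
            "job_id": len(jobs),
            "task": "motif_search",          # e.g. a BLAST or motif-search run
            "inputs": chunk,
        })
    return jobs

for job in make_jobs(SEQUENCES, CHUNK_SIZE):
    print(job["job_id"], len(job["inputs"]), "sequences")
```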
Figure 1. A simple hierarchical ontology (left panel).
3. Creation of Immuno-Ontologies for the FSP Virtual Organization

An ontology is intended to be explicit and complete. On the other hand, experience from the long history of classification systems in the biomedical domain shows that manually created classifications are inflexible and hard to manage when they become large, detailed and multi-axial [8]. A key problem originates in the fact that many terms are combinatorial cross products of orthogonal terms. Thus, several FSP experts set out to build domain- and task-oriented ontologies comprising no more than a couple of hundred concepts each. Since the FSP virtual organizations include scientists with specialized interests from various organizations, a number of distinct ontologies have been collected. The main topics of the constructed ontologies are 'Immunomodulators', 'Antigens' and 'Immunity factors'. In spite of their relatively small content (for comparison, the Gene Ontology involves more than 16000 concepts), these ontologies nonetheless represent formal conceptualizations of particular domains of knowledge. The problem that has arisen is that independent scientists represented their personal views of the related field and thus constructed their ontologies by different rules. As a result, in some ontologies the relations between concepts (represented by controlled vocabulary terms) follow a typical hierarchical structure (Figure 1), whereas in others the relations are represented by directed acyclic graphs (DAGs). The ontology of immunity factors has a more complicated structure: the relations between concepts characterize the processes of cell differentiation and interaction, so undirected relations (e.g. modulates, attached to) were used in its design. As a result, cyclic paths arise in the graph (Figure 2).
Figure 2. An ontology as a cyclic graph.
This kind of schema is quite typical of researchers who represent their ideas by sketching a diagram; however, great care must be taken to avoid confusion and ambiguity when a "computer-understandable" interpretation of the ontology is required. To avoid such problems, a definite methodology and special ontology-development tools must be used. To make a machine-interpretable specification, the expert's conceptualization should be represented in a knowledge representation language. Developing Grid services that enable experts to build their ontologies in a standardized way and convert them into formal descriptions should be one of the tasks for the Grid community.
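One concrete check that such tooling needs to perform before an expert's sketch can be exported as a strict hierarchy or DAG is whether the directed relations contain cycles, as happens in the immunity-factors ontology described above. A minimal, illustrative version of that check is given below; it is not the FSP implementation, and the concept names are invented.

```python
# Minimal sketch: detect cycles among the *directed* relations of a concept graph
# before attempting to export it as a strict hierarchy or DAG.
# Concepts and relations are illustrative, not taken from the FSP ontologies.

def find_cycle(edges):
    """edges: dict mapping a concept to the concepts it points to (directed).
    Returns one cyclic path as a list, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in edges}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in edges.get(node, ()):
            if colour.get(nxt, WHITE) == GREY:          # back edge -> cycle found
                return path[path.index(nxt):] + [nxt]
            if colour.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(edges):
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

relations = {
    "cytokine A": ["cell type B"],
    "cell type B": ["cytokine C"],
    "cytokine C": ["cytokine A"],   # feedback: modulates its own inducer
}
print(find_cycle(relations))  # ['cytokine A', 'cell type B', 'cytokine C', 'cytokine A']
```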
4. Grid for Developing a New Medicine from an Unknown Peptide

Several problems must be solved to obtain even a single new medicine, and several different strategies are currently used for developing one. Within FSP, one complicated but very promising approach is in progress. It is founded on the experimental identification of substances that can modulate the immune response. Peptides, for instance, are known as molecules with a multitude of functions, including immunomodulatory activity. High technologies exist today for the chemical synthesis of peptides, but drug design cannot always predict their chemical structure and physiological function. It is therefore necessary to find peptides with immunomodulatory functions directly in cells. This work includes several steps, from peptide extraction, through understanding of their origin, to analysis of their physiological functions (Figure 3). At the first step, researchers obtain sets of sequence data, which need to be compared with known expression motifs.
Figure 3. Steps in developing a new medicine based on peptide analysis.
Thus, BLAST and motif searches must be run against the established sequences. Moreover, information on gene function is primarily contained in the articles indexed in the Medline database, so it is necessary to use information extraction methodology as proposed, for example, in [9]. To predict the physiological function of the peptides, their peptide-protein interactions must also be analyzed. The final step, which includes clinical tests, also provides multiple data that must be analyzed carefully. It is obvious that huge data sets must be examined to solve all these tasks, and the involvement of Grid technologies can accelerate the data analysis and save considerable time. We are constructing sets of Grid services that should help scientists to solve the problems mentioned above. For navigation in the data space we used several ontologies created by FSP experts as well as the Gene Ontology (GO), today the most important, well-designed and most nearly complete ontology in molecular biology [10]. GO has started to produce a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in different species. However, it can sometimes be difficult for a user to browse and find the exact category or term among more than 16000 GO concepts. Therefore, we have developed a new service in the FSP Grid environment that gives users the ability to extract part of the GO tree: users operate only with those concepts (and their relations) that are needed for the present study. As a result, the proposed prototype of Grid services should not only accelerate but also improve the design of peptide medicines in FSP.
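A minimal sketch of the kind of extraction such a service performs is shown below: starting from one chosen term, it collects only the descendant concepts and the relations among them, so that the user works with a small slice of the vocabulary. The term names and child links are invented and are not real GO content; this is not the FSP service itself.

```python
# Minimal sketch: extract the sub-tree of an ontology rooted at one term,
# so a user works only with the concepts needed for the current study.
# Term names and child links are illustrative, not real GO content.

CHILDREN = {
    "immune response": ["innate immune response", "adaptive immune response"],
    "adaptive immune response": ["T cell activation"],
    "metabolic process": ["lipid metabolic process"],
}

def extract_subtree(root):
    """Return the set of terms and the (parent, child) edges under root."""
    terms, edges, stack = {root}, [], [root]
    while stack:
        term = stack.pop()
        for child in CHILDREN.get(term, []):
            edges.append((term, child))
            if child not in terms:
                terms.add(child)
                stack.append(child)
    return terms, edges

terms, edges = extract_subtree("immune response")
# Only the immune-response branch is kept; the metabolic branch is ignored.
print(sorted(terms))
```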
5. From Ontology-Based Plant Cell Metabolism Modeling to New Medicines

Another strategy for developing new medicines involves new technologies that provide the ability to obtain a given substance with known functions. For instance, there is a multitude of secondary metabolites of various plants whose pharmacological activity has been established; extracts of several plants are well known as treatments for cancer, neuralgic, rheumatic, vascular, gynecological and other diseases. We must therefore gain the ability to design biotechnologically important cell cultures in a predictable way. To do this, it is necessary to gain a comprehensive knowledge of how plant cells function: we should develop our understanding of plant cell metabolism to such an extent that we can precisely predict by modeling how a plant cell's chemical composition will respond to any given genetic or environmental perturbation. Predictive metabolic engineering requires building a metabolic network with information about several pathways, including pathways of secondary metabolism [11]. Obviously, the analysis of cellular systems and networks requires extensive use of new biochemical, genetic and biotechnological methods as well as of bioinformatics resources for data management, mining, modeling and many other tasks. The general systems analysis process can be divided into four stages of system understanding, distinguishing the identification of system structure, the analysis of its behavior, the development of new system control, and design strategies [12]. It is an iterative process of analyzing and modulating cellular properties to continuously improve our models. An important question is whether the current bioinformatics infrastructure is sufficiently prepared for the new requirements of systems-based approaches in biology and medicine, because the data sets in these fields are of unprecedented complexity and diversity. Even for experts skilled in bioinformatics, organizing large data sets from different on-line sources is currently not easy, largely due to the lack of standards for common data formats and the limited interoperability of public databases. We propose to use the Grid and ontologies to manage such multidimensional data sets: genome-wide studies of protein-protein interactions, subcellular protein and/or metabolite localizations, single nucleotide polymorphisms, gene knockout effects, compound libraries, and the associated phenotype data of chemical genetics screens. We are currently working on the first step of this job: an ontology of reactive oxygen species (ROS) signaling pathways in plants is under development.
6. Ontology Design to Understand and Predict ROS Signaling and Redox Networks

The significance of ROS generation in plant cell development, redox signaling and stress tolerance has been put forward in many recent reviews. The wide range of plant responses triggered by hydrogen peroxide, by the hydroxyl radical and even by singlet oxygen is well known today [13]. ROS also influence a variety of metabolic reactions and are thus supposed to be involved in the biosynthesis of a variety of plant metabolites that could be used as powerful medicines. Redox regulation of genes in the plant cell occurs at multiple expression levels, which suggests the existence of complex signaling networks. Even when ROS induce destructive processes, this may also initiate signal transduction if protein damage leads to the production and accumulation of peptides that can mediate signals in the cell. Therefore, there is a huge and rather distributed amount of recent
data, ideas and publications, as well as older ones, concerning ROS production and its significance in plants. Moreover, to operate with knowledge of the variety of ROS-generating, ROS-processing, ROS-scavenging and ROS-induced mechanisms in cells, we require sequence comparison analyses to link one molecular process to another. In spite of many reports, the actual state of knowledge in ROS signaling is a collection of single observations that have not yet provided a complete picture of cellular redox signaling. The integration of these data into a coherent model will be one of the great challenges of the future. We propose that building and using ontologies is one of the most reasonable ways to capture knowledge concerning ROS signaling held within distributed and heterogeneous sources of information. Ontology-based knowledge representation for ROS processing in plant cells should help not only to gather multiple data from different sources, but also to provide a powerful tool for predicting and modeling various processes in plants. Involvement of Grid computing should improve the building of ontologies owing to faster and more complete linking of diverse data. Ontologies such as the Gene Ontology (GO), the Plant Ontology (PO) and the Trait Ontology (TO) serve as excellent examples of the benefit of ontologies in plant science, and they are useful in several kinds of experiment. Unfortunately, GO has several imperfections, and its main use is as a controlled vocabulary for the conceptual annotation of gene product function, process and location in databases. TO likewise provides a controlled vocabulary to describe each trait as a distinguishable feature, characteristic, quality or phenotypic feature of a developing or mature individual. PO is a vocabulary of plant morphology and anatomy representing tissue and cell types as well as growth and developmental stages in various plants, but it is available only for six Gramene plants (rice, maize, sorghum, wheat, oat and barley). So the development of ontologies that represent a conceptualization of a community's knowledge of a domain, particularly dealing with ROS processing and redox signaling in the plant cell, is still highly desirable. It is relatively easy to design an ontology based on concrete facts with well-built relations, such as taxonomy definitions; it is more difficult to design an ontology based on knowledge that is incomplete, poorly understood or lacking unanimous support. In GO, PO and most other bio-ontologies, concepts are represented by controlled vocabulary terms, and it is very important to use terms that are internationally acceptable and understandable. Thus we have started to develop the "ROS signaling ontology" by choosing and gathering the concepts and subsequently finding and writing out a unique and explicit definition for each of them. Most genes and their products have already been described in the GO vocabulary. Software we have developed lets us extract a number of concepts, their definitions and their relations from GO; this speeds up the collection of concepts several times. The next step was to create relations between the concepts. This process is usually carried out by experts in the given field. Nevertheless, it could be automated at least partially if scientists could find and formulate the rules that define these relationships.
The most important aspect of the assertions and rules that define an ontology is that they can be used to make logical inferences about the terms and their associated properties. One of the controversial questions is how these relationships should be represented. Sometimes in biology, particularly in the case of a tree-based taxonomy or an evolutionary tree, the relations can be represented as a simple hierarchical structure in which each term has a single parent. In GO and PO the relationships between concepts are represented via DAGs. A DAG is similar to a hierarchical structure but is superior
because terms can have more than one parent. DAGs undoubtedly represent biological relationships better than typical hierarchical structures do, and we too began to represent relations between concepts via a DAG. Nevertheless, it soon became obvious that for complete knowledge representation of signaling pathways in the cell we must sometimes choose an undirected rule for relation creation. This makes the whole ontology more complicated and 'less logical' at first sight, but it is essential because several biochemical and molecular reactions are regulated by a 'feedback' mechanism: the product of the reaction modulates the activity of the initial substances. We propose that ontologies of a biological knowledge domain, especially those concerning metabolite and signaling pathways, cannot be restricted to 'directed' rules alone. We have to develop ontologies as flexible structures that represent the full complexity of natural cell and organism processes. Though it will be rather difficult to analyze and work with such a complicated knowledge representation, if we want to understand the whole of life we must do it. The development and use of new computing approaches become crucial when data are to be interpreted at the systems level. Grid technologies enable the construction of, and operation with, complex ontologies. The Grid services we have developed for ontology construction let us extract, interrelate and accumulate data and knowledge regarding ROS formation, scavenging and signaling in various plant compounds, cells and species, as well as compare ROS-signaling mechanisms in plant cells with similar mechanisms in mammalian cells and bacteria. Of course, this prototype ontology, like any other model, may have deficiencies. No doubt, improvement of this ontology will be the next step towards a comprehensive understanding of ROS signaling networks and their involvement in metabolite biosynthesis in plants.
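A minimal sketch of such a 'mixed' representation is shown below: directed relations (e.g. is_a, part_of) are kept separate from undirected or feedback relations (e.g. modulates), so hierarchical reasoning still works over the acyclic part while the feedback links remain available. The concepts and relations are invented and are not taken from the ROS signaling ontology.

```python
# Minimal sketch: keep directed (hierarchical) and undirected/feedback relations apart,
# so a term may have several parents via the DAG while cyclic 'modulates' links
# are still recorded. Concepts and relations are illustrative only.

DIRECTED = {        # child -> parents, via "is_a"/"part_of"
    "hydrogen peroxide": ["reactive oxygen species", "signaling molecule"],
    "reactive oxygen species": ["oxidant"],
}
UNDIRECTED = [      # symmetric / feedback relations, e.g. "modulates"
    ("hydrogen peroxide", "modulates", "catalase activity"),
    ("catalase activity", "modulates", "hydrogen peroxide"),
]

def ancestors(term):
    """All terms reachable along directed relations (multiple parents allowed)."""
    seen, stack = set(), [term]
    while stack:
        for parent in DIRECTED.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(ancestors("hydrogen peroxide"))
# contains: 'reactive oxygen species', 'signaling molecule', 'oxidant'
print([r for r in UNDIRECTED if "hydrogen peroxide" in (r[0], r[2])])
```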
Conclusion

One of the main current challenges in biomedical research is the unprecedented complexity and diversity of the data sets that must be analyzed jointly. Ontology-based Grid technologies provide a new multidimensional information environment and powerful tools for collaborative data handling and analysis, as well as for the cooperative construction of complex, formalized corporate knowledge. The combination of these abilities brings biomedical research to a new technological level.
References

[1] A. Joutchkov, et al. Local libraries of strategies in Grid resources management. In Proc. of the II International Conference "Parallel Computations and Control Problems" PACO '2004, Moscow, Oct. 4-6, 2004.
[2] T.R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 1993, 5(2): 199-220.
[3] J. Bard. Ontologies: formalizing biological knowledge for bioinformatics. BioEssays, 2003, 25: 501-506.
[4] J.B.L. Bard, S.Y. Rhee. Ontologies in biology: design, application and future challenges. Nature Reviews Genetics, 2004, 5: 213-222.
[5] D. Noble. Modeling the heart: from genes to cells to the whole organ. Science, 2002, 295: 1678-1682.
[6] T. Ideker, T. Galitsky, L. Hood. A new approach to decoding life: systems biology. Annual Review of Genomics and Human Genetics, 2001, 2: 343-372.
[7] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, et al. The sequence of the human genome. Science, 2001, 291: 1304-1351.
[8] A.L. Rector. Clinical terminology: Why is it so hard? Methods of Information in Medicine, 1999, 38: 239-252.
[9] C. Blaschke, A. Valencia. Automatic ontology construction from the literature. Genome Informatics, 2002, 13: 201-213.
[10] The Gene Ontology Consortium. Creating the Gene Ontology resource: design and implementation. Genome Research, 2001, 11: 1425-1433.
[11] L.J. Sweetlove, R.L. Last, A.R. Fernie. Predictive metabolic engineering: a goal for systems biology. Plant Physiology, 2003, 132: 420-425.
[12] H. Kitano. Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology. Current Genetics, 2002, 41: 1-10.
[13] C. Laloi, K. Apel, A. Danon. Reactive oxygen signaling: the latest news. Current Opinion in Plant Biology, 2004, 7: 323-328.
Ontology–Based Knowledge Repository Support for Healthgrids Alexander SMIRNOV, Mikhail PASHKIN, Nikolai CHILOV and Tatiana LEVASHOVA St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Abstract. Healthgrids unite a large number of independent and distributed organisations to provide various healthcare services. Often the involved organisations belong to different areas of healthcare and even to different countries; however, to achieve efficient operation they have to act in a well-coordinated manner. As a result, efficient knowledge sharing between the multiple participating parties of the healthgrid is required. The paper describes the application of the earlier developed ontology-driven KSNet (Knowledge Source Network) approach to knowledge repository support for healthgrids. This approach is based on the representation of knowledge via ontologies using the formalism of object-oriented constraint networks. Such a representation makes it possible to define and solve various tasks from the areas of management, planning, configuration, etc., by using constraint solving engines such as, for instance, ILOG or CLP. The major aspects discussed cover the formalism of knowledge representation via ontologies and the implementation of the approach as a decision support system for a case study from the area of health service logistics.

Keywords. Ontology management, object-oriented constraint networks, health service logistics
1. Introduction

Grid technology is based on the sharing, selection and aggregation of distributed resources according to their availability, capability, performance, cost and quality-of-service requirements. Thereby, healthgrids can be considered as coalitions of loosely associated groups of participating parties (organisations, people, tools, etc.), each with its own level of commitment to the coalition in which it participates, each with its own agenda, and each engaged in a limited role within the coalition. Often the involved parties belong to different areas of healthcare and even to different countries; however, to achieve efficient operation they have to act in a well-coordinated manner. As a result, it can be seen that to manage such a coalition an efficient knowledge sharing between the participating parties is required. This knowledge must be pertinent, clear and correct, and it must be processed in a timely fashion and delivered to the appropriate locations. One of the scientific directions addressing the above issue is Knowledge Logistics (KL) [1]. It stands for the acquisition of the right knowledge from distributed sources, and its integration and transfer to the right person within the right context, at the right time, for the right purpose.
Figure 1. KSNet-approach: distributed multi-level knowledge logistics as a KS network configuration.
The approach to KL presented in this paper is called the "Knowledge Source Network" (KSNet) approach. It utilizes principles of artificial intelligence, implying the synergistic use of knowledge from different sources in order to complement insufficient knowledge and to obtain new knowledge. The approach is based on such advanced technologies as open services, intelligent agents, etc. Since KL assumes dealing with knowledge contained in distributed and heterogeneous sources, the approach is oriented towards an ontological model providing a common way of knowledge representation to support semantic interoperability. The rest of the paper describes the implementation of the KSNet-approach as a decision support system for coalition health service logistics operations. This choice was motivated by the fact that the topic presents numerous challenges in such different areas as resource management, logistics and others.

2. Ontology-Driven KSNet-Approach

2.1. Overview

The KL problem in the approach presented here is considered as the configuration of a network including end-users, knowledge resources, and a set of tools and methods for knowledge processing located in a network-centric environment. Such a network of loosely coupled sources will be referred to as a knowledge source network or "KSNet" (a detailed description of the approach can be found in [2,3]), and the approach is accordingly called the KSNet-approach. The approach is built upon constraint satisfaction/propagation technology for problem solving, since the application of constraint networks simplifies the formulation and interpretation of real-world problems that are usually presented as constraint satisfaction problems in such areas as management, planning, configuration, etc. (e.g., [4]). ILOG [5] has been selected as the constraint satisfaction/propagation technology for the approach. Figure 1 roughly explains the basic concepts of the KSNet-approach and multi-level knowledge source network configuration. The upper level represents a customer-oriented knowledge model based on a fusion of knowledge acquired from the network units that constitute the lower level and contain their own knowledge models. A detailed description of the approach can be found in [1].
Figure 2. Object-oriented constraint network paradigm.
2.2. Knowledge Representation Formalism

As a general model of ontology representation in the KSNet system, an object-oriented constraint network paradigm (Figure 2) was proposed [1]. This model defines the common ontology notation used in the system. According to this representation an ontology (A) is defined as A = (O, Q, D, C), where:
– O is a set of object classes ("classes"); each of the entities in a class is considered as an instance of the class;
– Q is a set of class attributes ("attributes");
– D is a set of attribute domains ("domains");
– C is a set of constraints.
For the chosen notation the following six types of constraints have been defined, C = CI ∪ CII ∪ CIII ∪ CIV ∪ CV ∪ CVI:
– CI = {cI}, cI = (o, q), o ∈ O, q ∈ Q: (class, attribute) relations;
– CII = {cII}, cII = (o, q, d), o ∈ O, q ∈ Q, d ∈ D: (class, attribute, domain) relations;
– CIII = {cIII}, cIII = ({o}, True ∨ False), |{o}| ≥ 2, o ∈ O: class compatibility (compatibility structural constraints);
– CIV = {cIV}, cIV = 〈o', o'', type〉, o' ∈ O, o'' ∈ O, o' ≠ o'': hierarchical relationships (hierarchical structural constraints), with "is a" defining a class taxonomy (type = 0) and "has part"/"part of" defining a class hierarchy (type = 1);
– CV = {cV}, cV = ({o}), |{o}| ≥ 2, o ∈ O: associative relationships ("one-level" structural constraints);
– CVI = {cVI}, cVI = f({o}, {o, q}) = True ∨ False, |{o}| ≥ 0, |{q}| ≥ 0, o ∈ O, q ∈ Q: functional constraints referring to the names of classes and attributes.
In order to process information contained in heterogeneous knowledge sources, a mechanism supporting the import of source knowledge representations is provided. At present, import from the OWL representation language into the internal representation is available. A summary of the possibility of converting knowledge elements from OWL into the internal format of the KSNet system is presented in Table 1.
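The sketch below shows one possible way to encode this notation as simple data structures. It is illustrative only and is not the internal format of the KSNet system; the class, attribute and constraint instances are invented.

```python
# Illustrative encoding of the A = (O, Q, D, C) notation; not the KSNet internal format.
from dataclasses import dataclass, field

@dataclass
class Ontology:
    classes: set = field(default_factory=set)        # O
    attributes: set = field(default_factory=set)     # Q
    domains: dict = field(default_factory=dict)      # D: domain name -> allowed values
    constraints: list = field(default_factory=list)  # C: (constraint type, payload) pairs

ao = Ontology()
ao.classes |= {"Hospital", "Supplier", "Route"}
ao.attributes |= {"capacity", "location", "deliveryTime"}
ao.domains["capacity"] = range(0, 1001)

# C_I  : class-attribute relation
ao.constraints.append(("C_I", ("Hospital", "capacity")))
# C_II : class-attribute-domain relation
ao.constraints.append(("C_II", ("Hospital", "capacity", "capacity")))
# C_IV : hierarchical relation, type 0 = "is a", type 1 = "part of"
ao.constraints.append(("C_IV", ("MobileHospital", "Hospital", 0)))
# C_VI : functional constraint over classes/attributes, here as a predicate
ao.constraints.append(("C_VI", lambda capacity, demand: capacity >= demand))

print(len(ao.constraints), "constraints defined")
```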
3. The Case Study

3.1. Coalition Health Service Logistics

Health service logistics support in healthgrids presents numerous challenges due to the variety of policies, procedures and practices of the members of the operations, e.g., differences in doctrine, logistics mobility, resource limitations, differing stockage levels, interoperability concerns, and competition between participants for common support.
Table 1. Possibility to convert knowledge elements from OWL into the notation of object-oriented constraint networks.

Element Groups | Elements from OWL
Elements supported by the notation of object-oriented constraint networks | Class, complementOf, DeprecatedClass, DeprecatedProperty, disjointWith, equivalentClass, equivalentProperty, maxCardinality, minCardinality, Nothing, Ontology, priorVersion, Restriction, Thing, versionInfo
Elements weakly supported by the notation of object-oriented constraint networks | allValuesFrom, AnnotationProperty, cardinality, DataRange, DatatypeProperty, FunctionalProperty, incompatibleWith, InverseFunctionalProperty, OntologyProperty, unionOf
Elements not currently supported by the notation of object-oriented constraint networks | hasValue, Imports, inverseOf, ObjectProperty, onProperty
Elements not currently supported by the notation of object-oriented constraint networks, and whose support requires additional research | AllDifferent, backwardCompatibleWith, differentFrom, distinctMembers, intersectionOf, oneOf, sameAs, someValuesFrom, SymmetricProperty, TransitiveProperty
In [6] six major principles of joint logistics activities applying to operations other than war are identified. These principles may apply equally to healthgrids when the latter are considered as coalitions. They include:
– Objective. There must be a clearly defined, decisive and attainable objective, and all the efforts of each coalition member have to be integrated into the total effort of achieving the strategic aims and culminating in the desired end state.
– Unity of effort. There must be close coordination of all the members, leading toward the main goal and every subgoal.
– Legitimacy. Legitimacy involves sustaining the people's willingness to accept the right of the leader to make and carry out decisions, so that their activities complement, rather than detract from, the legitimate authority of the leaders.
– Perseverance. In coalition operations strategic goals may be accomplished by long-term involvement, plans and programs. Short-duration operations may occur, but they have to be viewed in terms of their impact on the long-term strategic goals.
– Restraint. Coalitions put constraints on the potential actions that can be undertaken by the members to achieve their goals.
– Security. Security is a very important issue in coalition operations, especially those related to healthcare. The operation's leaders and members have to ensure that they include security measures.
Table 2. Ontologies used for building the "hospital configuration" application ontology.

Ontology | Format
Clin-Act (Clinical Activity), the library of ontologies [7] | KIF
Upper Cyc/HPKB IKB ontology with links to SENSUS, Version 1.4 [8] | Ontolingua (KIF)
Loom ontology browser, Information Sciences Institute, The University of Southern California [9] | Loom
North American Industry Classification System (NAICS) code, DAML Ontology Library [10] | DAML
The UNSPSC Code (Universal Standard Products and Services Classification Code), DAML Ontology Library, Stanford University [11] | DAML
Web-Onto [12] | OCML
Healthcare coalitions may have different missions; for example, they can be related to disaster relief, evacuation, humanitarian assistance and other operations. For this project, tasks of health service logistics concerning mobile hospital configuration and evacuation have been chosen.

3.2. Hospital Ontology Creation

The experimentation with the scenario below is intended to demonstrate how the developed KSNet-approach can be used to support coalition-based operations. The following request was considered: define suppliers, transportation routes and schedules for building a hospital of given capacity at a given location by a given time. An application ontology (AO) for this task was built on the basis of existing ontologies (Table 2), and the connection of the found sources was performed; after this the request can be processed. Being a context-dependent conceptual model that describes a real-world application domain depending on a specific user request and relevant to its particular domains and tasks, the AO plays a central role in request processing and also represents the joint knowledge of the user and the knowledge sources. An analysis of the built AO showed the necessity of finding and utilizing knowledge sources containing the following information/knowledge:
– hospital-related information (constraints on its structure, required quantities of components, required times of delivery);
– available suppliers (constraints on suppliers' capabilities, capacities, locations);
– available providers of transportation services (constraints on available types, routes, and time of delivery);
– geography and weather of the region (constraints on types, routes, and time of delivery, e.g. by air, by trucks, by off-road vehicles).
The created AO (Figure 3) can later be used to solve other tasks of the same nature in the region.
Figure 3. AO for hospital configuration problem.
Figure 4. Application ontology slices for different subtasks.
Example requests can be as follows:
– by what time can a hospital/camp… of given capacity be built at a given location?
– where is it better to build a hospital/camp…?
– find the best route to deliver something from point A to point B; etc.
As a result of the analysis of these problems a number of modules (subproblems) were defined. The subproblems are described by parts (slices) of the AO (Figure 4). The following notation is used in the figure: the bold label denotes the common application ontology, regular labels denote slices related to the subtasks, and italic labels denote example parameters common to two or more slices (subtasks). The slices are, in turn, passed to the adaptive knowledge fusion agent for solving. As examples, the following subproblems can be considered:
– Resource Allocation. This subproblem deals with finding the most efficient components for the hospital, considering such factors as component suppliers, their capacities, prices, transportation times and costs, and the decision maker's choices and priorities. It is solved using ILOG Configurator (an illustrative constraint-satisfaction sketch is given at the end of Section 3.3).
– Routing. This subproblem is devoted to finding a pareto-optimal set of routes for delivery of the hospital's components from the chosen suppliers, considering such factors as communications facilities (e.g., locations of airports, roads, etc.), their conditions (e.g., good, damaged or destroyed roads), weather conditions (e.g., rains, storms, etc.) and the decision maker's choices and priorities. This task was solved using ILOG Dispatcher, a tool specially designed for solving transportation-related tasks.

3.3. Context-Based Problem Solving

Selection of the subproblems from the AO is organized on the basis of context management technology. This deals with the identification of context relations, which enables contexts to be arranged in repositories for making repeatable and experience-based decisions. Context is defined as any information that can be used to characterize the situation of an entity, where an entity is a person, place or object that is considered relevant to the interaction between a user and an application, including the user and application themselves [13]. The main purpose of a context is to provide the operational decision support system with timely, accurate, directly usable and easily obtainable information from the dynamic environment. The approach is based on the following methodology.
Abstract context composition. An abstract context is formed automatically (or reused) by applying ontology slicing and merging techniques based on ontology relations. Its purpose is to collect and integrate knowledge relevant to the current subproblem into a context.
Operational context composition. Knowledge sources related to the abstract context provide the required information, instantiating the context. As a result, a concrete description of the subproblem is specified by data values.
Generation and presentation of results. Depending on the problem definition, an operational context either formalizes the problem or describes the current situation. The problem/current situation is solved by a solver as a constraint satisfaction problem, and the result is displayed to the decision maker. In the latter case, the current situation is presented in a human-readable and understandable form.
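As an illustration of how such a subproblem can be posed and solved as a constraint satisfaction problem, the sketch below selects suppliers for two hospital components under capacity and delivery-time constraints and minimises cost by exhaustive search. It is not ILOG Configurator, and all the data (components, suppliers, costs, delivery times) are invented.

```python
# Illustrative sketch only: supplier selection for hospital components posed as a
# small constraint satisfaction problem and solved by exhaustive search.
# The data (components, suppliers, costs, delivery days) are invented.
from itertools import product

COMPONENTS = ["beds", "generators"]
SUPPLIERS = {
    # supplier: (component, units available, cost per unit, delivery days)
    "S1": ("beds", 80, 120, 4),
    "S2": ("beds", 120, 150, 2),
    "S3": ("generators", 5, 3000, 6),
    "S4": ("generators", 4, 2800, 3),
}
DEMAND = {"beds": 100, "generators": 4}
DEADLINE_DAYS = 5

def feasible(assignment):
    """assignment: component -> supplier. Check capacity and delivery-time constraints."""
    for comp, sup in assignment.items():
        _, units, _, days = SUPPLIERS[sup]
        if units < DEMAND[comp] or days > DEADLINE_DAYS:
            return False
    return True

def cost(assignment):
    return sum(SUPPLIERS[sup][2] * DEMAND[comp] for comp, sup in assignment.items())

candidates = [
    dict(zip(COMPONENTS, choice))
    for choice in product(*[[s for s, v in SUPPLIERS.items() if v[0] == c] for c in COMPONENTS])
]
solutions = [a for a in candidates if feasible(a)]
best = min(solutions, key=cost)
print(best, cost(best))   # {'beds': 'S2', 'generators': 'S4'} 26200
```

A real configuration task would of course involve many more components, suppliers and constraint types, and a dedicated constraint solver rather than enumeration.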
4. Case Study for Future Work

The chance of survival for the wounded is currently more than twice what it used to be: 7.4 injured service members for every one killed in the 2004 Iraq war, against 3.2 in the 1991 Gulf war [14]. One possible way to increase these chances even further is to develop an efficient evacuation system. The problem of evacuation operation planning and management is considered as a problem for the future case study. This problem is very complex and includes tasks from such areas as logistics, diagnosis and others, and its solution will therefore require intensive usage of knowledge. The importance of intelligent systems in the area of management has been widely recognized recently, particularly in the context of decision making. The goal of this case study is to produce efficient plans for the treatment and evacuation of injured people based on information available in different sources. The proposed scenario is described below and represented in Figure 5. Information about the patient (location, injury, time of injury, current condition) and some specific data stored in his/her microchip (such as ID, and diseases such as diabetes or allergy) are transferred to the system.
Figure 5. Proposed scenario for evacuation operation planning & management.
Besides this information, the system acquires from available sources additional personal information (disease history, regular drug prescriptions, etc.) and the knowledge required for diagnostics, possible courses of treatment and drug prescriptions. The information from the sources is extracted in a particular context (patient location, type of injury, current condition, etc.) and may carry a certain probability. Local mobile ambulatory, hospital and evacuation facilities providers supply actual information about their facilities and their current and possible future capacities. Weather forecasts and the geography of the region are also required. Most of this information will also be uncertain and assigned a probability. Based on all this information, a probabilistic constraint satisfaction model will be built. The objective will be to reduce the probability of the patient's death by making a decision about his/her treatment and defining the appropriate schedule of the evacuation operation depending on the current condition of the patient. Possible decisions are:
– treat the patient at the site (e.g., in case of light injury);
– send a mobile ambulance to the site and treat the patient there (e.g., when the injury is not very dangerous and/or transportation to hospital is not possible due to weather conditions);
– help the patient at the site and then send a helicopter to transport him/her to the hospital (e.g., for further surgical intervention);
– send a helicopter to transport the patient to the hospital immediately (e.g., for urgent surgical intervention);
– send a helicopter with the required equipment and medical supplies to the mobile ambulatory for stabilization of the patient's condition at the site.
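As a rough illustration of the probabilistic choice described above, the toy sketch below scores each candidate decision by an assumed survival probability and discards decisions that are infeasible under the current weather or capacity constraints. All probabilities, decision names and constraints are invented; the planned model is a probabilistic constraint satisfaction problem rather than this simple ranking.

```python
# Toy sketch of choosing an evacuation decision by estimated survival probability.
# All probabilities, constraints and names are invented for illustration only.

DECISIONS = {
    # decision: (estimated survival probability, needs flight, needs hospital bed)
    "treat at site":                    (0.70, False, False),
    "mobile ambulance to site":         (0.80, False, False),
    "stabilize, then helicopter":       (0.90, True,  True),
    "helicopter immediately":           (0.93, True,  True),
    "helicopter with supplies to site": (0.85, True,  False),
}

situation = {"flight_possible": False, "hospital_beds_free": 2}

def feasible(decision):
    _, needs_flight, needs_bed = DECISIONS[decision]
    if needs_flight and not situation["flight_possible"]:
        return False
    if needs_bed and situation["hospital_beds_free"] == 0:
        return False
    return True

best = max((d for d in DECISIONS if feasible(d)), key=lambda d: DECISIONS[d][0])
print(best)   # 'mobile ambulance to site' when flights are grounded
```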
5. Conclusion

The paper describes the application of an earlier developed ontology-driven approach to knowledge repository support for healthgrids. The scalable architecture of the approach enables its extension with regard to the number of knowledge/information sources and, thereby, with regard to the factors taken into account during complex problem solving. Utilizing ontologies, and the compatibility of the employed ontology notation with modern standards (such as OWL), allows integration of the approach into existing processes and facilitates knowledge sharing with similar systems. The application of constraint networks allows rapid problem manipulation by adding, changing or removing its components (objects, constraints, etc.) and the usage of existing efficient technologies such as ILOG. In [15] the third (currently developing) generation of Grids is characterised by its holistic nature and is based on open standards and such technologies as intelligent agents and Web/Grid services. The trend of involving richer semantics in the Grid has led to the appearance of the so-called Semantic Grid. The approach presented here, based on the technologies of intelligent agents, open services (a detailed description can be found in [2]), ontology management and Semantic Web open standards (such as OWL), correlates tightly with the third generation of Grids. The authors believe that the approach and its implementation can contribute to the development of future generation healthgrids.
Acknowledgements Some parts of the research were done as parts of the partner project with CRDF sponsored by US ONR and US AFRL & EOARD, project # 2.44 of the research program “Mathematical Modelling and Intelligent Systems” & project # 1.9 of the research program “Fundamental Basics of Information Technologies and Computer Systems” of the Russian Academy of Sciences, and grant # 02-01-00284 of the Russian Foundation for Basic Research. Some prototypes were developed using software granted by ILOG Inc. (France).
References [1] A. Smirnov, M. Pashkin, N. Chilov, T. Levashova, and F. Haritatos. Knowledge Source Network Configuration Approach to Knowledge Logistics. Int. J. of General Systems. Taylor & Francis Group, 32(3): 251-269, 2003. [2] A. Smirnov, M. Pashkin, N. Chilov, T. Levashova, and A. Krizhanovsky. Knowledge Logistics as an Intelligent Service for Healthcare. Presentation at the Healthgrid 2004 Conference, January, 29-30, Clermont-Ferrand, France, 2004.
[3] A. Smirnov, M. Pashkin, N. Chilov and T. Levashova. Knowledge logistics in information grid environment. The special issue "Semantic Grid and Knowledge Grid: The Next-Generation Web" (H. Zhuge, ed.) of Int. J. on Future Generation Computer Systems, 20(1): 61-79, 2003.
[4] H. Baumgaertel. Distributed constraint processing for production logistics. IEEE Intelligent Systems, 15(1): 40-48, 2000.
[5] ILOG corporate Web-site, 2004. URL: http://www.ilog.com.
[6] JTTP for Health Service Logistic Support in Joint Operations, 1997. URL: http://www.dtic.mil/doctrine/jel/new_pubs/4_02_1.pdf.
[7] Clin-Act (Clinical Activity), The ON9.3 Library of Ontologies: Ontology Group of IP-CNR (a part of the Institute of Psychology of the Italian National Research Council (CNR)), 2000. URL: http://saussure.irmkant.rm.cnr.it/onto/.
[8] Hpkb-Upper-Level-Kernel-Latest: Upper Cyc/HPKB IKB Ontology with links to SENSUS, Version 1.4, 1998. Ontolingua Ontology Server. URL: http://www-ksl-svc.stanford.edu:5915.
[9] Loom ontology browser, Information Sciences Institute, The University of Southern California, 1997. URL: http://sevak.isi.edu:4676/loom/shuttle.html.
[10] North American Industry Classification System code, DAML Ontology Library, Stanford University, 2001. URL: http://opencyc.sourceforge.net/daml/naics.daml.
[11] The UNSPSC Code (Universal Standard Products and Services Classification Code), DAML Ontology Library, Stanford University, 2001. URL: http://www.ksl.stanford.edu/projects/DAML/UNSPSC.daml.
[12] WebOnto: Knowledge Media Institute (KMI), The Open University, UK, 2003. URL: http://eldora.open.ac.uk:3000/webonto.
[13] A. Dey, D. Salber, G. Abowd. A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Context-Aware Computing, A Special Triple Issue of Human-Computer Interaction (T.P. Moran, P. Dourish, eds.), Lawrence-Erlbaum, 16, 2001. http://www.cc.gatech.edu/fce/ctk/pubs/HCIJ16.pdf.
[14] USA Today, 2004. URL: http://www.usatoday.com.
[15] D. De Roure, M. Baker, N. Jennings, and N. Shadbolt. The Evolution of the Grid. Grid Computing: Making the Global Infrastructure a Reality (Berman, F., Fox, G. and Hey, A.J.G., Eds.), John Wiley and Sons Ltd., 65-100, 2003.
Part 2 Deployment of Grids in Health
Deployment of a Grid-Based Medical Imaging Application S. Roberto AMENDOLIA a, Florida ESTRELLA b, Chiara DEL FRATE c, Jose GALVEZ a, Wassem HASSAN b, Tamas HAUER a,b, David MANSET 1a,b, Richard McCLATCHEY b, Mohammed ODEH b, Dmitry ROGULIN a,b, Tony SOLOMONIDES b and Ruth WARREN d a TT Group, CERN, 1211 Geneva 23, Switzerland b CCCS Research Centre, Univ. of West of England, Frenchay, Bristol BS16 1QY, UK c Istituto di Radiologia, Università di Udine, Italy d Breast Care Unit, Addenbrooke's Hospital, Cambridge, UK
Abstract. The MammoGrid project has deployed its Service-Oriented Architecture (SOA)-based Grid application in a real environment comprising actual participating hospitals. The resulting setup is currently being used for rigorous in-house tests in a first phase, before it is handed over to the clinicians for their feedback. This paper elaborates the deployment details and the experience acquired during this phase of the project, and describes the strategy for migration to the forthcoming middleware from the EGEE project. The paper concludes by highlighting some potential areas of future work.

Keywords. Medical imaging, Grid application, deployment, service-oriented architecture
1. Introduction

The aim of the MammoGrid project was to deliver a set of evolutionary prototypes to demonstrate that 'mammogram analysts', specialist radiologists working in breast cancer screening, can use a Grid information infrastructure to resolve common image analysis problems. The design philosophy adopted in the MammoGrid project concentrated on the delivery of a set of services that addresses user requirements for distributed and collaborative mammogram analysis. In the course of the requirements analysis (see [1] and [2] for details) a hardware/software design study was also undertaken, together with a rigorous study of the Grid software available from other concurrent projects (now available in [3]). This resulted in the adoption of a lightweight Grid middleware solution, called AliEn (Alice Environment) [4], since the first OGSA-compliant Globus-based systems were yet to prove their applicability. Additionally, AliEn has since been selected as a major component of the gLite middleware for the EU-funded EGEE (Enabling Grids for E-sciencE) [5] project, as discussed in Section 4. In the deployment phase, the AliEn middleware has been installed and configured on a set of novel 'Gridboxes', secure hardware units which are meant to act as the
hospital's single point of entry onto the MammoGrid. These units have been configured and tested at all the sites, including CERN, Oxford, and the hospitals in Udine (Italy) and Cambridge (UK). While the MammoGrid project has been developing, new layers of Grid functionality have emerged, which has facilitated the incorporation of new stable versions of the Grid software (i.e. AliEn) in a manner that allowed a controlled system evolution and provided a rapidly available, lightweight but highly functional Grid architecture for MammoGrid. The MammoGrid project federates multiple databases as its data store and uses open source Grid solutions, in contrast to the US NDMA [6] project, which employs Grid technology on centralized data sets, and the UK eDiamond [7] project, which uses an IBM-supplied Grid solution to enable applications for image analysis. The approach adopted by the GPCALMA project [8] is similar to MammoGrid in the sense that it also uses AliEn as the Grid middleware, but in addition it uses High Energy Physics (HEP) software called PROOF for remote analysis. While GPCALMA focuses on hospitals at the national scale, MammoGrid federates mammography databases at the international scale. The structure of this paper is as follows. In Section 2 we describe the MammoGrid prototypes (i.e. P1 and P1.5) briefly, and Section 3 presents details of the deployment environment. A description of the migration strategy towards the new middleware provided by the EGEE project is presented in Section 4. Finally, in Section 5, conclusions are drawn with emphasis on potential areas for future work.
2. MammoGrid Architectural Prototypes

The development of the MammoGrid architecture has been carried out over a number of iterations within the system development life cycle of the project, following an evolutionary prototyping philosophy. The design and implementation were dictated by an approach which concentrated on the application of existing and emerging Grid middleware rather than on the development of new Grid middleware. Briefly, the work on the development of the MammoGrid system architecture was led from use-case and data model development while the system and the system requirements specification evolved.

2.1. MammoGrid Architecture: P1 Prototype

The MammoGrid prototype P1 is based on a service-oriented architecture (medical imaging services and Grid-aware services) and has been described in detail in [9] and [10]. It enables mammograms to be saved into files that are distributed across 'Grid-boxes' on which simple clinical queries can be executed. In P1 the mammogram images are transferred to the Grid-boxes in DICOM [11] format, where AliEn services can be invoked to manage the file catalogue and to deal with queries. Authenticated MammoGrid clients directly invoke a set of medical image (MI) services, which provide a generic framework for managing image and patient data. The digitized images are imported and stored in DICOM format. The MammoGrid P1 architecture includes a clinician workstation with a DICOM interface to the Web Services and the AliEn middleware network. There are two sets of services: one, Java-based, comprising the business logic related to the MammoGrid services, and the other, Perl-based, the AliEn-specific services for the Grid middleware. A list of these services is shown in Figure 3 in Section 3.1.
Figure 1. Prototype P1.5 Components.
As this architecture is Web Services-based, SOAP messages are exchanged between the different layers, and RPC calls are made from the Java-specific services to the AliEn-specific services. The main characteristic of P1 is that it is based on a Service Oriented Architecture (SOA), i.e. a collection of co-ordinated services. This prototype architecture is for a single 'virtual organisation' (VO), composed of the Cambridge, Udine and Oxford sites and a central server site at CERN. Each local site has a MammoGrid grid box. As mentioned earlier, the Grid hosting environment (GHE) is AliEn, which provides local and central middleware services. The local grid box contains the MammoGrid high-level services and the local AliEn services; the central grid box contains the AliEn central services. Further details of this prototype are available in [9,10].

2.2. MammoGrid Architecture: P1.5 Prototype

The MammoGrid P1.5 architecture, schematically shown in Figure 1, consists of an improved modular design of the MammoGrid software and enhanced query handling, coupled with the first release of the upcoming middleware called gLite [5]. Access to gLite is through the Grid Access Service (GAS), which is discussed in more detail in Section 4.2. In the P1.5 architecture, the databases are distributed and are outside of the gLite software. Consequently, the Grid middleware is mainly used for job execution and for managing files between sites (transfer and replication). There are three different kinds of databases in P1.5: the local database, the meta-data database and the grid database. The local database stores the local data comprising the local patients' associated data; the meta-data database stores the data model of the local database; and the grid database stores gLite/middleware-related information. Since the databases and associated metadata are distributed, queries are executed in a distributed environment. In this context, the new services included in P1.5 (see Figure 2) include a Query Manager (QM) service, which is used to manage queries locally. There is one QM per database per site, and it consists of two main components: the Local Query Handler, which is responsible for executing a query locally, and the Query Distributor, which is responsible for distributing queries to other sites in the case of global queries. Additionally there is a service called the QueryTranslator, which translates the client-defined query coming from the layers above (e.g. in XML format) into an SQL statement. Similarly, the ResultSet Translator translates the results of query execution at the local/global site into the desired XML format. The Data Access Service (DAS) provides access to the local database.
Figure 2. Implementation View of the P1.5 Prototype: Query Handling Context.
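The sketch below illustrates the query flow just described: a query manager first runs a query against its own local data and, for a global query, also forwards it to the query managers at the other sites and merges the result sets. It is a simplified illustration rather than the MammoGrid implementation; the site names and record fields are invented, and the SOAP transport, the XML-to-SQL translation and the DAS are abstracted away.

```python
# Simplified illustration of the P1.5 query flow: local execution plus optional
# distribution to other sites. Site names and data are invented; the real services
# communicate via SOAP and translate XML queries to SQL, which is omitted here.

class QueryManager:
    def __init__(self, site, records, peers=None):
        self.site = site
        self.records = records            # stands in for the local database + DAS
        self.peers = peers or []          # query managers at the other hospital sites

    def local_query(self, predicate):
        """Local Query Handler: run the query against the local data only."""
        return [dict(r, site=self.site) for r in self.records if predicate(r)]

    def query(self, predicate, scope="local"):
        """Query Distributor: optionally fan the query out to peer sites."""
        results = self.local_query(predicate)
        if scope == "global":
            for peer in self.peers:
                results.extend(peer.local_query(predicate))
        return results

udine     = QueryManager("Udine",     [{"patient": "U1", "age": 58}])
cambridge = QueryManager("Cambridge", [{"patient": "C1", "age": 63}], peers=[udine])

print(cambridge.query(lambda r: r["age"] > 60))                   # local only -> C1
print(cambridge.query(lambda r: r["age"] > 50, scope="global"))   # C1 and U1
```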
2.3. Comparison of P1 and P1.5

Since the prototype P1 was designed to demonstrate client- and middleware-related functionality, the design of the MammoGrid API was kept as simple as possible, using a handful of web-service definitions. In essence, the focus of the P1 architecture was centred on the idea of using the existing technologies "as is" and providing basic functionality. Since AliEn was the selected Grid middleware, the API design was largely dependent on the features provided by its design. For example, one of the major constraints was the fact that the database was tightly coupled with the file catalogue, and the design of AliEn also dictated a centralized database architecture. The MammoGrid requirements analysis process revealed a need for hospitals to be autonomous and to have 'ownership' of all local medical data. Hence, in the P1.5 architecture the database has been implemented outside of the Grid. The main reason is that having the database environment within the Grid structure made it difficult to federate databases to the local sites using the AliEn software; the only alternative was to change the design of the middleware, which is outside the scope of the MammoGrid project. The P1.5 design follows a distributed database architecture in which there is no centralization of data. All the metadata are federated and managed alongside the local medical data records, and queries are resolved at the hospitals where the local data are under curation. Another main feature of the P1.5 architecture is the provision of distributed queries. Since the data reside at each hospital site, queries are executed locally; moreover, the clinician can choose to execute a local or a global query. The P1 architecture is rather tightly coupled, and it was intended that the P1.5 architecture should be loosely coupled, as a good software design principle. Tight coupling is an undesirable feature for Grid-enabled software, which should be flexible and interoperable with other Grid service providers. Thus a key requirement of P1.5 (and future architectures) is to conform to loosely coupled design principles.
the data layer was loosely separated from the high-level system functionality so that the security requirements can be defined and maintained transparently. In the P1 architecture, an intermediate layer was introduced between the MammoGrid API and the Grid middleware. Its purpose was to provide an interface between the application layer and the Grid middleware layer, and it also served as a gateway for MammoGrid-specific calls. This layer was rather complex and its design was not fully optimised, to the extent that some of the core MammoGrid API calls were implemented inside it. In the P1.5 design these MammoGrid-specific calls have been migrated from this thick layer to the application layer, mainly to ease the programming of the MammoGrid API. However, this was done at the expense of performance, as it increases the number of SOAP calls from the MammoGrid API to the Grid-aware services. The above-mentioned factors have a significant impact on the performance of the P1.5 architecture. Since the database is outside the Grid, the overhead of the Grid layers is eliminated. Similarly, the distributed query model enhances performance because the query is first executed locally and only distributed to the other nodes in the Grid if required. Although the migration of the MammoGrid-specific calls costs some performance, its overall impact is outweighed by the other factors. Hence it is expected that in terms of performance the P1.5 design will be better than P1 (see Section 3.2).
3. Deployment and Testing
3.1. Physical Layout
The deployment of P1 is based on a Central Node (CN, at CERN), which holds all AliEn metadata and its file catalogue, and a set of Local Nodes (LNs, at Oxford, Cambridge and Udine) on which clinicians capture and examine mammograms using the clinicians' workstation interface. The services that should be running on the CN and on the LNs are listed in Figure 3; they include both the AliEn services and the MammoGrid services. The CN and LNs are connected through a Virtual Private Network (VPN). According to the AliEn/gLite architecture there is a set of centralised services that should be running on the CN; if the CN is also acting as an LN, then the services that are meant to run on an LN also run on the CN. A detailed discussion of this issue can be found in [12]. In order to perform a clean installation at the sites, a test environment was created at CERN to verify the installation steps. One important point to note is that each Grid Box had to be installed from scratch at the actual site, because the machine's IP address changes when it moves from the test environment to the actual site and all the host certificates then have to be re-issued. This re-issuing could only take place once the IP address had been allocated to the machine at the actual site.
3.2. Deployment Issues
As progress has been made in the definition of new Grid standards, AliEn has also undergone evolution in its design. Building a Grid infrastructure in such dynamic circumstances is consequently challenging. In the development and testing phase of
Figure 3. Deployment View of P1 Prototype: Physical Configuration.
MammoGrid, it was observed that most bugs arose from inconsistencies in the AliEn code. It should be noted that AliEn is built from several open-source components, most of which it downloads directly from the Internet during the compilation phase, and this can give rise to incompatibilities and inconsistencies between components. To deal with this situation a particular strategy was adopted, which differed between the P1 and P1.5 prototypes. For P1, a working and tested AliEn version was frozen for normal development work. For P1.5, in addition, newer AliEn versions were tracked and a move to a newer version was made once that version had been verified and tested. As mentioned in Section 2, the MammoGrid project has adopted an evolutionary prototyping approach, so by the end of the first year the Prototype P1 was ready and measures were taken for its deployment. In the meantime, work on the second prototype had started. By the time the MammoGrid P1 prototype was deployed in the actual hospital setting, the P1.5 architecture was well advanced and ready to be tested. The initial testing was done in a dedicated testing environment, but later it had to be repeated in the clinical environment. To ease migration, a strategy was adopted in which either environment could be invoked at any time with minimum effort: the required directories in the Linux file system, which are used during installation, were mounted and unmounted at the desired locations so as to switch between the two environments. During the deployment the actual MammoGrid VO was used for P1, while for P1.5 a Test VO was created and all development and testing for the new architecture was done under this Test VO. It should be noted that the deployment of P1 was AliEn-based whereas the current setup of P1.5 is gLite-based (see Section 4 for details). The MammoGrid project is keeping track of changes in the gLite code and is also providing feedback to the developers of the new middleware.
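Purely as an illustration of this switching mechanism, such a switch could be scripted along the following lines in Python; the directory paths and the use of bind mounts are assumptions, and the real MammoGrid procedure is not reproduced here.

    import subprocess

    INSTALL_ROOT = "/opt/mammogrid"                      # path expected by the running services (assumed)
    ENVIRONMENTS = {
        "test": "/srv/mammogrid-envs/test",              # P1.5 / Test VO installation tree (assumed)
        "clinical": "/srv/mammogrid-envs/clinical",      # P1 / production VO installation tree (assumed)
    }

    def switch_environment(name):
        """Unmount whatever is currently active and bind-mount the requested installation tree."""
        subprocess.run(["umount", INSTALL_ROOT], check=False)  # ignore failure if nothing is mounted
        subprocess.run(["mount", "--bind", ENVIRONMENTS[name], INSTALL_ROOT], check=True)

    if __name__ == "__main__":
        switch_environment("test")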
Enabling secure data exchange between hospitals distributed across networks is one of the major concerns of medical applications. The Grid addresses security issues by providing a common infrastructure, the Grid Security Infrastructure (GSI), for secure access and communication between grid-connected sites. This infrastructure includes authentication mechanisms supporting security across organizational boundaries. One solution for Grid box network security is a hardware-based device: a highly integrated VPN solution. VPNs can be established in either software or hardware. With a software VPN the entire load of encryption and decryption is borne by the Grid box, resulting in reduced performance. With a hardware VPN, a specialized device, the VPN router, performs all of the encryption, decryption and routing. The VPN router can use most connection types and meet the most demanding security requirements. The measured network throughput when this device is connected to the Grid box through a 100 Mbps WAN port is 11 MB/s, close to the theoretical maximum of about 12.5 MB/s and more than adequate for MammoGrid's needs. To preserve privacy, patients' personal data is partially encrypted in the P1 architecture of MammoGrid, thereby facilitating anonymization. As a second step, in P1.5, when data is transferred from one service to another across the network, communications are established through the secure protocol HTTPS, with encryption at a lower level. Each Grid box is made part of a VPN in which the allowed participants are identified by unique host certificates. Furthermore, access from one Grid box to another has been restricted to given IP addresses, and only the relevant communication ports (AliEn ports and remote administration ports) are opened; this has been implemented by configuring the router's internal firewall. Lastly, the P1 architecture is based on trusting the AliEn certification authority; the P1.5 architecture will improve on these security issues by making use of MammoGrid's own certification authority.
3.3. Testing Procedure
The deployment of the MammoGrid Information Infrastructure prototypes was a crucial milestone. This phase has so far helped greatly in gaining insight into the practicality of the medical problem at hand, and some previously unknown and unresolved issues have been exposed. This section describes the testing procedure, followed by some comments on results and performance. In order to test the deployment, a testing plan was designed in which all aspects of the testing environment were considered. The tests can be divided into two major parts. The first part deals with the testing of the middleware calls themselves; that is, the middleware deployed at all sites has been tested for consistency and performance. These tests included:
1) Replicating files between all the sites (i.e. the "mirror" command of the middleware).
2) Retrieving files between sites in both directions (i.e. the "get" command of the middleware).
3) Viewing files from a remote site at a local site (i.e. the "cat" command of the middleware).
4) Executing jobs on all the nodes (i.e. the "submit" command of the middleware).
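A small driver in the spirit of these tests is sketched below. The four operations are those listed above; the command-line client name ("alien"), its argument syntax, the site labels and the logical file name are all assumptions for illustration, not the commands actually scripted by the project.

    import subprocess
    import time

    SITES = ["cern", "oxford", "cambridge", "udine"]     # hypothetical site labels
    TEST_FILE = "/mammogrid/test/sample.dcm"             # hypothetical logical file name

    def timed(cmd):
        """Run one middleware command, returning (elapsed seconds, return code)."""
        start = time.time()
        completed = subprocess.run(cmd, capture_output=True)
        return time.time() - start, completed.returncode

    def run_consistency_suite():
        results = {}
        for site in SITES:
            results[(site, "mirror")] = timed(["alien", "mirror", TEST_FILE, site])   # replicate to site
            results[(site, "get")] = timed(["alien", "get", TEST_FILE])               # retrieve from site
        results[("any", "cat")] = timed(["alien", "cat", TEST_FILE])                  # view a remote file locally
        results[("any", "submit")] = timed(["alien", "submit", "smf-job.jdl"])        # execute a job
        return results

    if __name__ == "__main__":
        for key, (elapsed, rc) in run_consistency_suite().items():
            print(key, "%.1fs" % elapsed, "rc=%d" % rc)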
The second set of tests covered the MammoGrid portal's calls. These included loading DICOM files from the portal, retrieving files, and executing
complex queries and jobs. Job execution was tested both from the portal and from inside AliEn. The tests were conducted during the daytime (i.e. the actual operational hours) using standard DICOM files of approximately 8.5 MB. Clearly the transfer speed between the nodes plays an important role in overall performance. Furthermore, the Grid overhead was apparent when transferring data between the nodes, compared with the performance of direct transfer protocols. This was an expected outcome, because using the Grid inevitably incurs some overhead since several services are called in the complete workflow. This is the price to be paid for the benefits of the Grid, such as automatic distribution and scheduling of jobs, access to resources in various administrative domains and the collaborative pursuit of a collective goal. Job execution on each node was also performed. The jobs distributed through the Grid demonstrated its usefulness, as running the same jobs manually on different nodes was very time consuming. One important point noted during job execution is that jobs need to be distributed optimally: jobs were not assigned to the nodes where the data actually resided; instead, the selection of nodes was random. This aspect is expected to improve in the gLite middleware. It should be noted that job execution focused mainly on algorithms related to the Standard Mammogram Form (SMF) and the Computer Aided Detection (CADe) algorithms of MammoGrid. In the currently deployed architecture the query handling is centralized, because of the centralized database of the P1 prototype, whereas in P1.5 a distributed query execution strategy is adopted. In the current setup query handling has been tested for two cases: simple queries, and queries with conjunctions. The simple queries execute on single tables, whereas the conjunctions involve multiple joins across multiple tables. The P1.5 prototype is expected to improve the performance of the overall system, especially in all the calls where the database is involved. The Grid-related calls will also improve, but we are not expecting any drastic changes. In the next section some areas of future work are highlighted.
4. Future Work
4.1. Migration to EGEE Middleware
At the outset of the MammoGrid project there were no demonstrable working alternatives to the AliEn middleware, and it was selected as the basis of a rapidly produced MammoGrid prototype. Since then AliEn has steadily evolved (as have other existing middlewares) and the MammoGrid project has had to cope with these rapidly evolving implementations. Additionally, AliEn has been considered as the basis of middleware for numerous applications; for example, during the early analysis of the ARDA (A Realisation of Distributed Analysis) initiative at CERN [13], AliEn was found to be the most stable and reliable middleware for the production phases of the ALICE experiment. In the meantime, work has started in a new EU-funded project called EGEE, which is developing a new middleware called gLite. The architecture of gLite is based on experience from several existing middlewares, including AliEn. It is expected
that in gLite there will be several improvements in the underlying architecture; our recent experience has shown that the interface of this middleware is similar to AliEn's. If this design persists, it will be more or less seamless for projects like MammoGrid to become early adopters of the new middleware. In essence, the selection of AliEn as a core component of the EGEE project has put the MammoGrid project at an advantage; for example, the MammoGrid team already has expertise with the upcoming middleware, i.e. gLite.
4.2. gLite – New Features
According to the specifications of gLite [14], a single interface, the Grid Access Service (GAS), acts as the user entry point to a set of core services. When a user starts a Grid session, he/she establishes a connection with an instance of the GAS created by the GAS factory for the purpose of that session. During the creation of the GAS the user is authenticated and his/her rights for the various Grid operations are checked against the Authorization Services. The GAS model of accessing the Grid is in many ways similar to various Grid Portals, but it is meant to be distributed (the GAS factory can start a GAS in a service environment close to the user in network terms) and is therefore more dynamic and reflects the role of the user in the system. The GAS offers no presentation layer, as it is intended to be used by the application and not by the interactive user. The GAS feature will be very useful as it provides optimised Grid access, which in turn will improve the overall performance of the system. According to [14], in gLite all metadata is application-specific and therefore the applications, and not the core middleware layer, should provide all metadata catalogs. There can be callouts to these catalogs from within the middleware stack through well-defined interfaces, which the application metadata catalogs can choose to implement. In the version of AliEn used in the first phase of the MammoGrid deployment, the metadata is middleware-specific, which not only constrains the design of the metadata but also means that the database is accessed only after a series of AliEn calls, which in turn limits the speed of processing.
4.3. Prototype 2
The P1 design is based on a single-VO architecture, implying a partial compromise over the confidentiality of patients' data. As mentioned in Section 3.2, in the P1 architecture all metadata is centralized at one site, and this requires the copying of some summary data from the local hospital to that site in order to satisfy the predicates required for query resolution. In practice this is not feasible other than in specialist research studies, where appropriate authority for limited replication of medical data between hospitals can be sought and granted. P1.5 provides an ad hoc solution by taking the database out of the Grid to achieve better performance and control of the database; another aim of P1.5 is to achieve a genuinely distributed database architecture. The P2 design will be based on a multi-VO architecture in which there is no centralization of data: all metadata will be managed alongside the local medical data records and queries will be resolved at the hospitals where the local data are under governance. This will provide a more realistic clinical solution, matching more closely the legal constraints imposed by the curation of patient data.
With the incorporation of a multi-VO setup, the overall metadata will become truly federated inside the Grid, providing a more secure solution for the medical community.
Furthermore, full adoption of the Grid philosophy suggests that the federation should not be confined to the database level but should be realized in all aspects of Grid services (including areas such as authentication, authorization, file replication, etc.). The solution to be adopted in the P2 design is a federation of VOs (possibly hierarchical in structure), where the boundaries of the organizations naturally define the access rights and the protocols for data flow and service access between 'islands' that provide and consume medical information. As the P1 and P1.5 designs were based on simple Web Services, interoperability with other middleware was not practically possible. This is mainly because different Grid hosting environments require different underpinning technologies; the lack of common communication protocols results in incompatibilities. While OGSA [15] has adopted a service-oriented approach to defining the Grid architecture, it says nothing about the technologies used to implement the required services and their specific characteristics. That is the task of WSRF [16], which is defining the implementation details of OGSA. WSRF is the standard that is replacing OGSI [17], the previous standard for specifying OGSA implementation. The WSRF working group opted to build the Grid infrastructure on top of Web services standards, hence leveraging the substantial effort, in terms of tools and support, that industry has been putting into the field in the context of Web Services standardization. The focus of the next MammoGrid milestone will be the demonstration of Grid services functionality, and to that end it is planned to implement an (OGSA-compliant) Grid-services-based infrastructure. As a result the focus will be on interoperability with other OGSA-compliant Grid services. The major trend in Grid computing today is a move (from OGSI) towards true web services, and we expect a convergence between different Grid protocols in time for the completion of the existing MammoGrid web-services design.
5. Conclusions
The MammoGrid project has deployed its first prototype and has performed the first phase of in-house tests, in which a representative set of mammograms has been tested across sites in the UK, Switzerland and Italy. In the next phase of testing, clinicians will be closely involved in performing tests, and their feedback will be used to improve the applicability and performance of the system. In its first two years, the MammoGrid project has faced interesting challenges originating from the interplay between the medical and computer sciences, and has witnessed the excitement of a user community whose expectations of a new paradigm are understandably high. As the MammoGrid project moves into its final implementation and testing phase, further challenges are anticipated. In conclusion, this paper has outlined the MammoGrid application's deployment strategy and experiences, together with the strategy being adopted for migration to the new gLite middleware.
Acknowledgements
The authors thank the European Commission and their institutes for support and acknowledge the contribution of the following MammoGrid collaboration members: Predrag Buncic, Pablo Saiz (CERN/AliEn), Martin Cordell, Tom Reading and Ralph
Highnam (Mirada), Piernicola Oliva (Univ. of Sassari) and Alexandra Rettico (Univ. of Pisa). The assistance of the MammoGrid clinical community is warmly acknowledged, especially that of Iqbal Warsi of Addenbrooke's Hospital, Cambridge, UK, and Dr Massimo Bazzocchi of the Istituto di Radiologia at the Università di Udine, Italy. In addition, Dr Ian Willers of CERN is thanked for his assistance in the compilation of this paper.
References
[1] T. Hauer et al., "Requirements for Large-Scale Distributed Medical Image Analysis", Proceedings of the 1st EU HealthGrid Workshop, pp. 242-249, Lyon, France, January 2003.
[2] M. Odeh et al., "A Use-Case Driven Approach in Requirements Engineering: the MammoGrid Project", Proceedings of the 7th IASTED Int. Conference on Software Engineering & Applications, Editor M.H. Hamza, ISBN 0-88986-394-6, pp. 562-567, ACTA Press, Marina del Rey, CA, USA, November 2003.
[3] F. Berman et al., "Grid Computing – Making the Global Infrastructure a Reality", Wiley Series in Communications Networking & Distributed Systems, ISBN 0-470-85319-0, 2003.
[4] P. Buncic et al., "The AliEn system, status and perspectives", Proceedings of CHEP'03, San Diego, March 24th, 2003.
[5] EGEE Project. Details available from http://public.eu-egee.org/.
[6] NDMA: The National Digital Mammography Archive. Contact Mitchell D. Schnall, M.D., Ph.D., University of Pennsylvania. See http://nscp01.physics.upenn.edu/ndma/projovw.htm.
[7] D. Power et al., "A relational approach to the capture of DICOM files for Grid-enabled medical imaging databases", ACM SAC'04, Nicosia, Cyprus, March 14-17, 2004.
[8] S. Bagnasco et al., "GPCALMA: a Grid Approach to Mammographic Screening", Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 518, Issue 1, 2004, pp. 394-398.
[9] S.R. Amendolia et al., "Grid Databases for Shared Image Analysis in the MammoGrid Project", Proceedings of the Eighth International Database Engineering & Applications Symposium (IDEAS'04), IEEE Press, ISBN 0-7695-2168-1, pp. 302-311, Coimbra, Portugal, July 2004.
[10] S.R. Amendolia et al., "MammoGrid: A Service Oriented Architecture based Medical Grid Application", Lecture Notes in Computer Science Vol. 3251, pp. 939-942, ISBN 3-540-23564-7, Springer-Verlag, 2004. (Proceedings of the 3rd International Conference on Grid and Cooperative Computing (GCC 2004), Wuhan, China, October 2004.)
[11] DICOM: Digital Imaging and Communications in Medicine. http://medical.nema.org.
[12] P. Saiz et al., "AliEn Resource Broker", CHEP'03, La Jolla, California, March 24-28th, 2003.
[13] ARDA Project. See http://www.cern.ch/lcg/peb/documents/ardaproject.pdf.
[14] EGEE Middleware Architecture, Document identifier: EGEE-DJRA1/1-476451-v1.0. Available from http://public.eu-egee.org/.
[15] I. Foster et al., "The Physiology of the Grid", Global Grid Forum Draft Recommendation, June 2002. Available from http://www.globus.org/research/papers/ogsa.pdf.
[16] I. Foster et al., "The WS-Resource Framework". Available from http://www.globus.org/wsrf/.
[17] S. Tuecke et al., "Open Grid Services Infrastructure (OGSI)", Global Grid Forum Draft Recommendation, February 17th 2003. GT3, Globus Toolkit 3, available from http://www-unix.globus.org/toolkit/.
Developing a Distributed Collaborative Radiological Visualization Application
Justin BINNS b, Fred DECH a, Matthew MCCRORY b, Michael E. PAPKA b,c, Jonathan C. SILVERSTEIN a,c and Rick STEVENS b,c
a Department of Surgery, The University of Chicago
b Mathematics and Computer Science Division, Argonne National Laboratory
c Computation Institute of The University of Chicago and Argonne National Laboratory
Abstract. By leveraging the advances in today's commodity graphics hardware, adopting community-proven collaboration technology, and using standard Web and Grid technologies, a flexible system has been designed to enable the construction of a distributed collaborative radiological visualization application. The system builds on a prototype application as well as on requirements gathered from users. Finally, constraints on the system are evaluated to complete the design process.
1. Introduction
The goal of this effort is to extend the use of volume rendering in radiology beyond a user's desktop to a variety of different settings. This effort is part of the Advanced Biomedical Collaboration Testbed in Surgery, Anesthesia, Emergency Medicine and Radiology (cci.uchicago.edu/projects/abc), which is developing a technical framework based on the Access Grid. The framework is intended to overcome the inefficiencies and dangers associated with the place-dynamic collaborative workplace that biomedicine has become. An early prototype application, focusing on collaboration and integration with the Access Grid, has been implemented and demonstrated to gain initial feedback and aid in requirements gathering. A description of the prototype is included, as well as a summary of the lessons learned through its development (Silverstein et al., 2005). Modern medical instruments are producing ever larger and more detailed data about the current state of a patient. Understanding this wealth of information is key to making the right decisions during the course of treatment. Everything from highly detailed MRI or CT scans to synthetic data produced from combinations of instruments must be visualized and analyzed in order for the proper plan of action to be reached. Additionally, modern medical science and patient analysis have become a richly collaborative activity. Complex medical problems often involve many physicians and surgeons cooperatively analyzing the same patient data in order to develop a comprehensive treatment plan. Rich collaborative exploration of large data sets by many, potentially geographically dispersed, clinicians would allow a much faster, more complete analysis.
While such systems have been attempted before, they have generally met with only limited experimental success and have typically required very expensive hardware. Advances in computer technology, however, have finally allowed us to attempt a cohesive, practical visualization and analysis tool utilizing commodity hardware and integration with rapidly maturing collaboration tools. This application will allow clinicians to investigate, with teams of collaborators when necessary, the wealth of information available about a patient in order to develop a more comprehensive understanding and a more complete plan of action. A great deal of prior work has been done in fields impacting this application, including collaboration technology, remote and distributed visualization, and computer-based radiological visualization. In the collaborative technology space, the Access Grid in particular is the current state-of-the-art group-to-group collaboration software, resulting from the efforts of a worldwide community (Stevens, 2003). Integrating visualization, initially remote visualization with a single dedicated controller and more recently truly collaborative visualization with shared control, is a more recent effort that has been the focus of work by several groups (Foster et al., 1999; Olson and Papka, 2000; Prohaska et al., 2004). Prior experiments and demonstrations have proven the concept of remote, distributed, collaborative visualization within a scientific workspace – extending those concepts to the biomedical communities is a logical progression. Previous work in radiological visualization and, in particular, collaborative radiological visualization has been carried out utilizing high-end equipment such as ImmersaDesks (Silverstein and Dech, 2004). These high-end immersive displays are very compelling, but impractical for deployment in a clinical setting. Additionally, while many of the fundamental features and capabilities driving the application described in this paper were first explored in those early experiments, recent advances in graphics and computational ability are bringing the application of these techniques to real, high-resolution data. Combining the enhanced capability with the much lower cost of the systems makes direct deployment of distributed collaborative radiological visualization systems within clinical settings a reachable goal. This paper describes an effort underway to construct such an application.
2. Exploration of Capabilities and Requirements via Prototype Application
In order to explore remote rendering capabilities, Access Grid interaction, and the development of a visualization tool using commodity technologies, a prototype distributed collaborative radiological visualization application focusing on these capabilities was built. This prototype was also demonstrated to clinicians to solicit feedback on requirements and feature requests for the full, deployable application. The prototype made use of a variety of technologies to provide true collaborative interaction over wide-area networks utilizing remote visualization techniques on commodity hardware. Four key technologies were used to construct the prototype application:
• integrated collaboration and coordination support was provided by Access Grid technology;
• distributed rendering support was provided by Chromium (Humphreys et al., 2002), with encoding support added, to accelerate the visualization task;
• rendering and data handling algorithms were leveraged from the Visualization Toolkit (Schroeder et al., 1996); and
• XML-RPC was used as a web services protocol to enable low-latency state updates for keyboard and mouse events (St. Laurent et al., 2001).

Figure 1. Design of the prototype distributed collaborative radiological visualization application.
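To make the last point concrete, the snippet below shows the general shape of such a low-latency state update using Python's standard XML-RPC client. The endpoint URL and the method names (update_camera, load_dataset) are hypothetical; the prototype's actual RPC interface is not documented here.

    import xmlrpc.client

    # Proxy for the visualization server's XML-RPC control interface (URL is an assumption).
    server = xmlrpc.client.ServerProxy("http://vis-server.example.org:8000/RPC2")

    # Forward an aggregated mouse-drag event as a camera state update (rotation in degrees, uniform scale).
    server.update_camera({"rotate": [15.0, 0.0, 0.0], "scale": 1.0})

    # Ask the server to load a new DICOM series from its local disk.
    server.load_dataset("/data/dicom/ct_head_series_01")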
The prototype application consists of four major components (see Figure 1), grouped two on the server side and two on the client side. The two server-side components work together to provide the necessary functionality and integration with the Access Grid collaboration system. The first component, the visualization server, is an entirely independent, web-services-controlled DICOM visualization service. The visualization server utilizes Chromium to distribute the rendering and encoding of the image to multiple machines, creating a higher-resolution visualization than would otherwise be possible, at least with H.261 video encoding (which is limited to 352x288 pixels per stream). Additionally, the visualization server uses the Visualization Toolkit to read data from local disk and to volume render those data. Finally, the visualization server utilizes XML-RPC to receive all state updates, including mouse and keyboard interaction to manipulate position, rotation and scale, and complex state updates to vary the parameters of the visualization algorithm (such as alpha map parameters and region-of-interest configuration) and to load new data sets. The second server-side component is the server-side Access Grid integration mechanism. This piece of code provides the Python glue necessary to interact with the Access Grid Toolkit version 2.x. The integration component is responsible for receiving state updates from the Access Grid shared application communication channel and passing them on to the visualization server; it is also responsible for retrieving data from the Access Grid Virtual Venue for analysis and visualization. The client side of the prototype system is also made up of two components. Unlike the server side, the client components may be run by many users simultaneously, allowing truly collaborative interaction with the data. Floor control is handled via social mechanisms, which have been found to be sufficient for a range of applications. The first client-side component is the Access Grid client component. This is a Python application that integrates with the Access Grid Toolkit version 2.x, receiving and changing state using the mechanisms provided by the Access Grid shared application framework. The Access Grid client also includes a user interface
for many of the data visualization parameters available, such as alpha map parameters and data loading operations. Finally, the Access Grid client provides a filter through which mouse and keyboard events collected by the display component are sent to the visualization server. This filter is important, as it allows for the implementation of explicit floor control, should such be required, as well as for the aggregation of events and the isolation of the display component from web services operations. The second client-side component is the display component. This component, based on the OpenMASH VIC video conferencing tool (McCanne et al., 1997), receives multicast video streams from the visualization server, assembles them into a single coherent image, collects mouse and keyboard events, and forwards those events to the Access Grid client. It is this component that provides the most obvious window into the operations of the application, and it is the capture of mouse and keyboard events that provides the rich, intuitive interaction that users have come to expect. The development of the prototype application was an important process that led to a variety of conclusions and lessons learned. Not only were we able to demonstrate the potential benefit of a collaborative visualization experience in radiology, we were also able to explore some features and requirements through direct demonstrations and conversations with clinicians that otherwise would have remained conjecture. Through this experience, we gained a number of insights:
• Performance, even with advanced graphics hardware, continues to be a difficulty. The prototype, while performing adequately to demonstrate the concepts involved, was far too slow and cumbersome to consider deploying in a clinical setting.
• The introduction of synthetic color to the image, while visually compelling, was not particularly significant to the medical professionals who reviewed the demonstration application. We believe that this observation is at least partly due to the simple fact that existing technologies are almost exclusively without color, rather than to any particular aversion to using color to help understand complex data. Further exploration in this area is required in order to understand how the addition of color information may broaden the interpretive capabilities of clinicians.
• Social floor control, while sufficient for small collaborations, could quickly become troublesome. We determined that future applications supporting collaborative visualization should, whenever possible, provide explicit floor control mechanisms.
• Supporting a broad variety of interaction modalities is vital. While the simple modality provided by the prototype application was enough to demonstrate the basic capability of this type of system, the users quickly developed ideas for additional methods of interaction, such as utilizing a wireless tablet to control a high-resolution, possibly stereo, display, or using a heads-up display system in an operating room. Supporting these modes is an important requirement for future development.
The observations above were developed through direct interaction with clinicians and other users during demonstration sessions. These conclusions, as well as several direct conversations with potential users, have helped us develop a better understanding of the requirements for a fully deployable radiological visualization system.
3. Current System
3.1. Requirements Analysis
The observations above were developed through direct interaction with clinicians and other users during demonstration sessions of the prototype application. Additionally, we have considered a number of specific use case scenarios and had several conversations with our potential users regarding desired features of the final application. These activities have helped us develop a set of requirements for the application currently under development. The target use cases for the distributed collaborative radiological visualization application include practical situations wherein medical personnel, including physicians and surgeons, perform real-time analysis of large data sets in order to facilitate life-critical decision-making. These operations may occur in a distributed, collaborative environment, or they may occur in a single-user environment, depending on the size, severity, and complexity of the problem being faced. These usage models lead to a set of strict requirements, including stability, flexibility and performance. Stability is a key factor, as medical instruments must function when required without error or failure. If an instrument proves troublesome, either by working improperly or by failing outright, medical professionals may simply abandon its use. For a new technology to gain traction within the industry, rock-solid, fail-proof operation is a necessity. Flexibility is important in order to allow the application to meet the broad variety of demands that inevitably arise during clinical use. Handling a wide variety of data formats, sizes and resolutions, as well as being able to make use of clusters of machines, when available and appropriate, to improve performance and capability, are all critical system requirements. Having the flexibility of design to adapt to new communication and processing technologies, as well as to new instruments and clinical techniques, is equally important if the application is going to be a long-term success. Finally, performance is a criterion that cannot be overlooked. While many current techniques fail to provide real-time interactive exploration of medical data, achieving real-time exploration must remain the goal. The less delay there is in every aspect of the system – from acquisition of data to interaction with the data and rendering of the results – the more quickly and thoroughly physicians and surgeons can understand the relevant problem, leading to a better-reasoned decision. This high-performance analysis environment is particularly crucial when medical staff are faced with life-threatening situations of extreme illness or injury.
3.2. System Overview
The target platform for the distributed collaborative radiological visualization application consists of a single workstation-class personal computer with a high-performance commodity video card, with the optional addition of either local or remote 'server' resources (i.e., a single machine or small cluster dedicated to providing back-end processing for the primary visualization application). Additionally, a wireless interaction device, such as a Tablet PC, will often be included in a functional deployment. As the application is Access Grid enabled, and makes extensive use of the Access Grid to support its collaborative functions, functioning multicast on a relatively high-bandwidth network (at least 1 Gbps locally, with a wide-area connection capable of at least 100
Mbps) is also required to support these features. In order to support the wireless interaction device, a wireless LAN of sufficient reliability and speed must be available. In the simplest case, however, it should be possible to perform some visualization and analysis even using a single, isolated machine. A flexible system has been designed in order to meet the diverse requirements outlined above on the target platform. This system consists of a set of components, each responsible for one element of the visualization/analysis task. Each of these components has a clear, well-defined interface that provides the requisite functionality, and as the designed interface is the only means of interaction between components, multiple implementations of each component can easily be developed to meet the various capability requirements. The four components that make up the larger application are: the control component, consisting primarily of the user interface in the client context and a web-services interface in the service context; the state management component, which is responsible for maintaining all state for the running instance; the rendering component, which is responsible for the actual visualization and rendering of the data; and the data component, which is responsible for the acquisition and filtering of input data to provide it in the format required by the rendering component. These four components work together to present a highly modular application while maintaining a communication structure simple enough to support the strong reliability requirement necessary for practical use.
3.3. Component Details
The control component is the interface to the application. In the case of either a standalone application session or the client (user display) portion of a distributed session, the control component is primarily responsible for providing the user interface of the application. This includes setting up a rendering context for the rendering component to utilize, as well as building the various windows and menus and processing the resulting events. No state changes are actually made by the control component – rather, events trigger state changes by calling into the state management component. In this way, the particular state management component can perform the appropriate operations with regard to local and/or remote resources of various types, as described below. In a service setting, where the application is providing support of some kind (remote rendering, enhanced display, remote data processing), the control component is primarily a web service provider, again responding to events by calling into the local state management component. The functional difference is that the events are generated by remote procedure calls through the web services interface instead of by user interface actions. The state management component is in many ways the heart of the application. Management of state is essential for the coordination of a collaborative session, and the state management component is designed explicitly for that purpose. By keeping all application state in a single component space, that state may be coordinated with remote service instances, synchronized with an Access Grid shared application session, or simply stored locally, depending on the current work modality, all without dependence on any other aspect of the system.
The state management component is responsible for making the various web service calls necessary to update state on service resources, and is also responsible for integrating with the Access Grid.
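A minimal sketch of the four interfaces described in this section is given below; the method names are illustrative only and do not reflect the project's actual API.

    from abc import ABC, abstractmethod

    class ControlComponent(ABC):
        """User interface (client context) or web-services front end (service context)."""
        @abstractmethod
        def handle_event(self, event):
            """Translate a UI event or remote call into calls on the state management component."""

    class StateManagementComponent(ABC):
        """Single home for all application state; propagates changes to remote or shared sessions."""
        @abstractmethod
        def set(self, key, value): ...
        @abstractmethod
        def get(self, key): ...

    class RenderingComponent(ABC):
        """Renders into a local context, or into a virtual context streamed as video."""
        @abstractmethod
        def render(self, context): ...

    class DataComponent(ABC):
        """Acquires data (disk, instrument, Virtual Venue, URL) and converts it for the renderer."""
        @abstractmethod
        def load(self, source): ...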
Figure 2. The simplest case, whereby the clinician is investigating a radiological data set directly on a workstation class machine. All components are local to the machine, and communication between them is via standard function-call semantics.
The rendering component is the workhorse of the application. It is responsible for reading data from the data component, applying the various rendering algorithms requested by the user, with the parameters dictated by the user (all retrieved from the state management component), and producing imagery to display to the user. Rendering components may be local, in which case they render directly into the rendering context provided by the control component, or they may be remote, in which case they render into a virtual context and stream the resulting pixel data via well-known video streaming protocols. In the case of a remote rendering component, the client application will instantiate a special rendering component that acts as a proxy, retrieving the video stream, decoding it, and displaying it, again, in the rendering context provided by the control component. In this way, the rest of the application need not be aware of the nature of the rendering component, and the user may seamlessly shift from a local session, to a session using remote resources (perhaps to handle larger data or to improve performance by parallelizing the rendering task), to a collaborative session. The final component in the system is the data component. The data component is responsible for the acquisition of data, whether from local cached disk, from an instrument, from an Access Grid Virtual Venue, or even from a URL. The data component is also responsible for inspecting the data, determining its format, and converting it to a form suitable for the rendering component to process. This complex loading and filtering task results in a semi-custom data set that is cached whenever possible, to avoid duplication of effort. The selection of which data to provide is taken from state stored in the state management component, and the data is then delivered directly to the rendering component, in the appropriate format, upon request. Part of the design process for this system involved analysis of the computational and network resources required by the final application. In terms of hardware, there are essentially three classes of compute resources and two functionally distinct networks to consider. First, in the simplest case (see Figure 2), the application is run entirely on a single workstation-class machine, our first class of compute resource. This machine must have a relatively modern graphics card, as well as liberal computational ability, spacious disk facilities, and significant amounts of memory. The second and third use cases have similar computational components, but different network configurations. In the second case (see Figure 3), the clinician has a Tablet PC of some kind that provides a remote control and display mechanism via
Figure 3. A more complex system, where the clinician is investigating data using a wireless Tablet PC running a reduced version of the application, connected to a remote service-oriented instance of the application that is performing the rendering.
Figure 4. The most complex case, wherein multiple clinicians are collaboratively exploring a data set utilizing the Access Grid for coordination and multicast video for rendered image distribution. Here each client instance is running a slim version of the application.
wireless network technologies. A server-class visualization machine provides the computational ability required for the data analysis and visualization. In this scenario, the compute capabilities required of the server-class machine are very similar to those required of the workstation described above. The Tablet PC, on the other hand, has a much lower set of requirements, essentially needing only enough processing power to decode the incoming video and to serialize the web services calls used for control. Any of the modern range of Tablet PC offerings would be more than sufficient for these tasks. In the third case (see Figure 4), which adds Access Grid integration, we add only the requirement that the Tablet PC, or some other client machine, be capable of providing the resources necessary to integrate with, and collaborate using, the Access Grid. These are generally audio and video generation and playback capabilities, together with some minor compute requirements. While a simple Tablet PC may or may not be able to handle these tasks in a large collaboration, in a smaller setting it is likely that the practical requirements for this third use case would differ little from those for the second.
3.4. Use Case Constraints
The constraints placed upon the networking infrastructure for these two use cases, however, are vastly different. In the second use case, the only networking constraint involves the communication between the visualization server and the single client interface. This would need to be wireless; however, the bandwidths are easily manageable. A visualization video stream of, perhaps, 800x600 pixels can easily be compressed to under 5 Mbps of data, and can likely be compressed with minimal loss to under 3 Mbps. In fact, the larger constraint is latency – for user interaction to be comfortable, there can be no more than approximately 100 milliseconds of latency between a user operation and its visible effect, so the network must have sufficiently low latency. In the third use case, we have the additional networking constraints introduced by the Access Grid. These include multicast support and the additional bandwidth required for audio and video traffic on top of the visualization. The introduction of the Access Grid requires the addition of a new machine to operate as a bridge, as most wireless networks deal very poorly with multicast traffic. Additionally, the bridge machine could provide a stream selection service, which would limit the number and bandwidth of non-visualization video streams and allow the clinician to make the best use of the available wireless bandwidth.
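As a back-of-the-envelope check of these figures, the calculation below assumes 24-bit colour and 30 frames per second (both assumptions; the frame rate is not stated above):

    width, height, bits_per_pixel, fps = 800, 600, 24, 30
    raw_mbps = width * height * bits_per_pixel * fps / 1e6            # uncompressed stream, in Mbps
    print("Uncompressed: %.0f Mbps" % raw_mbps)                       # ~346 Mbps
    print("Ratio needed for 5 Mbps: about %.0f:1" % (raw_mbps / 5))   # ~69:1
    print("Ratio needed for 3 Mbps: about %.0f:1" % (raw_mbps / 3))   # ~115:1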
4. Conclusion
Building on the prototype of a shared medical visualization tool and on requirements gathering and analysis, we have developed a high-level system design that meets the needs of clinicians and medical technicians for a deployable visualization system. The system enables the collaborative manipulation of radiological data and simultaneous multimodal communication on commodity, off-the-shelf clients. Beyond that, we have designed and begun implementing a flexible system to increase the use of advanced rendering of medical datasets without requiring specialized graphics hardware on the clients.
Acknowledgement
This work was supported in part by the U.S. Department of Energy under Contract W-31-109-Eng-38 and by the NIH/National Library of Medicine under Contract N01-LM-33508.
References FOSTER, I., INSLEY, J., LASZEWSKI, G. V., KESSELMAN, C. & THIEBAUX, M. (1999) Distance Visualization: Data Exploration on the Grid. Computer, 32, 36-43. HUMPHREYS, G., HOUSTON, M., NG, R., FRANK, R., AHERN, S., KIRCHNER, P. D. & KLOSOWSKI, J. T. (2002) Chromium: A Stream-Processing Framework for Interactive Rendering on Clusters, San Antonio, Texas, ACM Press. MCCANNE, S., BREWER, E., KATZ, R., ROWE, L., AMIR, E., CHAWATHE, Y., COOPERSMITH, A., MAYER-PATEL, K., RAMAN, S., SCHUETT, A., SIMPSON, D., SWAN, A., TUNG, T.-L., WU, D.
& SMITH, B. (1997) Toward a Common Infrastructure for Multimedia-Networking Middleware. 7th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV '97). OLSON, R. & PAPKA, M. E. (2000) Remote Visualization with VIC/VTK. Visualization 2000 Hot Topics. Salt Lake City, Utah. PROHASKA, S., HUTANU, A., KAHLER, R. & HEGE, H.-C. (2004) Interactive Exploration of Large Remote Micro-CT Scans. IEEE Visualization 2004. IEEE Computer Society. SCHROEDER, W. J., MARTIN, K. M. & LORENSEN, W. E. (1996) The Visualization Toolkit: An Object Oriented Approach to 3D Graphics, Prentice Hall. SILVERSTEIN, J. C. & DECH, F. (2004) Precisely Exploring Medical Models and Volumes in Collaborative Virtual Reality. Presence. SILVERSTEIN, J. C., DECH, F., BINNS, J., JONES, D., PAPKA, M. E. & STEVENS, R. (2005) Distributed Collaborative Radiological Visualization using Access Grid. 13th Annual Medicine Meets Virtual Reality. IOSPress. ST.LAURENT, S., JOHNSTON, J. & DUMBILL, E. (2001) Programming Web Services with XML-RPC, O'Reilly. STEVENS, R. (2003) Access Grid: Enabling Group Oriented Collaboration on the Grid. IN FOSTER, I. & KESSELMAN, C. (Eds.) The Grid Blueprint for a New Computing Infrastructure. Morgan Kaufmann.
Clinical Decision Support Systems (CDSS) in GRID Environments
Ignacio BLANQUER a, Vicente HERNÁNDEZ a, Damià SEGRELLES a, Montserrat ROBLES b, Juan Miguel GARCÍA b and Javier Vicente ROBLEDO b
a Universidad Politécnica de Valencia – Departamento de Sistemas Informáticos y Computación (DSIC), Camino de Vera s/n, 46022 Valencia, Spain, {iblanque,vhernand,dquilis}@dsic.upv.es
b Universidad Politécnica de Valencia – Grupo de Bioingeniería, Electrónica, Telemedicina e Informática Médica (BET), Camino de Vera s/n, 46022 Valencia, Spain
[email protected],
[email protected],
[email protected]
Abstract. This paper presents an architecture for searching and executing Clinical Decision Support Systems (CDSS) in an LCG2/GT2 [1,2] Grid environment, using web-based protocols. A CDSS is a system that provides a classification of a patient's illness according to knowledge extracted from clinical practice, using the patient's information in a structured format. The CDSS classification engines can be installed at any site and can be used by different medical users from a Virtual Organization (VO) [3]. All users in a VO can consult and execute the different classification engines that have been installed on the Grid, independently of the platform, architecture or site where the engines are installed or where the users are located. The present paper presents a solution to requirements such as short-job execution, reducing the response delay in LCG2 environments and providing grid-enabled authenticated access through web portals. Resource discovery and job submission are performed through web services, which are also described in the article.
Keywords. LCG2, Web Services, MDS, Clinical Decision Support
1. Introduction
Clinical Decision Support Systems (CDSS) [4,5] are key tools for evidence-based medicine [6], where the quantity of information is large and the correlations that permit conclusions to be extracted are hard to find. CDSS are usually general-purpose systems that, once trained, can be used in different areas such as the classification of soft tissues, anaemias, topology, etc. A CDSS consists of two parts: the training of the classifiers, and the use of these trained classifiers to classify input data. The capabilities of a CDSS are defined by the parameters of the engine: Corpus, Classification and Method. These parameters are also related to its performance. The Corpus corresponds to the type of task that the engine solves (such as Soft-Tissue Tumour or Ferropenic anaemia classification). Classification refers to the set of return values of the engine (the Soft-Tissue Tumour engines can return different output values, such as Benign or Malignant, or the nature of the tissue, and the Ferropenic anaemia engines can return the type of anaemia, such as Alpha Thalassemia, Beta Thalassemia, Beta-Delta Thalassemia, Ferropenic Anaemia, Ferropenia or
Normal), and the Method corresponds to the technique that has been used to train the engine. Other important features are the monitoring parameters of the engine and its Efficacy, which corresponds to the success rate of the classification of new cases considered after the training. The training is the first part of a CDSS; it is a complex and computationally intensive task that requires a large amount of resources and the supervision of experts. Medical experts provide a database that has been correctly evaluated, which is used for training the different classification engines and tuning up their parameters and methods; this requires the adaptation of the data to an adequate format (as an example, the engines that detect Microcytic anaemia have input parameters such as red blood cell count or haemoglobin, and the engines for soft-tissue tumour classification require information such as the localization of the tumour or the patient's age). The accuracy of the decisions depends on adequate training: badly trained engines end up producing an unreasonably high number of incorrect decisions. The techniques used are artificial neural networks, Maximum Likelihood and other heuristic methods, whose parameters depend on the type of problem being treated (such as cancer detection [7–9], pulmonary nodules [10], etc.). The second part, described in this paper, consists of the use of the trained classification engines by medical users to classify new cases. The objective of the work presented in this paper is to foster the use of classification engines by enabling their remote usage. A specific interface is defined for providing the input data to the different classification engines, and executions are performed on the Grid. The Grid Middleware used in this project is LCG2 [2], which is accessed through Web technologies (Web Services). Access to the system is managed through Grid services, taking into account the access privileges of each virtual organization (VO) defined.
2. State of the Art
Remote access to processes is a problem that has been extensively analyzed, and many solutions have been proposed for accessing remote objects and data. An object-oriented precursor standard for remote procedure calls is CORBA [12] (Common Object Request Broker Architecture). CORBA eases the remote execution of processes independently of architecture and location, and shares with the GRID concepts such as resource discovery, integrity, ubiquity and platform independence. CORBA is supported by all main manufacturers for distributed interoperability of objects and data access. The consolidation of the Web as a main infrastructure for the integration of information opened the door to a new way of interoperating through web-based protocols. Pioneering web services started using the Web Interface Definition Language (WIDL), a first specification for describing remote Web services [11]. The evolution of WIDL led to the Simple Object Access Protocol (SOAP) [13], which supports message-oriented as well as procedural approaches. SOAP provides a simple and lightweight mechanism for exchanging structured and typed information between peers in a decentralized, distributed environment using XML. The latest evolution in Web Services is the use of the Web Services Description Language (WSDL). It is a simple way for service providers to describe the basic format of requests to their systems regardless of the underlying protocol (such as SOAP or XML) or encoding (such as MIME).
WSDL is a key part of the effort of the Universal Description, Discovery and Integration (UDDI) [14] initiative to provide directories and descriptions of services for electronic business.
As defined in [19], a Grid provides an abstraction for resource sharing and collaboration across multiple administrative domains. Resources are physical (hardware), informational (data), capabilities (software) and frameworks (Grid middleware). The key concept of the Grid is the Virtual Organization (VO). A VO [20] refers to a temporary or permanent coalition of geographically dispersed individuals, groups or organizational units (belonging or not to the same physical corporation) that pool resources, capabilities and information to achieve a common objective. The VO concept is the main difference between the Grid and many other similar technologies. The Grid is justified when access rights policies must be preserved at each site, without compromising the autonomy of the centres.
Grid technologies have evolved from distributed architectures, and recent trends in the Grid are directed towards Web-service-like protocols. The Open Grid Services Architecture (OGSA) [15] represents an evolution towards a Grid system architecture based on Web services concepts and technologies. OGSA defines uniform exposed service semantics (the Grid service); defines standard mechanisms for creating, naming, and discovering transient Grid service instances; provides location transparency and multiple protocol bindings for service instances; and supports integration with underlying native platform facilities. OGSA also defines, in terms of WSDL, the mechanisms required for creating and composing sophisticated distributed systems, including lifetime management, change management and notification. Although OGSA seems to be an adequate environment, many problems need to be faced to achieve good performance, higher security and true interoperability.
Globus GT2 is the basic platform for many middlewares that have been built on top of it, improving and extending its services. In the DATAGRID [16] project, GT2 was extended to support the distributed storage of a large number of archives, also providing improved performance for VO management, job submission and monitoring, job scheduling and user access. The resulting middleware (EDG) was also extended in the LCG and AliEn projects to fulfil the requirements of the High Energy Physics community. The CDSS mainly uses the job submission and information management services of LCG2 and GT2. Although GT2-based Grid architectures are not designed to work in web environments, and although a service-oriented architecture would have been more appropriate, the availability of a production Europe-wide platform such as EGEE determined the choice of LCG2 as the basic middleware. Moreover, the next middleware to be deployed on EGEE will be service-oriented (gLite [17]), and the CDSS will fit even better once this middleware is available. The advantages of using Grid technologies rather than basic web services lie in the persistence of state in Grid services, the management of security through VOs, and the exploitation of the capabilities of the Grid information services, such as the BDII (Berkeley Database Information Index) or the R-GMA (Relational Grid Monitoring Architecture).
3. Architecture
3.1. General Architecture
The architecture defined comprises four layers. The first layer is the client application layer, providing a user interface for interacting easily and securely with the Grid where
Figure 1. View of a general scheme of the architecture used in this work.
the resources (the engines) are installed. These engines can be consulted and executed using these applications. The second layer acts as a gate to the Grid, implemented through a server that enables the users to connect to the Grid environment transparently via Web Services. The third layer is the Grid environment, comprising two different parts: the searching system to discover available resources (the engines installed in the Grid), and the engine execution system, which also retrieves the results of the executions.
3.2. CDSS Client Layer
This layer contains the components that enable the users to search for the engines installed in the system that fulfil specific criteria defined by the user (corpus, classification type, efficacy level, etc.). When the search has been completed and the engines have been located, the user can run them on the Grid, using the input parameters in the format defined by the case, and retrieve the results of the classification. The client layer accesses web services through a secure and authenticated web protocol (https [23]). The security mechanism is presented in Section 3.6 of this paper. The client can be any application that is able to connect to and discover the services defined in WSDL, creating proxies that interact with them (Java applications, .NET, etc.). In the architecture used in the present work (shown in Figure 1), the applications are stored in a Java [18] Web Start server to ensure that the latest version of the application is always used. The results section of this paper shows the Java application (Figure 4) that accesses the Grid environment through the implemented Web Services; a minimal sketch of such a client call is given below.
3.3. Gate to Grid Layer
The Gate to Grid layer corresponds to the server that provides the users of the CDSS applications with a user-friendly and secure Web view of the Grid environment.
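As an illustration of how a client application in this layer can invoke the Gate to Grid, the following minimal Java sketch posts a SOAP request over https using only standard library classes. The endpoint URL, the searchEngines operation and its parameters are hypothetical placeholders rather than the actual CDSS service interface; in practice the proxies would be generated from the published WSDL, and client-certificate authentication is assumed to be configured through the standard javax.net.ssl keystore system properties.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class CdssClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Gate to Grid endpoint; the real address comes from the published WSDL.
        URL endpoint = new URL("https://gate-to-grid.example.org/cdss/services/EngineSearch");

        // Hypothetical SOAP body asking for engines of a given corpus and minimum efficacy.
        String soapRequest =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
            + "<soap:Body><searchEngines>"
            + "<corpus>Soft-Tissue Tumours</corpus>"
            + "<minEfficacy>0.90</minEfficacy>"
            + "</searchEngines></soap:Body></soap:Envelope>";

        HttpsURLConnection conn = (HttpsURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setRequestProperty("SOAPAction", ""); // SOAP 1.1 servers often expect this header; value assumed
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(soapRequest.getBytes("UTF-8"));
        }
        // The response is an XML document listing the matching engines.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}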
Figure 2. Scheme of the Gate to Grid Layer.
The Gate to Grid has two interfaces. On the one hand, it provides a web interface, implemented in a web services container, which includes services for submitting jobs to the Grid environment through a specific user account that guarantees security. On the other hand, the second interface of the Gate to Grid corresponds to the User Interface of LCG (the Grid environment), installed on the same server as the web services container.
3.4. Searching System
Engine searching is based on the Monitoring and Discovery System (MDS) [21] of Globus 2. This system publishes the information of the engines installed in the Grid/LCG, and this information can be consulted from any site in the Grid. Each engine is installed on a computational node, namely a Computer Element (CE), of the Grid, and a CE can host several engines. Every engine has an XML document containing all the information that describes it (engine name, corpus, efficacy, monitoring parameters, features of the engine, etc.). These documents are the engine reports; there is one individual document per engine, installed on the same CE as the engine itself.
The CE Grid component includes a Grid Resource Information Service (GRIS) [21] and a Grid Index Information Service (GIIS) [21]. The GRIS holds all the information of a CE (architecture, memory, etc.) and also publishes the information added through provider procedures. The provider procedures obtain the information specific to the CDSS services, defined by the administrator using an LDIF [22] structure, and add this information to the GRIS. The GRIS component is an LDAP server [22] that stores the consolidated information formatted using an LDIF structure. The provider procedure defined here reads the information of each installed engine from its XML report and adds it to the GRIS with the LDIF structure previously defined. The GIIS is another LDAP server that contains all the information of the GRIS plus other information about the Grid components of the LCG infrastructure; the information of the GIIS is updated asynchronously from the GRIS. Another important Grid component of the LCG middleware is the BDII. This component is specific to LCG, whereas GRIS and GIIS are part of Globus 2 and thus also part of LCG. The BDII collects the information of all the GIIS and all the CEs of the Grid; it is also an LDAP server that consolidates the information provided by the GIIS and is updated asynchronously.
When a search is performed, engines can be selected according to the fields published in the MDS (GRIS, GIIS and BDII). The search is started by the user through the Web services defined at the Gate to Grid, which query the BDII through the User Interface and retrieve the resulting engine information.
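To make the search concrete, the following sketch shows how the Gate to Grid could query the BDII with the standard Java JNDI/LDAP API. The host name is a placeholder, and the attribute names (CdssCorpus, CdssEfficacy) are illustrative stand-ins for the engine-report fields published by the provider procedure; port 2170 and the base DN mds-vo-name=local,o=grid are the values commonly used by LCG information indices and are assumed here.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class EngineSearchSketch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://bdii.example.org:2170"); // hypothetical BDII host

        DirContext ctx = new InitialDirContext(env);
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

        // Illustrative filter: select engines of a given corpus with efficacy above a threshold.
        String filter = "(&(CdssCorpus=Soft-Tissue Tumours)(CdssEfficacy>=0.90))";
        NamingEnumeration<SearchResult> engines =
            ctx.search("mds-vo-name=local,o=grid", filter, controls);

        while (engines.hasMore()) {
            SearchResult engine = engines.next();
            System.out.println(engine.getNameInNamespace()); // DN of the matching engine entry
            System.out.println(engine.getAttributes());      // its published report fields
        }
        ctx.close();
    }
}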
Figure 3. Scheme of the Searching System.
3.5. Execution System
The execution system runs the engines found in the Grid by the searching system. The user runs the classification with the input parameters that characterise each case. The request for execution is encoded in an XML document containing all the information required (input parameters, selected engine, etc.). This document is sent through a web service to the Gate to Grid, which translates the XML document into LCG commands that submit the jobs through the User Interface. The return values are encoded in an XML document, in the tag related to the output parameters, and finally returned to the user.
3.6. Security
With respect to security, two parts must be taken into account: on the one hand, the security within the Grid environment, and on the other hand, the security of the web services implemented. Grid security is provided by the Grid Security Infrastructure (GSI), which enables the authentication of users on the network where the Grid is installed. GSI is integrated into LCG, since GSI is part of GT2 and LCG is based on it. GSI allows authentication and secure communication on open networks and provides several services for the Grid that enable single sign-on. GSI is based on public-key (asymmetric) encryption, X.509 certificates, and the SSL (Secure Sockets Layer) communication protocol. Extensions to these standards have been added to permit single sign-on and delegation. GSI provides the capability of delegation: an extension of the SSL protocol that minimises the number of times the user must introduce the access key. Grid computing requires the use of different resources (each requiring authentication) as if the requests came from the actual user. The need to introduce the access key for each access is avoided by creating a proxy. A proxy consists of a certificate (including the public key) and a private key, which determines the identity of a user. The certificate is signed by the
Certificate Authority (CA) [23]. Moreover, the proxy certificate includes a timestamp that indicates when the permissions delegated to the proxy will be revoked.
On the other hand, it is also important to guarantee the security of the web services and the authentication of their users. This problem is solved using a standard secure web protocol (https), which is based on SSL and guarantees the secure transfer of data; it also relies on encryption using key pairs and X.509 certificates. The web services container holds a certificate that permits connections through secure channels, and each user has a personal certificate that guarantees his/her unique identity. X.509 certificates are signed by a certificate authority (CA) [23] trusted by the platform. When a user submits the execution of a task through the web services, the user is identified by his/her personal certificate. This certificate is obtained by converting the Grid certificate issued by a CA trusted by the Grid. The browser certificate only gives access to the web service: the Distinguished Name (DN) is extracted from the certificate and compared to the list of DNs authorised to use the service (listed in the VO database). If the user is included in this list, the user can access the Grid through a proxy. Currently all web users are mapped onto a single Grid user. This mapping from web users to Grid users in the Gate to Grid server guarantees the security of the system and permits interaction with the Grid through web-based protocols, transparently for the users.
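A minimal sketch of the authorisation step described above is given below: the DN is read from the user's X.509 certificate and compared against the list of authorised DNs. The certificate and member-list file names are hypothetical, the VO membership is reduced to a flat file for illustration, and DN normalisation (the VO database and the Java API may format DNs differently) is glossed over.

import java.io.FileInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import java.util.List;

public class VoAuthorisationSketch {
    // Returns true if the certificate's subject DN appears in the VO member list.
    static boolean isAuthorised(X509Certificate userCert, List<String> voMembers) {
        String dn = userCert.getSubjectX500Principal().getName();
        return voMembers.contains(dn);
    }

    public static void main(String[] args) throws Exception {
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        X509Certificate userCert;
        try (FileInputStream in = new FileInputStream("usercert.pem")) { // hypothetical path
            userCert = (X509Certificate) factory.generateCertificate(in);
        }
        // Hypothetical flat file standing in for the VO membership database.
        List<String> voMembers = Files.readAllLines(Paths.get("vo-members.txt"));
        System.out.println(isAuthorised(userCert, voMembers) ? "access granted" : "access denied");
    }
}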
4. Results
An application that searches for and executes services installed in a Grid for the classification of medical data is presented, together with the web protocols used to access the Grid. A software library has been implemented in the client layer that interacts with the web services provided by the Gate to Grid server. Client certificates have been generated for each test user that requires access to the Grid platform presented in this work. The application has two parts: on one side, the interface for searching, which enables searching for the different engines using the process described in Section 3.4; on the other side, an interface for running the previously selected engines, using specific input parameters that are filled in by the user.
The system deployed for the test has a Gate to Grid server with a web services container where the web services are installed. These services are used to connect the users with the Grid platform through a User Interface (UI) installed on the same server. The Grid platform deployed comprises two Computer Elements, a Berkeley Database Information Index and the server with the UI. The network used for the test is a 100 Mbps TCP/IP network. Seven engines have been installed on the test platform: four are devoted to the analysis of microcytic anaemias (Alpha Thalassemia, Beta Thalassemia, Beta-Delta Thalassemia, Ferropenic Anaemia, Ferropenia and Normal) and three are devoted to the classification of Soft-Tissue Tumours (Benign and Malign). All these engines are distributed over the different CEs installed in the Grid. The application interfaces for searching and executing available engines are shown in Figures 4 and 5. The interface shown in Figure 4 enables the searching of engines using the criteria published in the Monitoring and Discovery System (MDS); in the test platform the published fields are Corpus, Classification, Efficacy and Method.
Figure 4. Search System User interface.
Figure 5. Execution System User interface for microcytic anaemia classification engines.
Figure 5 shows the user interface for running the selected engines, once an engine has been found. The interface changes depending on the engine selected, since the input arguments differ from case to case; the image in Figure 5 corresponds to the microcytic anaemia classification engines.
5. Conclusions
This paper describes the design and implementation of a Grid-enabled application for Clinical Decision Support. This application makes use of the Grid not only for training the classification engines, but also for the discovery of available resources and for access to the application. The information system of the Grid is extensively used for the indexing and dynamic registration of services. The most relevant conclusions of the work presented are:
• Medical users can access a Grid platform transparently through web protocols.
• Medical users can search for classification engines and run them on open networks (such as the Internet). This access is provided through a software library that contains all the methods needed for searching and executing them, so the use of the Grid is open to new applications.
• The maintenance of the system is simple. The installation of a new engine on a CE of the Grid only requires copying the executable and the associated XML document containing all its information. Once the files are copied, the system is updated automatically and asynchronously through the MDS.
• The security and confidentiality of the data in the system is guaranteed. Users have to install browser certificates converted from their original Grid certificates, and authorisation is granted by consulting the VO list of members, in seamless integration with the Grid security.
• Grid technologies are suitable for this type of problem, since users can access and manage classification engines as if they were installed locally, and the Grid provides the appropriate means for sharing resources transparently and securely.
6. Future Work
The next steps planned for the continuation of the work presented in this paper are the following:
• To integrate the training of the engines into the services provided by the software library. Training is the task with the highest computational cost and needs many resources to reduce the computation time.
• To open the architecture to new areas related to pattern recognition, such as the search for correlations between genetic alterations and schizophrenia.
• To implement a finer-grained procedure for user mapping. Currently, the system maps all web users to a single Grid user. The system can be modified to map each web user to a specific Grid user, reducing the security risks. Moreover, groups sharing common security policies beyond individual rights can be defined.
• To support the architecture defined here on other Grid platforms. Interesting Grid platforms are OGSA/OGSI and OGSA/WSRF, which could be adequate for this problem since they are based on web protocols.
• To integrate new pattern recognition engines into the CDSS, deploying them in a medical environment.
• To develop new applications in the client layer using the new pattern recognition engines that will be integrated.
References
[1] Globus Alliance home page, "Relevant documents", http://www.globus.org.
[2] LHC Computing Grid Project, http://lcg.web.cern.ch/LCG.
[3] I. Foster (Ed.), "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, ISBN 1-55860-933-4, July 1998.
[4] Van der Lei J, Talmon JL (1997). Clinical decision-support systems. In: van Bemmel JH, Musen MA (eds), Handbook of Medical Informatics. Springer, Berlin Heidelberg New York, pp 261-276.
[5] Van Bemmel JH (1997). Methods for decision support. In: van Bemmel JH, Musen MA (eds), Handbook of Medical Informatics. Springer, Berlin Heidelberg New York.
[6] Lisboa PJG (2002). A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks 15:11-39.
[7] Baker JA, Kornguth PJ, Lo JY, Floyd CE (1996). Artificial neural network: improving the quality of breast biopsy recommendations. Radiology 198:131-135.
[8] Abdolmaleki P, Buadu LD, Murayama S et al. (1997). Neural networks analysis of breast cancer from MRI findings. Radiat Med 15:283-293.
[9] Floyd CE, Lo JY, Yun AJ, Sullivan DC, Kornguth PJ (1994). Prediction of breast cancer malignancy using an artificial neural network. Cancer 74:2944-2948.
[10] Gurney JW, Swensen SJ (1995). Solitary pulmonary nodules: determining the likelihood of malignancy with neural network analysis. Radiology 196:823-829.
[11] Scott Short, Creación de servicios Web XML para la plataforma Microsoft .NET. McGraw-Hill/Interamericana de España, Madrid, 2002. ISBN 8448137027.
[12] Object Management Group, "CORBA Corner", http://www.omg.org.
[13] "Simple Object Access Protocol (SOAP)", http://www.w3c.org.
[14] "Universal Description, Discovery and Integration (UDDI)", http://www.uddi.org.
[15] Ian Foster, Carl Kesselman, Jeffrey M. Nick, Steven Tuecke, "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration", Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002, http://www.globus.org/research/papers/ogsa.pdf.
[16] The DATAGRID project, http://www.eu-datagrid.org.
[17] Enabling Grids for E-science in Europe, http://public.eu-egee.org.
[18] Steven Holzner, La biblia de Java 2. Anaya Multimedia, Madrid, 2000. ISBN 8441510377.
[19] Expert Group Report, "Next Generation GRID(s)", European Commission, http://www.cordis.lu/ist/grids/index.htm.
[20] Foster I., Kesselman C., "The GRID: Blueprint for a New Computing Infrastructure", Morgan Kaufmann Publishers, Inc., 1998.
[21] Globus Alliance home page, Software - Information Services, http://www-unix.globus.org/toolkit/mds/.
[22] OpenLDAP. OpenLDAP Software is an open source implementation of the Lightweight Directory Access Protocol. http://www.openldap.org.
[23] Simson Garfinkel, Web Security and Commerce. O'Reilly, Cambridge, 1997. ISBN 1565922697.
[24] Duda RO, Hart PE, Stork DG (2001). Pattern Classification. Wiley, New York.
[25] Bishop CM (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York.
From Grid to Healthgrid T. Solomonides et al. (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
Grid-Enabled Biosensor Networks for Pervasive Healthcare
I. CHOUVARDA, V. KOUTKIAS, A. MALOUSI and N. MAGLAVERAS
Lab of Medical Informatics, Aristotle University of Thessaloniki, Greece
Abstract. Current advances in biosensor technology allow multiple miniaturized or textile sensors to continuously record biosignals, such as blood pressure or heart rate, and to transmit the information of interest to clinical sites. New applications based on such systems are emerging, moving towards pervasive healthcare. This paper describes an architecture enabling biosensors, forming a Body Area Network (BAN), to be integrated in a Grid infrastructure. The Grid services proposed, such as access to recorded data, are offered via the BAN console, an enhanced wearable computer where the recordings of multiple biosensors are integrated. Medical Grid-enabled Nodes can have access to biosensor measurements on demand, or can agree to receive notifications and alerts. Thus, in such a distributed environment, data and computational resources are independent, yet cooperating unobtrusively, contributing to the notion of pervasive healthcare. Keywords. Biosensor, Body Area Network, Pervasive Healthcare, Grid Services
1. Introduction
The vision of ubiquitous computing requires the development of devices and technologies that can be pervasive without being intrusive. New biosensor technologies can help towards unobtrusive and pervasive patient monitoring. The basic components of such an environment are small nodes with sensing and wireless-communication capabilities, able to organize flexibly into a network for data collection and delivery. In general, communication in wearable networks is characterized by varying performance requirements and much stricter power consumption constraints than those met in most standard mobile systems. The network has to combine sensors and processing modules of the medical applications, placed at different locations on the user's body, forming a distributed heterogeneous system. Some of the challenges regarding the development of a biosensor network are related to effective data sharing, scalability, and security policies.
Grids constitute a beneficial perspective to tackle these requirements. The idea in Grid-based distributed computing environments is the provision of a flexible, controlled, secure, and coordinated resource-sharing framework among dynamic collections of individuals, institutions, and resources, referred to as Virtual Organizations (VOs) [1,2]. Resource-sharing among VOs concerns direct access to computers and software, data, knowledge and other resources, as required by a wide range of collaborative problem-solving and resource-brokering strategies. Within a Grid environment, resource-sharing is highly controlled, with resource/service providers and requesters clearly defining what is accessed, who is allowed to access it, and the conditions under which sharing may occur.
Specifically, controlled access to medical data obtained by the biosensors in a pervasive health monitoring system may be provided via a Grid infrastructure. Among the most interesting options of such a scheme is the execution of advanced distributed queries on the medical data, or of queries based on medical data semantics. Although such queries can also be performed by Web Services, the introduction of Grid technology would allow, through the definition of behaviours, the active participation of all nodes and dynamic distributed resource management. Different approaches could be proposed, such as the direct participation of each biosensor in a Grid, or the participation of a group of biosensors in a Grid via a representative node. Some initial steps in this direction can be found in the literature. In [3], scalable and flexible remote medical monitoring applications on the Grid are described, adapting Globus Toolkit 3 (GT3) and using intermediate devices that support Grid services related to sensor functionality. The Open GIS Consortium (OGC) is building an open platform for exploiting Web-connected sensors, and proposes SensorML as an information model and encoding for discovering, querying and controlling Web-resident sensors [4]. On the other hand, the Grid Monitoring Architecture (GMA) working group is focused on producing a high-level architecture statement of the components and interfaces needed to promote interoperability between heterogeneous monitoring systems on the Grid [5].
In the current work, an architecture is proposed that supports the indirect participation of biosensors in a Grid. Due to the general limitations regarding the development of Grid-enabled biosensors, a computer system is proposed instead, playing the role of a base station that integrates the networked group of biosensors in a Grid framework and interacts with other Medical Grid Nodes. The notions of Producer and Consumer of data are adopted and the services of the corresponding Grid Nodes are described. This perspective constitutes a realistic scenario for an indirect means of biosensor participation in a Grid infrastructure, further illustrated by the application scenario elaborated later in this paper.
2. The Biosensor Network
A wearable platform enabling the continuous monitoring of signals such as ECG, blood pressure, respiration and motion activity is assumed to constitute the Body Area Network. The system mainly consists of a group of biosensors, placed in different locations of the human body to best acquire the desired signals, and of the BAN console, a central unit which is a kind of wearable computer. In this manner, multiple sensors are integrated in a BAN, forming a local sensor network (Figure 1). The BAN console, controlling sensing and communications, plays an important role, especially from the viewpoint of the intelligence incorporated, in order to ensure a system with enhanced capabilities, enabling secure and seamless communication. However, this BAN is required to be invisible to the patient, following the key direction towards ambient intelligence and pervasive healthcare services. Thus, typically, the BAN console is a lightweight device with efficiently powered computational capabilities, which does not interfere with a person's ordinary comfort or activities.
As far as communication is concerned, there are basically two types of wireless communication links in the proposed biosensor network: a) between the biosensors and the BAN console, e.g., via Bluetooth, and b) between the BAN console and the Medical Community, e.g., via WLAN. Regarding the latter, i.e., communication between the BAN console and the Medical Community, which has to be wireless in a
Figure 1. The BAN console and the biosensor network.
pervasive system, there are two possible scenarios, which can be followed exclusively or in a combined mode: a) the BAN console integrates the measurements of the sensors, processes them and produces some overall classification of the patient condition, and some notification for the medical personnel, especially in urgent situations, and b) the BAN console only collects measurements and forwards them to the clinical site, where archiving and processing take place. While the latter case implies a passive store-and-forward role for the BAN, in the current work the focus is placed on an active BAN console, participating as an independent actor in a distributed health system.
3. Grid Added-Value for Medical Sensor Networks
Such a multisensor telemedicine system (Figure 2) has characteristics that require careful handling. The system is distributed, in the sense that many BANs have to interact with many medical clients for data or information retrieval. There is local medical data processing in the BAN and possible interaction with medical personnel, i.e., computational resources are also distributed. Each BAN might be different and might be further extended. Resource management, in terms of storage, computation and communication, has to be optimized due to the low resources of sensor networks. Roles and access requirements are complex, and security is a pressing issue. Service provision has to be controlled. Therefore, a formalization of the architecture and the services is required.
The Web Services approach is becoming a dominant mechanism for distributed application interfaces. Combining Grids with Web Services brings dynamic resource management to Web Services, by adding behaviours (ports) defined to perform service initiation and lifecycle management of transient, stateful services, i.e., dynamically created services. The proposed Grid-based architecture can contribute towards the aforementioned requirements of a multisensor telemedicine system. Typically, a BAN would work in a store-and-forward way; in the proposed setup, information can be filtered and circulated asynchronously, upon demand, and as a result of negotiation. Services can be initiated by the BAN or by the relevant Medical application. Among the benefits of adopting a Grid-enabled environment, especially in the context of healthcare applications [6], is the ability to:
Figure 2. The abstract scheme of the telemedicine monitoring system based on the biosensor network.
1. facilitate the distribution of biosensor data over decentralized resources,
2. enforce the use of common standards for secure and consistent biomedical data interchanges and provide robust environments,
3. enrich the application fields enabling large-scale analysis, such as epidemiology studies, and
4. allow distributed institutes to access the resources (processing modules and data repositories), enabling the performance of large-scale remote experiments.
4. An Architecture for Grid-enabled Biosensor Networks
In the proposed scheme, which is based on the concepts proposed by the GMA specification [5], there are four distinct types of parts:
• Directory Service: supports information publication and discovery.
• Producer: makes sensor data available. In this case, the BAN console of the sensor network constitutes a Grid Node and has the role of the "Producer". The BAN console is viewed as a "hyper-sensor" connected to the Grid, integrating and making available all the sensed data of the biosensor network, as well as the interpretation results (and possible alerts and notifications) of local data fusion and processing.
• Consumer: receives sensed data or processing outcomes. Medical Grid Nodes that implement different kinds of Consumer interfaces can request and access the medical or technical data of each BAN console, which are of interest to the Medical Community as well as to the technical staff related to the telemedicine service.
• Intermediaries: support resource and load management.
Data discovery is separated from data transfer by placing metadata about Consumers and Producers in a universally accessible location, called here a “Directory Service”, along with enough information to initiate the communication between them.
Figure 3. Grid-enabled BAN and Medical Nodes scheme.
When this information is retrieved from the Directory Index, communication is established between Consumer and Producer, and the data, which make up the majority of the communication traffic, travel directly from their Producers to the relevant Consumers. The system can be extended so that BAN Nodes can be both Producers and Consumers, and so can Medical Nodes. The components and their interactions (Figure 3) are further described in the following sections.
4.1. Directory Service Interactions
Producers and Consumers publish their existence in Directory Service entries. The Directory Service, or Registry, is used to locate Producers and Consumers. Information on the services supported by a given Producer would be published in the Directory Service, along with the service data information. It has to be noted that the Directory Service holds both the schema and the actual records of Consumers and Producers, and can be searched by different criteria. The schema of the service data also resides in the Directory Index, although the actual data reside in the Grid Nodes. Consumers can use the Directory Service to discover Producers of interest, and vice versa. Either a Producer or a Consumer may initiate the interaction with a discovered peer. In either case, the communication of control messages and the transfer of data occur directly between each Consumer-Producer pair without further involvement of the Directory Service. Therefore, a description of the actors is required to efficiently provide the information about the target of interest. For example, when a BAN Node looks for medical personnel, Medical Nodes need to be described according to their specialization, their relation to the specific disease monitored by the BAN, and even their authorization to access data. When a Medical Node wants to query monitored data, the BANs have to be described in a way that makes explicit whether they can offer the requested data.
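As a sketch of the kind of actor description discussed above, the following Java fragment models a Producer (BAN) entry as it might be registered in the Directory Service, together with a simple match against a Consumer's criteria. All field names and types are assumptions introduced for illustration; the paper does not prescribe a concrete schema.

import java.util.List;

// Illustrative Directory Service entry for a Producer (a BAN console).
class ProducerEntry {
    String banId;                  // identifier of the BAN console
    String organisation;           // organisation the patient/BAN is related to
    String monitoredDisease;       // e.g. "sleep apnea" or "hypertension"
    List<String> availableSignals; // e.g. "blood pressure", "heart rate"
    String contactEndpoint;        // where to initiate Producer/Consumer communication

    // True if this Producer can serve a Consumer asking for a given signal,
    // disease and organisation; authorisation checks are omitted here.
    boolean matches(String signal, String disease, String org) {
        return availableSignals.contains(signal)
                && monitoredDisease.equalsIgnoreCase(disease)
                && organisation.equalsIgnoreCase(org);
    }
}

A Medical Node would retrieve such entries from the Directory Index and apply this kind of filtering before contacting the selected BANs directly.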
4.2. Producer/Consumer Interactions
Three types of interactions are defined for transferring data between Producers and Consumers: publish/subscribe, query/response, and notification.
The publish/subscribe interaction consists of three stages. In the first stage, the initiator of the interaction (this may be either a Producer or a Consumer) contacts the "server" (if the initiator is a Consumer, the server is a Producer, and vice versa), indicating interest in some set of data. The mechanism for specifying the data of interest has to be based on a formal representation of the sensors' outcomes and the whole BAN/user profile. At this point, the state held in both the Producer and the Consumer is called a subscription. Additional parameters needed to control the data transfer are also negotiated in this stage. For example, conditions may be supplied by the Consumer to the Producer with the subscription request, so that the Producer sends data only when the conditions are met. In the next stage of the interaction, the Producer sends data to the Consumer. Finally, either the Producer or the Consumer terminates the subscription.
For the query/response interaction, the initiator must be a Consumer. The interaction consists of two stages. The first stage sets up the transfer, similarly to the first stage of publish/subscribe. Then, the Producer transfers all the requested data to the Consumer in a single response, much like the request/reply protocols in HTTP.
The notification interaction is a one-stage interaction, and the initiator must be a Producer. In this type of Producer/Consumer interaction, the Producer transfers all the relevant data to a Consumer in a single notification. The notification might trigger a new query/response interaction.
4.3. Intermediaries
A consequence of separating data discovery from data transfer is that the protocols used to perform the publish/subscribe, query/response, and notification interactions described above can be used to construct intermediaries that forward, broadcast, filter, or cache the monitored medical information. This is particularly important in the case of biosensor networks, since optimal energy resource management is required. When Producers' data are requested by many Consumers, these intermediate components can take over some of the Producers' load, with subsequent reductions in network traffic. The building block for these advanced services can be the compound Producer/Consumer, which is a single component that implements both Producer and Consumer interfaces. Specifically, job scheduling and data cache services would be useful in such a component. For example, a Consumer interface might collect data from several Producers, use that data to generate a new derived data type, and make that available to other Consumers through a Producer interface.
4.4. Benefits of the Proposed Scheme
For an eHealth system including many users with wearable biosensor systems (many Producers), many sensors making up each biosensor system (which have to be "represented" by the BAN console), and multiple medical applications and users (many Consumers), the proposed scheme can support self-organizing, self-healing collections and complex Producer subscriptions. Among the general properties of the scheme is that it preserves the local autonomy of resource owners, i.e., distributed remote storage and execution.
Table 1. Description of Services and Service Data Elements.
Service: Medical Node service (general)
Methods of Invocation: Locate_BAN, Subscribe_to_BAN, UnSubscribe_from_BAN, Maintain_registration_to_DI, Accept_Alert, Accept_subscribe_from_BAN, Request_sensor_measurement, Request_technical_status
Service: BAN service
Methods of Invocation: Maintain_registration_to_DI, Accept_subscribe_from_Medical_Node, Accept_unsubscribe_from_Medical_Node, Locate_Medical_Node, Subscribe_to_Medical_Node, UnSubscribe_from_Medical_Node, Get_conditional_sensor_measurements, Get_aggregated_sensor_status, Get_conditional_technical_status, Send_alert
Information flow control is distributed. Resource discovery mechanisms can be supported that enable locating, co-scheduling and dynamic management of many resources. Resource registration and de-registration can be automatic. Regarding the Consumers, i.e., the Medical Nodes, they can dynamically discover the Producers (BAN consoles or "hyper-sensors") of interest, and accordingly query them and receive medical or technical data. The Producers can accept subscriptions, support querying and filtering, deliver data, and manage their registration. The transmitted data, based on subscriptions, are structured. On the other hand, interaction can also be initiated by the Producers. Task scheduling can be further assisted by intermediaries. Semantic representation of the resources not only copes with the heterogeneity of the data, but also supports the controlled co-operation among diverse actors and technologies.
4.5. Services and Schemas
As already mentioned, the BAN console can be viewed as the "hyper-sensor", integrating a set of measurements coming from different biosensors connected (wirelessly) to the local biosensor network. Therefore, the schema of the BAN console has to reflect all this nested information actually deriving from multiple sources. The BAN Service is considered as a service aggregating sensor information, in the sense of both status and data. Furthermore, it makes available overall interpretation and alert information. The Medical Node should be described according to the organization it represents and the diseases it is responsible for monitoring. Table 1 and Table 2 describe in more detail the services and the related data elements included.
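To indicate how the BAN ("Producer") service of Table 1 might look to a client, the following Java interface lists its operations. Only the operation names come from Table 1; the parameter and return types are assumptions, since the actual GWSDL definitions are not reproduced here.

import java.util.List;

// Illustrative Java view of the BAN service operations listed in Table 1.
interface BanService {
    void maintainRegistrationToDI();
    boolean acceptSubscribeFromMedicalNode(String medicalNodeId, String condition);
    void acceptUnsubscribeFromMedicalNode(String medicalNodeId);
    List<String> locateMedicalNode(String specialisation, String disease);
    void subscribeToMedicalNode(String medicalNodeId);
    void unSubscribeFromMedicalNode(String medicalNodeId);
    List<SensorMeasurement> getConditionalSensorMeasurements(String condition);
    String getAggregatedSensorStatus();
    String getConditionalTechnicalStatus(String condition);
    void sendAlert(String medicalNodeId, String alertMessage);
}

// Placeholder type for a single sensor reading (not defined in the paper).
class SensorMeasurement {
    String sensorId;
    long timestamp;
    double value;
}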
5. An Application Scenario
A system of biosensors is assumed for monitoring patients with hypertension and sleep apnea, during activities of daily life or during sleep respectively. Multiple sensors (measuring, for example, heart rate, blood pressure, oxygen saturation, respiration, etc.), possibly different for each patient, are connected to the BAN console, which constitutes the Grid Node. The BAN console processes the recorded signals locally and produces an interpretation, resulting in information about the patient's situation at a given time and, possibly, alerts in case the situation is urgent. The medical personnel need to be informed about critical conditions and only occasionally retrieve the raw recorded data. In such a system, there are many types of candidate Consumers; a few are listed in Table 3 as illustrative examples.
Table 2. The complex data types that should be defined in the corresponding XML schema describing the Service Data Elements (1st row) and the part of the GWSDL file describing the BAN Service, corresponding to the Service Data Element that should be included (2nd row).
Table 3. Types of candidate Consumers in the application scenario.
Technician: Collects scheduled periodic information about the technical condition of the biosensors.
Medical Alert Processor: Receives notifications and alerts from BAN consoles and processes them properly, i.e., implements an interface that can be accessed by the clinicians for requesting medical advice, or automatically triggers some further action.
Medical Data Archiver: Aggregates and stores medical data in long-term storage for later retrieval or analysis. An Archiver might also act as a Producer, when the data are retrieved from storage.
Real-time Monitor: Collects medical data in real time for use by online analysis tools. An example is a tool that graphs (HR-BP) information in real time.
Overview Monitor: Collects data from several sources and uses the combined information to make a decision that could not be made on the basis of data from only one Producer.
Some example interactions among the BAN Grid Nodes and the Medical Nodes are mentioned below, to illustrate the scenario.
• For sleep apnea patients, the technical personnel (Consumer) need to verify the correct operation of each multi-sensor Node (Producer) before sleep, so a query is made to all the relevant Nodes, asking for the technical validation of each one. Each BAN corresponding to a sleep apnea patient will respond with a technical report. When a problem is located, the technical personnel are expected to act accordingly.
• For research, modelling and archiving purposes, the medical personnel (e.g., via a medical center of competence) may request the recorded data on a continuous basis for 24 hours. The intermediary will take the responsibility of fetching all the data to be sent to the Archiver, which will in turn notify the Medical Node when this is accomplished. Accordingly, the Medical Node will request the data for processing.
• A physician may request hypertension-related data every night before sleep and early in the morning, for three days.
• When the condition of a hypertensive patient becomes rather urgent during intensive activity, the BAN sends a notification or an alert to the relevant medical personnel (the corresponding physician). This notification initiates a request/response interaction for the next half hour, to help the physician assess the patient's condition.
• A physician may set a new alert condition: "Send me Blood Pressure and Heart Rate data for half an hour, when heart rate variability changes according to a specific pattern". This condition is negotiated and accepted during the subscription procedure.
Specifically, the application scenario and the series of interactions among Grid Nodes are further illustrated in the following two examples.
Example 1. Technical scenario: "obtain the current technical condition of the sensor networks for sleep apnea patients related to our organization". The Medical Node searches the Producer schema in the Directory Index for registered BANs that monitor "sleep apnea" patients from the specific organization and whose technical information is exposed via a specific interface. For each of the located BANs, the Medical Node initiates a query with Request_technical_status, based on the information known about the BAN schema. Each BAN Node accepts the query, performs the relevant technical check, and responds with Get_aggregated_sensor_status, which contains information about all the sensor elements, structured as in the technical condition SDE. This example highlights the virtue of the proposed design towards semantically enriched services.
Example 2. Archiver scenario: "for the next half hour, retrieve the Blood Pressure for hypertensive patients related to our organization". The Medical Node searches the Producer schema in the Directory Index for accessible registered BANs of hypertensive patients where Blood Pressure measurements are available. The Medical Node subscribes to each of the located BAN Nodes for the specific type of sensor measurement. Accordingly, after accepting the subscription, each BAN Node responds continuously with Get_conditional_sensor_measurements. After half an hour, the Medical Node unsubscribes from each BAN Node. This example illustrates service initiation and lifecycle management of dynamically created services.
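A Consumer-side sketch of Example 2 is given below, using the operation names of Table 1. The client stub, its types and the periodic polling loop are assumptions meant only to make the subscribe/collect/unsubscribe lifecycle explicit; in the publish/subscribe interaction described in Section 4.2 the BAN would push the measurements to the Consumer rather than being polled.

import java.util.List;

public class ArchiverScenarioSketch {
    // Hypothetical client stub for a located BAN Node (e.g. a generated Grid service proxy).
    interface BanNodeStub {
        boolean acceptSubscribeFromMedicalNode(String medicalNodeId, String condition);
        List<String> getConditionalSensorMeasurements(String condition);
        void acceptUnsubscribeFromMedicalNode(String medicalNodeId);
    }

    static void runExample2(List<BanNodeStub> locatedBans, String medicalNodeId)
            throws InterruptedException {
        String condition = "signal=BloodPressure"; // illustrative subscription condition
        for (BanNodeStub ban : locatedBans) {
            ban.acceptSubscribeFromMedicalNode(medicalNodeId, condition);
        }
        long end = System.currentTimeMillis() + 30L * 60L * 1000L; // half an hour
        while (System.currentTimeMillis() < end) {
            for (BanNodeStub ban : locatedBans) {
                List<String> readings = ban.getConditionalSensorMeasurements(condition);
                archive(readings); // forward to the Medical Data Archiver
            }
            Thread.sleep(60_000); // collect once a minute in this sketch
        }
        for (BanNodeStub ban : locatedBans) {
            ban.acceptUnsubscribeFromMedicalNode(medicalNodeId);
        }
    }

    static void archive(List<String> readings) {
        // Storage is out of scope here; a real Archiver would persist the measurements.
        System.out.println("archived " + readings.size() + " readings");
    }
}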
6. Conclusion
An architecture and an application scenario have been presented, applying to a Grid-enabled biosensor network, based on the Producer/Consumer roles initially elaborated in the Grid Monitoring Architecture specification. A BAN console is considered as the Producer Grid Node, making sensor data available, while other Medical Nodes (the Consumers) occasionally require access to these data. The proposed scheme is considered a realistic one, since the BAN console, unlike individual biosensors, can have enough computational resources to support Grid containers. The notion of Active Service Providers is crucial in this architecture, providing a more generalized framework to meet scalability concerns. The adoption of Grid-based concepts is considered suitable and beneficial, due to the involvement of actors (Medical personnel and BANs) that may be considered as Virtual Organizations, managing their own resources and interacting in a decentralized but controlled environment, matching fundamental features of the Grid.
As a future step, it is intended to test the proposed architecture in a healthcare scenario for monitoring hypertensive patients and patients with sleep disorders. In this case, security issues will be further investigated, as well as roles and accessibility issues, i.e., defining which data are private and which are public. Furthermore, it will be of interest to elaborate more on the use of various intermediaries for better resource management.
References
[1] Ian Foster, Carl Kesselman, Jeffrey M. Nick and Steven Tuecke, "Grid Services for Distributed System Integration", IEEE Computer, June 2002: 37-46.
[2] Foster I, Kesselman C, Tuecke S. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications 2001; 15(3): 200-222.
[3] C. Barratt, et al., "Extending the Grid to Support Remote Medical Monitoring", Proceedings of the 2nd UK e-Science All Hands Meeting, 2003.
[4] OGC Discussion Paper, "Sensor Model Language (SensorML) for In-situ and Remote Sensors", http://www.opengis.org/info/discussion.htm.
[5] A Grid Monitoring Architecture, Global Grid Forum Performance Working Group, http://www-didc.lbl.gov/GGF-PERF/GMA-WG/papers/GWD-GP-16-2.pdf.
[6] HealthGrid White Paper, Chapter 1, Eds. I. Blanquer, V. Hernandez, E. Medico, N. Maglaveras, S. Benkner and G. Lonsdale, http://whitepaper.healthgrid.org/.
From Grid to Healthgrid T. Solomonides et al. (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
Biomedical Informatics Research Network: Building a National Collaboratory to Hasten the Derivation of New Understanding and Treatment of Disease
Jeffrey S. GRETHE a,1, Chaitan BARU b, Amarnath GUPTA b, Mark JAMES a, Bertram LUDAESCHER b, Maryann E. MARTONE a,c, Philip M. PAPADOPOULOS b, Steven T. PELTIER a, Arcot RAJASEKAR b, Simone SANTINI a, Ilya N. ZASLAVSKY b and Mark H. ELLISMAN a,c
BIRN Coordinating Center at the University of California, San Diego (www.nbirn.net)2
a Center for Research on Biological Structure (CRBS)
b San Diego Supercomputer Center (SDSC)
c Department of Neurosciences, UCSD School of Medicine
Abstract. Through support from the National Institutes of Health’s National Center for Research Resources, the Biomedical Informatics Research Network (BIRN) is pioneering the use of advanced cyberinfrastructure for medical research. By synchronizing developments in advanced wide-area networking, distributed computing, distributed database federation, and other emerging capabilities of e-science, the BIRN has created a collaborative environment that is paving the way for biomedical research and clinical information management. The BIRN Coordinating Center (BIRN-CC) is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of biomedical and clinical research being pursued by domain scientists in three neuroimaging test beds. Keywords. Data Integration and Mediation, Distributed Grid Computing, High Speed Networking, Imaging, Neurological Disease
1. Introduction The Biomedical Informatics Research Network (BIRN) is an initiative within the National Institutes of Health that fosters large-scale collaborations in biomedical science by utilizing the capabilities of the emerging cyberinfrastructure (high-speed networks, distributed high-performance computing and the necessary software and data integration capabilities). Currently, the BIRN involves a consortium of 19 universities and 27 research groups (Figure 1) participating in three test bed projects centered around brain imaging of human neurological disease and associated animal models. These groups 1
Correspondence to: Jeffrey S. Grethe, University of California San Diego, 9500 Gilman Drive – M/C 0715, La Jolla, CA, 92093-0715, USA; E-mail:
[email protected]. 2 Support: This work is supported by NIH BIRN-CC Award No. 1-U24 RR019701-01 (NCRR BIRN Coordinating Center Project BIRN005).
Figure 1. The National Institutes of Health (NIH) started the BIRN in September 2001 with three application-oriented test beds and a coordinating center. The above map represents BIRN as of January 2005, which is anticipated to grow with many more sites and to provide a framework for the open and widespread sharing of data.
are working on large-scale, cross-institutional imaging studies on Alzheimer’s disease, depression, and schizophrenia using structural and functional magnetic resonance imaging (MRI). Others are studying animal models relevant to multiple sclerosis, attention deficit disorder, and Parkinson’s disease through MRI, whole brain histology, and high-resolution light and electron microscopy. These test bed projects present practical and immediate requirements for performing large-scale bioinformatics studies and provide a multitude of use cases for distributed computation and the handling of heterogeneous data. The promise of the BIRN is the ability to test new hypotheses through the analysis of larger patient populations and unique multi-resolution views of animal models, through data sharing and the integration of site-independent resources for collaborative data refinement.
The BIRN Coordinating Center (BIRN-CC) is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of the scientific goals pursued by these test bed scientists. These components include high-bandwidth inter-institutional connectivity via Internet2, a uniformly consistent security model, grid-based file management and computational services, software and techniques to federate data and databases, data caching and replication techniques to improve performance and resiliency, and shared processing, visualization and analysis environments.
BIRN intertwines concurrent revolutions occurring in biomedicine and information technology. As the requirements of the biomedical community become better specified through projects like the BIRN, the national cyberinfrastructure being assembled to enable large-scale science projects will also evolve. The BIRN initiative has already established itself as a leading source of information about specific requirements that must be met by this rapidly evolving cyberinfrastructure in order to properly serve the
needs of basic biomedical research and translational or clinical research. As these technologies mature, the BIRN-CC is uniquely situated to serve as a major conduit between the biomedical research community of NIH-sponsored programs and the information technology development programs.
2. BIRN Infrastructure Supporting the Collaborative Projects of the Test Beds Currently in the BIRN project, biomedical researchers are in the process of standardizing imaging protocols, developing and populating databases around their data, defining and utilizing processing pipelines for upload and analysis of data and assembling large imaging caches. The use case scenario in Figure 2 portrays how a single experiment may flow through the BIRN and how it depends on many components of the BIRN cyberinfrastructure. The first step in the experimental process is the collection of the primary research data. Within the BIRN test beds, this data is collected and stored at distributed sites where researchers maintain local control over their own data. In order for researchers to begin collecting and sharing their data it was necessary to provide the hardware and software infrastructure for seamlessly federating the data repositories through the construction of the BIRN virtual data grid. In order to quickly satisfy these data sharing needs, a complete hardware (“the BIRN rack”) and software solution (“the BIRN Software Stack”) were prescribed so that researchers could begin this process as soon as possible. Currently the BIRN virtual data grid is fully operational, allowing all researchers within the BIRN consortium to share their data securely. Once the imaging data has been stored within the BIRN virtual data grid, users from any collaborating site must be able to process, refine, analyze, and visualize the data. In order to satisfy these requirements, the BIRN-CC has made significant progress in implementing an application integration environment. BIRN users are currently able to transparently access and process data through the BIRN portal, a workflow and application integration environment where applications can interact with the BIRN virtual data grid. This allows researchers to visualize and perform analysis on data stored anywhere within the BIRN virtual data grid. However, an important objective of the BIRN initiative is to provide the researcher with seamless access to the computational power required to perform large-scale analyses through the use of complex interactive workflows. The sequence of steps within a typical analysis pathway can consist of multiple workflows (e.g. there might be separate application pipelines for the data pre-processing and post-processing) and interactive procedures (e.g. manual verification of the data pre-processing). This complex interactive workflow may be required to utilize distributed computing resources (e.g. to expedite the processing of multiple data sets) while also allowing the researcher to perform any interactive procedures that are required. As data are processed, intermediary data and ultimately the final results will need to be stored back in the BIRN virtual data grid along with metadata describing the full history (“provenance”) of the data files. Much of the metadata information, along with results from statistical analyses, will be stored in databases being deployed at all test bed sites. As the BIRN cyberinfrastructure matures, the BIRN-CC must continue to enhance and extend the application integration and workflow environment so that researchers are able to more efficiently perform large-scale analyses of their data.
Figure 2. Use case scenario for a complete experiment within the BIRN cyberinfrastructure. The use case follows the data flow all the way from data collection, to the BIRN virtual data grid, through the processing stages, and finally to a query through the data integration layer.
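As an illustration of the provenance step in this use case, the short Python sketch below builds a minimal provenance record for a derived data file before it would be stored back into the virtual data grid. The field names, file names and the helper function are invented for illustration only and do not reflect the actual BIRN metadata schema.

```python
import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def provenance_record(input_files, tool, tool_version, parameters, output_file):
    """Build a toy provenance record for a derived data file.

    Field names are illustrative only; the real BIRN metadata schema is
    not reproduced here.
    """
    return {
        "output": str(output_file),
        "sha256": hashlib.sha256(Path(output_file).read_bytes()).hexdigest(),
        "derived_from": [str(f) for f in input_files],
        "tool": {"name": tool, "version": tool_version, "parameters": parameters},
        "created": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Create a dummy derived file so the example actually runs; the file
    # names and parameters are made up.
    Path("subject01_T1_brain.nii").write_bytes(b"dummy volume data")
    record = provenance_record(
        input_files=["subject01_T1.nii"],
        tool="skull_strip", tool_version="1.2",
        parameters={"threshold": 0.5},
        output_file="subject01_T1_brain.nii",
    )
    print(json.dumps(record, indent=2))
```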
As ever-greater amounts of data are assembled, scientists will need a means by which these data can be queried. The collection and analysis of data within the BIRN research community not only results in large data sets, but also highly heterogeneous ones. Consequently, one of the greatest challenges facing the BIRN-CC is to develop the means for BIRN scientists to query multiple data sources in complex ways that generate new insights. A major research thrust within the BIRN-CC has been to develop a data integration system to handle these queries. As with all other aspects of the
BIRN environment, the data integration architecture is being designed to interact with the application integration and workflow environment. The following sections provide an overview of the components of the BIRN infrastructure and activities of the BIRN Coordinating Center in deploying this infrastructure within the BIRN research community.
3. Overview of the BIRN Infrastructure

To enable the collaborative environment described above, the BIRN Coordinating Center (BIRN-CC) is designing and deploying an architecture based on a flexible large-scale grid model [1–5]. A grid is defined as a collection of network-connected resources that can be accessed, scheduled, and coordinated to solve problems. This collection of network-connected component resources is tightly integrated by an evolving layer of grid middleware technologies. While there are many models for grid computing, the BIRN-CC is building on the most widely accepted model for large-scale computing and is taking advantage of many ongoing research and development efforts in industry as well as efforts funded by the National Science Foundation and other government agencies.

Grids are often characterized as either data-centric or compute-centric, and the BIRN test beds have aspects that require the integration of both distributed data and distributed resources. The strategy for the BIRN (enabled by systems engineering at the BIRN-CC) has been to focus initially on the data-sharing aspects to rapidly enable collaborative investigations. We are now maturing the connections and integration of BIRN data with compute facilities to build a secure, robust, and scalable collaborative environment.

A key tenet in grid computing is to abstract hardware, software, and data resources as sets of independent services. As the BIRN evolves, the BIRN-CC will expose these services through standards-based, well-defined interfaces to its community. This requires enhancing, packaging, configuring, disseminating, and maintaining a comprehensive software stack. An additional requirement that BIRN must address is the many guidelines and regulations for dealing with sensitive data – e.g., human subjects data and raw, pre-publication research data – making limited sharing, encryption and auditing critical within the overall infrastructure. By using widely accepted grid middleware, the critical and essential security and integrity mechanisms are integrated at all levels of the software stack. The key components of this stack are described in the following sections.

3.1. Grid Middleware

Grid middleware enables the assembly of geographically disparate resources into an application-specific virtual resource. The low-level grid software layer for BIRN is built upon a collection of community-accepted software systems and services distributed as part of the NSF Middleware Initiative (NMI). A key component of the NMI distribution, the Globus Toolkit [2–4], supplies the essential grid services layer that includes authentication, encryption, resource management, and resource reporting. Globus provides the computational and security components that allow test bed researchers to launch long-running jobs to any available computing resources on the
BIRN grid without having to worry about where the job runs, which communications protocols are supported, and how to sign on to different computing systems. Grid middleware also provides the infrastructure to interconnect independent data collections, a fundamental requirement of scientific collaboration. Within the BIRN test beds, this need was met through the implementation of the BIRN Virtual Data Grid (BVDG), which:
• Provides a software “fabric” for seamlessly federating data repositories across the geographically distributed BIRN sites;
• Facilitates secure access and maximizes data input/output performance across the network; and
• Facilitates the collaborative use of domain tools and flexible processing/analysis workflows.
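The core abstraction behind this federation — a logical namespace whose entries map onto physical replicas at distributed sites — can be illustrated with the toy Python catalog below. This is only a sketch of the idea; it is not the interface of the Storage Resource Broker described next, and all site names and paths are invented.

```python
class ToyVirtualDataGrid:
    """Toy logical-to-physical catalog; illustration only, not the SRB API."""

    def __init__(self):
        # logical path -> list of physical replicas as (site, physical path)
        self._catalog = {}

    def register(self, logical_path, site, physical_path):
        self._catalog.setdefault(logical_path, []).append((site, physical_path))

    def locate(self, logical_path, preferred_site=None):
        replicas = self._catalog.get(logical_path, [])
        if not replicas:
            raise FileNotFoundError(logical_path)
        if preferred_site:
            for site, path in replicas:
                if site == preferred_site:
                    return site, path
        return replicas[0]

grid = ToyVirtualDataGrid()
grid.register("/birn/siteA/subject01/T1.nii", "siteA", "/data/raw/T1_0001.nii")
grid.register("/birn/siteA/subject01/T1.nii", "siteB", "/cache/mirror/T1_0001.nii")
print(grid.locate("/birn/siteA/subject01/T1.nii", preferred_site="siteB"))
```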
In architecting and building the software framework of the BVDG, the Storage Resource Broker [6] (SRB – developed at the San Diego Supercomputer Center) was selected as the most immediately amenable and mature technology for connecting distributed data to computational resources, users, and tools. The SRB provides a uniform interface for connecting to heterogeneous data resources over a network and for creating and managing these distributed data sets. With this software, data files stored across a distributed environment can be seamlessly managed and manipulated as a virtual file system, where a file’s logical location within the virtual file system is mapped to its actual physical location. The BIRN Virtual Data Grid now handles over 5 million files distributed across all of the BIRN test bed sites. Atop these comprehensive middleware services, database mediation and integration tools are layered along with domain-specific tools and the BIRN community access layer, the BIRN Portal.

3.2. Data Integration and Information Mediation

The neuroscience research community deals not only with large distributed databases, but also with highly heterogeneous sets of data. A query may need to span several relational databases, ontology references, spatial atlases, and collections of information extracted from image files. To that end, the BIRN-CC has deployed a data source mediator that enables researchers to submit these multi-source queries and to navigate freely between distributed databases. The mediation architecture for BIRN builds upon our work in knowledge-guided mediation for integration across heterogeneous data sources [7–8]. In this approach, the mediator uses additional knowledge captured in the form of ontologies, spatial atlases and thesauri to provide the necessary bridges between heterogeneous data.

Unlike a data warehouse, which copies (and periodically updates) all local data to a central repository and integrates local schemas through the repository’s central schema, the mediator approach creates the illusion of a single integrated database while maintaining the original set of distributed databases. This is achieved via so-called integrated (or virtual) views – in a sense the “recipes” describing how the local source databases can be combined to form the (virtual) mediated database. It is the task of the mediator system to accept queries against the virtual views and create query plans against the actual sources whose answers, after some post-processing, are equivalent to what a data warehouse would have produced. The main advantages of a mediator system over a data warehouse are:
• User queries automatically retrieve the latest data from the source databases;
• There is no need to host, maintain, and keep updated a central repository;
• A mediator is flexible in that a new source can join a mediated system simply by registering its schema, and integrated views can be added as needed; and
• Sites maintain the autonomy and ownership of their data, because the mediator is just another client to the source databases.
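The following toy Python sketch illustrates the mediator idea on two invented sources: an integrated (virtual) view is just a recipe that joins the autonomous sources at query time, so a query against the view always sees their latest data. The table contents and field names are made up for illustration and do not correspond to the actual BIRN mediator or its schemas.

```python
# Toy mediator: an integrated (virtual) view over two autonomous sources.
# The sources stay where they are; the "view" is just a recipe for combining them.

site_a_subjects = [  # e.g. a clinical table held at one site
    {"subject_id": "s01", "diagnosis": "control"},
    {"subject_id": "s02", "diagnosis": "schizophrenia"},
]
site_b_volumes = [   # e.g. morphometry results held at another site
    {"subject_id": "s01", "hippocampus_ml": 3.9},
    {"subject_id": "s02", "hippocampus_ml": 3.4},
]

def integrated_view():
    """Join the two sources on the fly, as a mediator answering a view query."""
    volumes = {row["subject_id"]: row for row in site_b_volumes}
    for subj in site_a_subjects:
        vol = volumes.get(subj["subject_id"])
        if vol:
            yield {**subj, **vol}

# A query against the virtual view always retrieves the sources' latest data.
print([r for r in integrated_view() if r["diagnosis"] != "control"])
```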
3.3. The BIRN Portal

While computer technology is what makes large-scale data analyses possible, most biologists are far more conversant with their own field of science than with that technology. Therefore, a critical activity area for the BIRN-CC has been the development of an effective and intuitive interface to the BIRN cyberinfrastructure that facilitates and enhances the process of collaborative scientific discovery for the test bed participants. Ubiquitous access for all users has been accomplished by deploying the BIRN Portal, which:
• Provides transparent and pervasive access to the BIRN cyberinfrastructure, requiring only a single username and password;
• Provides a scalable interface for users of all backgrounds and levels of expertise;
• Provides customized “work areas” that address the common and unique requirements of test bed groups and individual users;
• Has a flexible architecture built on emerging software standards, allowing for transparent access to sophisticated computational and data services; and
• Requires a minimum amount of administrative complexity.
More than a simple Web interface, the BIRN Portal environment is designed to provide the integrated collection of tools, infrastructure, and services that BIRN test bed researchers and database users need in order to perform comprehensive and collaborative studies from any location with Internet access. The BIRN Portal is built upon, and is driving the production of, a collection of services that enable the creation of Web-based applications with secure and transparent access to grid infrastructure and resources. These portal services were first developed under NSF/NPACI support [9–10] and continue to be developed as a general architecture for grid-enabling science communities under the umbrella of the NSF Cyberinfrastructure Initiative and the NIH BIRN initiative.
4. BIRN Systems Integration

BIRN researchers need to rely on an ever-increasing collection of software, hardware, and policy resources to effectively and securely share data and collaborate with each other. Logically, BIRN software components are designed as interoperable, network-accessible services that form the building blocks of complete information-centric workflows. Service components include: user identification (certification), user authorization, digital document signing, data location, data mediation, data retrieval, workflow engines and domain-specific applications. Services, once defined, need to be documented, packaged, installed, configured and updated on appropriate physical hardware
located throughout the BIRN test beds and the BIRN-CC. The BIRN-CC is tasked with defining, integrating, packaging, fielding, and updating a complete BIRN-focused cyberinfrastructure. To effectively address and manage this expanding complexity, the BIRN-CC is formalizing and expanding the process of integration, testing, deployment, and maintenance of its integrated software stack.

Twice yearly, in April and October, the BIRN-CC releases (and deploys on the BIRN physical hardware infrastructure) a complete and integrated system. This integrated software stack includes all the components described above (i.e. grid middleware for security, data and computation, data mediation and integration, and the BIRN Portal), as well as the underlying system software and the BIRN Bioinformatics packaging of scientific tools and biomedical applications that are in use by and have been defined by the BIRN scientific test beds (e.g. LONI Pipeline [11], 3D-Slicer [12], AFNI [13], AIR [14], etc.). These biomedical applications are packaged and then delivered in the Red Hat Package Manager (RPM) format, a de facto standard packaging format. The BIRN Coordinating Center will coordinate and produce these bi-annual software releases in collaboration with the BIRN test beds to optimize the balance of functionality, utility, and robustness. The BIRN-CC has produced internal releases of software and has deployed three successive upgrades of the BIRN Racks during the past two years. This fixed release schedule of April and October incorporates a comprehensive process for the migration from alpha through beta releases to a robust production release. In addition to providing a foundation for a comprehensive testing process, the regular software release schedule provides a well-defined backdrop for application developers that promotes robust distribution and deployment of the BIRN software infrastructure.

In order to deploy this integrated software stack efficiently across the entire BIRN infrastructure, the BIRN-CC has adopted the NPACI Rocks Cluster Toolkit [15], a set of open-source enhancements for building and managing Linux®-based clusters. While BIRN racks are not traditional high performance computing clusters (each node in a BIRN rack has a specialized function), Rocks has enabled the BIRN-CC to automate the loading of its software stack so that, after a few configuration parameters are provided, each resource is deployed in an automated fashion. This ensures consistent configuration across all grid sites, provides greater reliability, simplifies the diagnostic efforts to correct problems, and thus increases the availability of the BIRN resources. As the BIRN continues to grow, the Rocks Cluster Toolkit will continue to evolve to support more platforms and a greater heterogeneity of site configurations.
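To illustrate the kind of consistency check that such RPM-based packaging makes possible, the hypothetical Python sketch below queries the local RPM database for a list of expected packages on a node. The package names are invented (the actual BIRN software stack manifest is not reproduced here); only the standard `rpm -q` query command is assumed.

```python
import subprocess

# Hypothetical package names; the real BIRN stack manifest is not listed here.
EXPECTED_RPMS = ["globus-toolkit", "srb-client", "birn-portal"]

def missing_packages(expected=EXPECTED_RPMS):
    """Return the expected RPMs that are not installed on this node."""
    missing = []
    for pkg in expected:
        try:
            # `rpm -q <name>` exits non-zero when the package is not installed.
            result = subprocess.run(["rpm", "-q", pkg], capture_output=True)
            installed = (result.returncode == 0)
        except FileNotFoundError:       # not an RPM-based system
            installed = False
        if not installed:
            missing.append(pkg)
    return missing

if __name__ == "__main__":
    print("missing packages:", missing_packages())
```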
5. The Role of the BIRN Coordinating Center

The expertise to build the cyberinfrastructure required to deliver performance, reliability, quality of service, scalability, and security for the BIRN test beds lies largely outside the interests and capabilities of the test bed domain scientists. The BIRN Coordinating Center was established to develop, implement and support the information infrastructure necessary to achieve large-scale data sharing among these biomedical researchers. Briefly, the responsibility of the BIRN-CC spans:
• The development of tools and techniques for integrating data from multiple sites and across biological domains;
• Overall system architecture;
• Adaptation of existing hardware infrastructure and software;
• Expansion and integration of specific biomedical tools;
• Development of new and novel software techniques;
• Facilitation of collaborative group communication; and
• Daily operation and monitoring of the distributed BIRN infrastructure.
In order to fulfill its mission, the BIRN-CC comprises a set of unique and well-established partnerships between computer scientists, domain scientists and engineers. These partnerships address the large array of technical, policy, and architectural issues that must be resolved to enable a new suite of Information Technology (IT)-supported analysis tools. An essential feature of the BIRN is the collaboration of biomedical scientists and computer scientists as equal partners. The variety of perspectives (e.g., biological, policy, information technology) and the high level of interaction among these groups inform every step of the design and implementation of this versatile information technology infrastructure for biomedical research.
6. Conclusion

As the BIRN-CC continues to evolve and extend the infrastructure described in this paper, the development and incorporation of new technologies will continue to be tightly coupled to the requirements of and the lessons learned from the test bed projects. The BIRN-CC will continue to mature the existing environment for collaborative data discovery and will address scalability by evolving the BIRN towards an open software/services solution. These activities will target the refinement of key components of the layered BIRN architecture to increase the level of interactivity and interoperability. As such, these components will be easily refined, upgraded, or replaced as needed throughout the life of the program without impacting the entire software stack. Through this evolution, the overall BIRN cyberinfrastructure will further exemplify the service-oriented architecture that is being promoted as a core design principle for future grid computing, and it will more effectively shield the test beds from the rapid pace of IT innovation.
References
[1] Basney, J., and Livny, M. (1999) Improving goodput by co-scheduling CPU and network capacity. International Journal of High Performance Computing Applications, 13.
[2] Foster, I., Kesselman, C., Tsudik, G., and Tuecke, S. (1998) A Security Architecture for Computational Grids. ACM Conference on Computers and Security, 83-91.
[3] Foster, I. and Kesselman, C., eds. (1999a) The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 259-278.
[4] Foster, I., and Kesselman, C. (1999b) Globus: A Toolkit-Based Grid Architecture. In Foster, I. and Kesselman, C. eds. The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 259-278.
[5] Grimshaw, A.S., Wulf, W.A., and the Legion team. (1997) The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, 40(1): 39-45.
[6] Rajasekar, A., Wan, M., Moore, R., Schroeder, W., Kremenek, G., Jagatheesan, A., Cowart, C., Zhu, B., Chen, S-Y., Olschanowsky, R. (2003) Storage Resource Broker - Managing Distributed Data in a Grid. Computer Society of India Journal, Special Issue on SAN, 33(4): 42-54.
[7] Gupta, A., Ludaescher B., Martone M.E. (2001) An Extensible Model-Based Mediator System with Domain Maps. Proceedings of ICDE Demo Sessions.
[8] Ludaescher B., Gupta A., Martone M.E. (2000) Model-Based Information Integration in a Neuroscience Mediator System. Proceedings of the 26th International Conference on Very Large Data Bases, 639-642.
[9] Peltier S.T., Lin A.W., Lee D., Mock S., Lamont S., Molina T., Wong M., Dai L., Martone M.E., Ellisman M.H. (2003) The Telescience Portal for advanced tomography applications. Journal of Parallel and Distributed Computing, 63(5): 539-550.
[10] Thomas M., Mock S., Boisseau J., Dahan M., Mueller K., Sutton D. (2001) The GridPort Toolkit Architecture for Building Grid Portals. Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing. Aug 2001.
[11] Rex, D.E., Ma, J.Q., Toga A.W. (2003) The LONI Pipeline Processing Environment. Neuroimage. 19(3):1033-48.
[12] Gering, D., Nabavi, A., Kikinis, R., Grimson, E.L., Hata, N., Everett, P., Jolesz, F., Wells III, W. (1999) An Integrated Visualization System for Surgical Planning and Guidance using Image Fusion and Interventional Imaging. In Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cambridge England, 809-819.
[13] Cox, R.W. (1996) AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29:162-173.
[14] Woods, R.P., Grafton, S.T., Holmes, C.J., Cherry, S.R., Mazziotta, J.C. (1998) Automated image registration: I. General methods and intrasubject, intramodality validation. Journal of Computer Assisted Tomography, 22:139-152.
[15] URL: http://www.rocksclusters.org/Rocks/, January 2004.
Part 3 Current Projects
From Grid to Healthgrid T. Solomonides et al. (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
ProGenGrid: A Grid-Enabled Platform for Bioinformatics Giovanni ALOISIO 1 , Massimo CAFARO, Sandro FIORE and Maria MIRTO CACT/ISUFI and SPACI Consortium, University of Lecce, Italy Abstract. In this paper we describe the ProGenGrid (Proteomics and Genomics Grid) system, developed at the CACT/ISUFI of the University of Lecce which aims at providing a virtual laboratory where e-scientists can simulate biological experiments, composing existing analysis and visualization tools, monitoring their execution, storing the intermediate and final output and finally, if needed, saving the model of the experiment for updating or reproducing it. The tools that we are considering are software components wrapped as Web Services and composed through a workflow. Since bioinformatics applications need to use high performance machines or a high number of workstations to reduce the computational time, we are exploiting a Grid infrastructure for interconnecting wide-spread tools and hardware resources. As an example, we are considering some algorithms and tools needed for drug design, providing them as services, through easy to use interfaces such as the Web and Web service interfaces built using the open source gSOAP Toolkit, whereas as Grid middleware we are using the Globus Toolkit 3.2, exploiting some protocols such as GSI and GridFTP. Keywords. Bioinformatics, Drug design, Web Services, Computational Grid, Grid Portal, Globus Toolkit
1. Introduction

Bioinformatics is the study of how information is represented and transmitted in biological systems, starting at the molecular level. In this discipline, as in many new fields, researchers and entrepreneurs at the fringes – where technologies from different fields interact – are making the greatest strides. For example, techniques developed by computer scientists enabled researchers at Celera Genomics, the Human Genome Project consortium, and other laboratories around the world to sequence the nearly 3 billion base pairs of the roughly 40,000 genes of the human genome. This feat would have been virtually impossible without computational methods. The increasing availability of genetic information and the knowledge of its relation to hereditary diseases open a new frontier in the medical field, with particular regard to a new form of care oriented towards customized drug therapy.

1 Correspondence to: Giovanni Aloisio, Center for Advanced Computational Technologies/ISUFI and SPACI Consortium, University of Lecce, via per Monteroni, 73100 Lecce, Italy. Tel.: +39 0832 297221; Fax: +39 0832 297279; E-mail:
[email protected].
This implies the exchange of information among various scientists and very large analyses involving searches in numerous biological data banks, powerful tools for simulating these operations, and computational power to reduce their execution time. Indeed, these biological tools involve the screening of millions of ligands (candidate drugs) in the drug design case, and large amounts of data in the pattern matching of entire data banks such as Genbank [1]. These data, stored in different repositories that are generally geographically spread, are heterogeneous, covering genomic [2], cellular [3], structural [4], phenotype [5] and other types of biologically relevant information [6], and often describe the same object in different representations – for example Swiss-Prot [7], where a protein is represented just as an amino acid sequence, or the Protein Data Bank (PDB) [4], which contains its 3D structure. The semantic relation among these data repositories is a key factor for integration in bioinformatics because it could allow a single front end for accessing them, needed in the majority of biological applications. An ontology could help here to localise the right type of concept to be searched for, as opposed to identifying a mere label naming a search table. It includes definitions of basic concepts in the domain and relations among them, which should be interpretable both by machines and humans. Moreover, these repositories are often huge and need to be updated (e.g., for annotations or the addition of new entries).

To date, many tools exist for simulating complex “in silico” experiments, that is, simulations carried out using computational power, as opposed to “in vitro” or “in vivo” experiments, which are performed respectively outside a living organism or cell and within an organism. As said before, these tools need to access heterogeneous data banks distributed over a wide area, and in particular they need a support infrastructure for successfully obtaining a result [8]. These tools are free and available on the Internet, and there are many software packages, such as EMBOSS [9] and SRS [10], for accessing different data banks. However, there is no data access service specialized for accessing biological data banks. Such a service should contain different wrappers to obtain a given piece of information, because each data bank utilizes its own scheme, syntax and semantics.

A data access service is involved in many biological experiments where workflow techniques are needed to assist the scientists in their design, execution and monitoring. Workflow Management Systems (WFMSs) support the enactment of processes by coordinating the temporal and logical order of elementary process activities and supplying the data, resources and application systems necessary for the execution. The high performance computing and collaboration applications involved require large computing power and suitable hardware and software resources to provide results quickly [11]. The Grid [12] framework is an optimal candidate for WFMSs because it offers the computational power for high throughput applications and basic services such as efficient mechanisms for transferring huge amounts of data and exchanging them over secure channels. In particular, the Life Science Grid Research Group [13], established under the Global Grid Forum, underlined how a Grid framework (offering services and standards), enhanced through specific services, could satisfy bioinformatics requirements.
Indeed, some emerging Bioinformatics Grids (the term refers to Grid platforms devoted to solving bioinformatics applications), such as Asia Pacific BioGRID [14] and myGrid [15], aim to allow:
• deployment, distribution and management of needed biological software components;
• harmonized, standard integration of various software layers and services;
• powerful, flexible policy definition, control and negotiation mechanisms for a collaborative grid environment.

So, bioinformatics platforms need to offer powerful, high-level modeling techniques to ease the work of e-scientists, and should exploit Computational Grids transparently and efficiently. ProGenGrid (Proteomics and Genomics Grid) is a software platform that integrates biological databases and analysis and visualization tools available as Web Services, supporting complex “in silico” experiments. The choice to couple Web Services [16] and Grid technologies produces components, independent of programming language and platform, that exploit a grid infrastructure. The proposed solution is based on the following key approaches: web/grid services, workflow, ontologies and data integration through the Grid; in this paper we focus on the use of these components for drug design.

The outline of this paper is as follows. In Section 2 we describe related work. Section 3 introduces the ProGenGrid architecture and its main services, whereas in Section 4 we describe the drug design problem and our solution. We draw the conclusions in Section 5, highlighting future work.
2. Related Work

Biogrids are increasingly important in the development of new computing applications for the life sciences and in providing immediate medical benefits to individual patients, and even to those only at risk of getting sick. Some projects are meant to solve specific biomedical issues, such as the eScience Diagnostic Mammography National Database (eDiaMoND) project (www.ediamond.ox.ac.uk), which aims at developing a grid-enabled database of annotated mammograms, while others, such as myGrid and the Asia Pacific BioGrid, are building grid frameworks for using bioinformatics tools and handling genotype data. myGrid is a UK research project providing high-level grid services for bioinformatics applications for data and application integration. myGrid leverages a service-based architecture where biological resources are modelled as services. These services include resource discovery, workflow enactment, distributed query processing, notification, personalization and provenance. The Asia Pacific BioGrid aims at promoting applied IT research specialized in medical science and biology, building a specific set of Globus-based components for various bioinformatics and biological applications such as Emboss, Embassy, etc. These systems require bioinformaticians to work with complicated grid environments, which are sometimes challenging even for seasoned computer scientists. Therefore, it is crucial to provide easy-to-use tools that do not burden bioinformaticians, but benefit from state-of-the-art technologies.

Grid-enabled workflow management tools are crucial for the successful building and deployment of bioinformatics workflows. Pegasus [17] is a workflow management system designed to map abstract workflows onto Grid resources, using Globus RLS [18], Globus MDS [19] and Globus MCS [20] to determine the available resources and data. Using Pegasus requires producing an abstract workflow in DAX format. The disadvantage of Pegasus is that it lacks interfaces to some de facto standard bioinformatics tools and it does not provide a workflow checking tool. Another grid-based
workflow management project is Proteus [21]. Proteus uses ontologies to classify software and data tools, and it also employs workflows to provide an easy problem-solving environment for bioinformaticians. It uses the PEDRO system for modelling experiments. The disadvantage of Proteus lies in its applicability to the proteomics domain only. Also, it does not utilize web services. What differentiates our work from Pegasus is that the ProGenGrid [22] Workflow has an editor for composing and validating the experiment in the modelling phase; it uses the GRB (Grid Resource Broker) libraries [23], which are Globus based, for job execution on the grid, and the iGrid information service [24], a Web service based system, for resource and web service discovery. The main difference with respect to Proteus lies in our Web service extensions to existing bioinformatics tools so that these can, without user intervention, seamlessly connect to grids. This approach enables legacy applications to use grid processing power without any significant change.
3. ProGenGrid: Services-Layered Architecture

ProGenGrid is a software platform exploiting a Service Oriented Architecture (SOA) that wraps programs and data as Web Services and offers tools for their composition, to provide ease of use, reuse and better quality of results. As shown in Figure 1, the services offered by our system are:
• Generic Grid Services (GGSs), offered by the Globus Toolkit [25], the “de facto” standard of many Grid projects. It offers basic services such as GRAM [26], GSI [27] and MDS. Currently, we are using GT version 3.2 pre-OGSI.
• Data Grid Services (DGSs), comprising the GridFTP [28], Storage Resource Manager (SRM) and Data Access and Integration (DAI) services.
• Semantic Grid Services (SGSs), comprising ontologies and a metadata configuration repository needed for modelling the data and the software involved. In our experimental project, the Resource Description Framework (RDF) is used.
• Application Service (AS), representing a single bioinformatics tool that has a Web service interface in our system.
• Workflow Service (WS), allowing the composition of complex activities for the design, scheduling and control of bioinformatics applications.

The Web Service interface is based on the gSOAP toolkit [29]. Our goal is to implement the Web Service servers using the C language to guarantee efficiency, and the clients in Java for the portability of our services.

3.1. Generic Grid Services

Some basic services are available in the Globus Toolkit, which represents one of the most widely used middleware toolkits in Grid environments. This toolkit offers services such as MDS, an information service for retrieving static and dynamic information about the resources of a grid; GRAM (Globus Resource Allocation Manager), which simplifies the use of remote systems by providing a single uniform interface for requesting and using remote system resources for the execution of “jobs”; and GSI (Globus Security Infrastructure), which is based on public-key cryptography and can therefore be configured to guarantee confidentiality, integrity and authentication.
Figure 1. Services-layered ProGenGrid Architecture.
In our system, we are using the iGrid system, a relational information system realized by our research group within the European GridLab Project (Work Package 10) and based on the Web service approach, for retrieving the information needed by the scheduling and controlling mechanisms of our workflow service. Regarding the submission of jobs, we are using the GRB library, which wraps the Globus APIs and facilitates the access to and management of a set of grid resources. Finally, we are exploiting the GSI protocol through our modular plug-in for gSOAP [30], which utilizes the GSS API to guarantee access for trusted users to our services. The GSI plug-in for gSOAP is an open source solution to the problem of securing Web services in grid environments. The latest version of this software provides the following features:
• based on the GSS API for improved performance;
• extensive error reporting related to the GSS functions used;
• debugging framework;
• support for both IPv4 and IPv6;
• support for development of both web services and clients;
• support for mutual authentication;
• support for authorization;
• support for delegation of credentials;
• support for connection caching.
The plug-in has been developed in the context of the European GridLab project to provide GSI support for the iGrid Web service.

3.2. Data Grid Services

One of the major goals of grid technology is to provide efficient access to data. Grids provide access to distributed computing and data resources, allowing data-intensive applications
to improve data access, management and analysis significantly. Grid systems responsible for tackling and managing large amounts of data in geographically distributed environments are usually named data grids. Data Grids [31] are a fundamental component in bioinformatics, since this field is characterized by different kinds of data formats, and data from heterogeneous sources make efficient data management more difficult. This is due in part to the following reasons:
• no single database is able to hold all of the data required by an application;
• multiple databases, in general, do not belong to the same institution and are not physically co-located;
• to increase performance and the degree of fault tolerance, replicas of a whole dataset or a subset are generally required.

Some specific services for data grids are GridFTP, for efficient transfer (through parallel streams) of huge amounts of data; SRM (Storage Resource Manager), responsible for maintaining replica consistency; and DAI (Data Access and Integration), which is required to access multiple databases and data holders. Regarding data transfer, our approach involves the use of both DIME (Direct Internet Message Encapsulation) and GridFTP. Streaming DIME attachment transfer (as per the DIME specification [32]) is provided by the gSOAP toolkit and allows streaming binary data of practically unlimited size while preserving XML interoperability. For the GridFTP service we have built a Web service method that implements the Globus FTP Client API. This method provides features similar to those offered by the globus-url-copy command-line tool, but with added characteristics: while globus-url-copy allows transferring data efficiently from/to a GridFTP server or between two GridFTP servers, it does not allow verifying the existence of a file or directory, creating or removing a directory, obtaining a file listing, etc. These features are instead offered by our service (a minimal client-side sketch of the basic transfer case is given at the end of this subsection). Regarding the Storage Resource Manager (SRM) service, we are planning to include in our system two important services: a Data Replication Service (DRS) and a Data Consistency Service (DCS), respectively for managing the replicas and maintaining data consistency in the Grid. These services will be implemented jointly with another project, GRelC [33] (Grid Relational Catalog), developed at the University of Lecce, whose goal is to supply applications with a bag of services for data access and integration in a grid environment. In particular, it provides a set of primitives to transparently access and interact with different data sources.

Data access and integration (DAI) is another important service for our platform. In particular,
• the characteristics of biological data (e.g. the fact that a protein can be studied considering different representations, such as primary, secondary or tertiary structure, and different annotation data),
• the peculiar organization and heterogeneity of biological databases (e.g., typical databases such as Swiss-Prot and PDB are flat text files organized as a large set of compressed files), and above all,
• the data requirements of bioinformatics tools (e.g. the BLAST [34] computation requires access to the protein sequences, whereas the Rasmol [35] visualization tool requires the secondary structures, and other text mining tools require the text annotation data)
call for a specialized data access service taking these requirements into account.
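As a minimal client-side sketch of the basic transfer case mentioned above, the Python wrapper below simply shells out to globus-url-copy. It assumes the Globus client tools are installed and that a valid GSI proxy already exists; the URLs in the commented example are placeholders. It does not implement the extended features (existence checks, directory operations, listings) offered by the ProGenGrid Web service method.

```python
import subprocess

def gridftp_copy(src_url, dst_url):
    """Minimal wrapper around globus-url-copy for a single transfer.

    Assumes the Globus client tools are installed and a valid GSI proxy
    exists (e.g. created beforehand with grid-proxy-init).
    """
    subprocess.run(["globus-url-copy", src_url, dst_url], check=True)

# Example (placeholder URLs): stage a flat-file data bank from a GridFTP
# server to local disk.
# gridftp_copy("gsiftp://grid.example.org/data/uniprot_sprot.dat",
#              "file:///tmp/uniprot_sprot.dat")
```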
ProGenGrid offers two main services: data integration and data federation (See Figure 2).
Figure 2. Data Integration and federation in the ProGenGrid system.
1. Data integration is responsible for mapping high-level requests (user requests) to low-level queries (i.e. SQL queries). Traditional data integration systems [37] are characterized by an architecture based on one or more schemas and a set of sources. The sources contain the real data, while the schema provides an integrated, virtual view of the underlying sources. In our system the schema is provided by a component called MOR (Metadata Ontology Repository), which contains semantic information about proteomics and genomics data sources. For instance, part of this information is related to logical links between fields of tables belonging to different databases, physical and/or logical replicas of the same dataset, and so on. This level provides a first step in the data virtualisation process, structuring or restructuring data coming from different sources, and so supporting the management of complex queries. At the lowest level, access to physical data sources is granted by specific wrappers created at run time.
2. Data federation is responsible for allowing interconnection between applications and data sources. A federated database management system (FDBMS) [36] is a collection of cooperating but autonomous component database systems (DBs). The software that provides controlled and coordinated manipulation of the component DBs is called an FDBMS. The DBMS of a component DB can be centralized or distributed and can differ in such aspects as data models, query languages, and transaction management capabilities. It often works with brokers, which must bridge the data sources and the requester. This process provides local references to data sources and basic support for data result aggregation; a toy illustration of this brokering follows.
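The toy Python sketch below illustrates the federation idea: each wrapper hides a source's native organization behind a small adapter, and a broker fans a request out to the wrappers and aggregates the normalized results. Source contents, field names and the wrapper functions are invented for illustration; the real federation broker is based on an extension of the GRelC Server.

```python
# Toy federation broker: each wrapper hides a source's native format and the
# broker fans a request out and aggregates the normalised results.
# All contents and field names below are invented for illustration.

def swissprot_wrapper(accession):
    flat = {"P69905": "MVLSPADKTNVKAAWGKVGA..."}          # flat-file style source
    seq = flat.get(accession)
    return [{"source": "Swiss-Prot", "accession": accession, "sequence": seq}] if seq else []

def pdb_wrapper(accession):
    table = {"P69905": ["1A3N", "1BZ0"]}                  # relational-style source
    return [{"source": "PDB", "accession": accession, "structure_id": s}
            for s in table.get(accession, [])]

def federated_query(accession, wrappers=(swissprot_wrapper, pdb_wrapper)):
    results = []
    for wrapper in wrappers:          # each source stays autonomous
        results.extend(wrapper(accession))
    return results

print(federated_query("P69905"))
```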
It is worth noting here that in both services requests are defined in terms of queries at different abstraction levels (initially, we planned to use SQL, but we are considering other options, providing a request virtualisation layer). To date, the integration service in our infrastructure is not completely developed (we are planning to integrate the following data banks: EMBL, UTR, UNIPROT, REFSEQ and REFSEQP), whereas the federation broker is based on an extension of the GRelC Server, built to satisfy specific requirements in bioinformatics. Finally, for high throughput applications we are investigating an approach based on our mechanism called SplitQuery, which foresees an efficient fragmentation of the biological data set and a protocol allowing the involved applications to retrieve the fragments, as described in [38].

3.3. Semantic Grid Services

A common issue in bioinformatics is related to data heterogeneity, i.e., the possibility for an object (such as a gene or protein) to have multiple meanings. One term with multiple meanings is protein function (biochemical function, e.g. enzyme catalysis; genetic function, e.g. transcription repressor; cellular function, e.g. scaffold; physiological function, e.g. signal transducer). If a user queries a database using an ambiguous term, she has full responsibility to verify the semantic congruence between what she asked for and what the database returned. Even if a semantic incompatibility is known, it still must be sorted out for each search result. Another reason demanding standardised nomenclatures in biology is the merging of different subfields that historically started rather independently but now, with a more integrated approach to biology, must be closely integrated. This concerns, e.g., genetics, protein chemistry and pharmacology. Since these areas have grown quite distinct terminologies, large pharmaceutical companies especially feel an urgent need to harmonise the technical language in order to store their corporate knowledge in a central, unified database. Ontologies could help here to localize the right type of concept to be searched for, as opposed to identifying a mere label naming a search table.

In the ProGenGrid system, the ontology is used at two levels:
• Workflow Validation, during the composition of tasks without known application details (input type, data type, etc.) and for the conversion of input data, if needed. In particular, we classified ProGenGrid components as: data banks, bioinformatics algorithms, graphics tools, drug design tools and input data types. We have used the RDF schema for processing metadata in order to provide interoperability between applications and to enable automated processing of web resources. The syntax of RDF uses XML: one of the goals of RDF is to make it possible to specify semantics for data based on XML in a standardized, interoperable manner. This first ontology, written in DAML+OIL [39], has been stored in a relational database. A toy compatibility check in this spirit is sketched after this list.
• Data Access, in particular to guarantee: (i) semantic integration of different data sources, as explained in the previous subsection (currently we are using GeneOntology [40]); (ii) analysis of stored output data coming from different experiments.
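The sketch below shows, in Python, the kind of type-compatibility rule that such an ontology can support during workflow composition. The tool names and type labels are purely illustrative; in the real system this knowledge is expressed in DAML+OIL/RDF and stored in a relational database, not in a Python dictionary.

```python
# Toy compatibility check in the spirit of ontology-driven workflow validation.
# Tool names and type labels are illustrative only.

TOOL_TYPES = {
    "BLAST":  {"in": "protein_sequence", "out": "alignment_report"},
    "DOCK":   {"in": "receptor_and_ligand_3d", "out": "docking_score"},
    "Rasmol": {"in": "structure_3d", "out": "rendering"},
}

def validate_pipeline(steps):
    """Return the (producer, consumer) pairs whose declared types do not match."""
    errors = []
    for producer, consumer in zip(steps, steps[1:]):
        if TOOL_TYPES[producer]["out"] != TOOL_TYPES[consumer]["in"]:
            errors.append((producer, consumer))
    return errors

# Flagged: an alignment report is not a 3D structure.
print(validate_pipeline(["BLAST", "Rasmol"]))
```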
3.4. Application Service

This service represents the Web service interface to application tools. It consists of an interface utilized by the user or another application to insert the parameters needed for the execution of a specific application. Since we would like to offer our services through a Grid Portal, this service represents the Web interface for each Web service component running on remote grid machines. Regarding the discovery of these services, we use the iGrid system for registering and looking up the bioinformatics tools available through the system. In particular we use the following methods:
• gridlab__register_webservice: for registering information like service name, owner, description, access URL, WSDL location and keywords;
• gridlab__lookup_webservice: for looking up a service on the basis of different search criteria.

3.5. Workflow Service

Workflow is a critical part of emerging Grid Computing environments and captures the linkage of constituent services together in a hierarchical fashion to build larger composite services. WorkFlow Management Systems (WFMSs) support the enactment of processes by coordinating the temporal and logical order of the elementary process activities and supplying the data and resources necessary for execution. A WFMS:
• allows a clear business process (biological experiment) definition and reproducibility, because the process, the input parameters and the program versions used are clearly defined and do not have to be redefined each time;
• performs complex computations which are executed repeatedly by one or more users. It automatically executes large computations as needed for automated optimization or robustness evaluation.

We use workflow technology to model and design complex “in silico” experiments composed of different Web/Grid services. Figure 3 illustrates the processing steps of our Workflow. The user composes her own experiments using our editor. When the editor starts, it queries a Metadata Ontology Repository (MOR) to discover the available bioinformatics tools, data banks and graphics tools, modeled through an ontology. Since we are considering such components as Grid services, we plan to extend the iGrid Web service to manage the registration and retrieval of such Grid services. Discovered components are made available to a semantic editor that allows designing an experiment (the activities are modeled using UML). During workflow creation the workflow model is validated through rules derived from the metadata and ontology. The workflow model is translated into an “execution plan” containing the order of the activities and the logical names of the resources (needed for their discovery in Grid environments). The execution plan (EP) is coded through a set of XML instructions extending the GGF workflow specification [41]; a toy example is sketched below. The enactment service schedules the workflow on a computational grid. It discovers the needed services by querying the Resource Discovery and Selector, which uses the iGrid information system and NWS [42], and then generates the worklist. The worklist holds the execution information of the activities and their execution order. The scheduler invokes the Web Services related to each activity and updates the EP to reflect the workflow status. Whenever workflow activities are started or finished, the system visualizes the workflow execution progress using a graphical utility.
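The toy example below gives the flavour of such an execution plan and of a minimal enactment loop that walks the activities in order and hands each one to a launcher. The XML structure and attribute names are invented for illustration; the actual EP extends the GGF workflow specification [41], and the real scheduler invokes each activity's Web service rather than printing it.

```python
import xml.etree.ElementTree as ET

# Toy execution plan; the element and attribute names are invented purely
# for illustration of the ordering and logical-resource idea.
EXECUTION_PLAN = """
<executionPlan experiment="docking-demo">
  <activity order="1" name="sphgen" resource="logical://compute/any"/>
  <activity order="2" name="grid"   resource="logical://compute/any"/>
  <activity order="3" name="dock"   resource="logical://compute/cluster"/>
  <activity order="4" name="rasmol" resource="logical://visualization"/>
</executionPlan>
"""

def enact(plan_xml, launch):
    """Walk the plan in activity order and hand each step to a launcher."""
    root = ET.fromstring(plan_xml)
    for activity in sorted(root.findall("activity"),
                           key=lambda a: int(a.get("order"))):
        launch(activity.get("name"), activity.get("resource"))

# A stand-in launcher; the real system would invoke each activity's Web service.
enact(EXECUTION_PLAN, lambda name, res: print(f"submit {name} -> {res}"))
```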
Figure 3. WorkFlow Management Architecture.
4. ProGenGrid Drug Design

An important service offered by our system is drug design. This process involves various steps, beginning with the laboratory synthesis of a compound (the candidate drug) and ending with the introduction of the therapeutic agent into the market. Using a traditional approach this process can take many years (12–15), due to the clinical testing needed to establish the toxicology and possible side effects. The R&D departments of many pharmaceutical companies aim at reducing the research timeline in the discovery stage. In particular, molecular modelling has emerged as a popular methodology for drug design, combining different disciplines such as computational chemistry and computer graphics. It can be redesigned as a distributed system involving many resources for the screening of a large number (of the order of a million) of ligand records, or molecules of compounds, in a chemical database to identify those that are potential drugs, taking advantage of HPC technologies such as clusters and Grids for large-scale data exploration. This process is called molecular docking and predicts how small molecules, the drug candidates, bind to an enzyme or a protein receptor of known three-dimensional (3D) structure. Receptor/ligand binding is a compute- and data-intensive task due to the large data sets of compounds to be screened. Our goal is to use Grid technologies to provide large-scale parallel screening and docking, reducing the total computation time and costs of the process. Scientists thus simulate receptor-ligand docking and obtain a score as a criterion for screening.

As an example, we model the drug design application with our workflow editor (Figure 4), involving the software needed in this process. In particular, we consider the DOCK [43] software, a popular tool for receptor-ligand docking. It takes as input the ligand and receptor files and outputs a score and the 3D structure of the docked ligand.
Figure 4. Drug Design case study.
In particular, the workflow starts with the crystal coordinates of the target receptor or its FASTA format (in this example, the protein target is 1NXB); then the AutoMs [44] tool is used to generate the molecular surface for the receptor, and Sphgen [45] generates spheres to fill the active site (the centers of the spheres become potential locations for ligand atoms). The DOCK software matches the sphere centers to the ligand atoms (extracted from structural databases such as PDB), and uses a scoring grid (generated by the grid program) to determine possible orientations for the ligand. Finally, the Rasmol tool visualizes the docked ligand protein.

The main issues raised by this kind of application are due to the computational load and the heterogeneity of the interfaces of the involved tools. Indeed, the screening can involve millions of ligands and hence requires high performance computing resources, the size of the repositories containing these ligands is often in the range of gigabytes, and the involved tools must be compiled and installed. To partially address the computation time issue, we would like to transform the DOCK program (but also other existing molecular docking applications, such as GAMESS — General Atomic and Molecular Electronic Structure System — [46] and AUTODOCK — Automated Docking of Flexible Ligands to Macromolecules — [47]) into a parameter sweep application, for execution on distributed systems; a sketch of this partitioning idea is given below. It is worth noting here that we do not intend to modify the existing sequential docking application, but rather to partition the input data files and submit each dock job using our GRB library. Moreover, we are developing a unique front-end to enable access to ligand molecules in the 3D-structure databases from remote resources (stored on a few grid nodes, given the large storage required), including related indexing mechanisms to facilitate reading the compounds, while a resource broker is used for scheduling and on-demand processing of docking jobs on grid resources. Finally, to solve the interface heterogeneity issue, the docking tools will be available as Web services, so the bioinformaticians will not need to know details about the installation or configuration of these tools.
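The Python sketch below illustrates the partitioning idea behind such a parameter sweep: the ligand library is split into independent chunks, and one job description is produced per chunk. The identifiers, chunk size and job fields are invented for illustration; in ProGenGrid the resulting jobs would be submitted to grid resources through the GRB library rather than printed.

```python
# Toy parameter-sweep partitioning for docking: split a ligand library into
# chunks and emit one independent job description per chunk.  Identifiers and
# job fields are placeholders for illustration only.

def chunk(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def make_dock_jobs(ligand_ids, receptor="1NXB", chunk_size=1000):
    jobs = []
    for n, ligands in enumerate(chunk(ligand_ids, chunk_size)):
        jobs.append({
            "job_id": f"dock-{receptor}-{n:04d}",
            "receptor": receptor,
            "ligands": ligands,          # input partition for one DOCK run
        })
    return jobs

ligand_library = [f"LIG{i:08d}" for i in range(2500)]   # invented identifiers
for job in make_dock_jobs(ligand_library, chunk_size=1000):
    print(job["job_id"], len(job["ligands"]), "ligands")
```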
5. Conclusions and Future Work

In this paper we have presented the main services of ProGenGrid, a virtual laboratory where e-scientists can use bioinformatics tools and compose, execute and control them using a WorkFlow Management System. In particular, we have presented the tools and the DAI service that we would like to implement for an efficient drug design Web Service. The main feature of this system is the possibility for bioinformaticians to compose, model and execute their own biological experiments with a set of tools already configured and running in a distributed environment. The bioinformaticians are not required to know technical details about application requirements or where an application is running. Moreover, the system guides the user, assisting her in the composition of an experiment and indicating if any error has occurred (wrong input format or incompatibility between the formats of applications in cascade). In our opinion, this system could help in the routine work of a pharmaceutical company for simulating drug design experiments, using a set of computing resources to reduce the time spent screening millions of chemical compounds. In the future, we plan to complete the implementation of our architecture and to present experimental results obtained by comparing our system against other solutions and platforms. Moreover, the Globus Alliance and IBM recently proposed the Web Services Resource Framework (WSRF) [48], a set of specifications designed to merge Grid and Web technologies by recasting the Open Grid Services Infrastructure (OGSI) concepts in terms of current Web service standards. So we plan to migrate our architecture to WSRF using the upcoming Globus Toolkit v4.
References [1] Benson, D. A., Boguski, M. S., Lipman, D. J. and Ostell, J: GenBank. Nucleic Acids Res 25 (1997), 1-6. [2] Fasman, K. H., Letovsky, S. I., Cottingham, R. W. and Kingsbury, D. T.: Improvements to the GDB Human Genome Data Base, Nucleic Acids Res 24 (1996), 57–63. [3] Jacobson, D. and Anagnostopoulos, A.: Internet resources for transgenic or targeted mutation research, Trends Genet. 12 (1996), 117–118. [4] Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Shimanouchi, O. K. T. and Tasumi, M.: The Protein Data Bank: a computer-based archival file for macromolecular structures, J. Mol. Biol 112 (1977), 535–542. [5] McKusick, V. A.: Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, Johns Hopkins University Press ed. 11 (1994). [6] Bairoch, A.: The ENZYME data bank, Nucleic Acids Res 21 (1993), 3155–3156. [7] B., Boeckmann, A., Bairoch, R., Apweiler, M., Blatter, A., Estreicher, E., Gasteiger, M. J., Martin, K., Michoud, C., O’Donovan, I., Phan, S., Pilbout, and M., Schneider: The Swiss-Prot protein knowledgebase and its supplement TrEMBL, Nucleic Acids Res 31 (2003), 365–370. Site address: http://www.ebi.ac.uk/swissprot/. [8] Özsu, M.T., & Valduriez, P.: Principles of Distributed Database Systems, 2nd edition, Prentice Hall (Ed.) (1999), Upper Saddle River, NJ, USA. [9] Rice, P. Longden, I. and Bleasby, A.: EMBOSS: The European Molecular Biology Open Software Suite, Trends in Genetics 16(6) (2000), 276–277. Site address: http://www.ch.embnet.org/EMBOSS/. [10] SRS Network Browser. Site address: http://www.ebi.ac.uk/srs/srsc/. [11] WfMC. Workflow management coalition reference model. Site address: http://www.wfmc.org/. [12] I., Foster, C., Kesselman: The Grid: Blueprint for a New Computing Infrastructure, Published by Morgan Kaufmann (1998). [13] Life Sciences Grid (LSG-RG). Site address: http://www.ggf.org/7_APM/LSG.htm.
[14] T.T. Wee, M.D. Silva, L.K. Siong, O.G. Sin, R. Buyya, and R. Godhia: Asia Pacific BioGRID Initiative, Presentation Slides at APGrid Core Meeting, Phuket, (2002). Site address: http://www.apbionet.org/grid/docs/. [15] myGrid Project, University of Manchester. Site address: http://mygrid.man.ac.uk/. [16] Kreger, H.: Web Services Conceptual Architecture. WSCA 1.0. IBM, 2001. [17] Ewa Deelman, James Blythe et al.: Mapping Abstract Complex Workflows onto Grid Environments, Journal of Grid Computing 1(1) (2003), 25–39. Site address: http://pegasus.isi.edu. [18] A. Chervenak et al.: Giggle: A Framework for Constructing Scalable Replica Location Services, Proceedings of Supercomputing 2002 (SC2002), Baltimore, MD, (2002). [19] Czajkowski K., Fitzgerald S., Foster I., Kesselman C.: Grid Information Services for Distributed Resource Sharing, Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, (2001). [20] G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Mahohar, S. Pail, L. Pearlman: A Metadata Catalog Service for Data Intensive Applications, Proceedings of Supercomputing 2003 (SC2003), November 2003. [21] M. Cannataro, C. Comito, F. Lo Schiavo, and P. Veltri: Proteus, a Grid based Problem Solving Environment for Bioinformatics: Architecture and Experiments, IEEE Computational Intelligence Bulletin 3(1) (2004), 7–18. [22] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto: ProGenGrid: A Grid Framework for Bioinformatics, Proceedings of International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2004), September 14-15 2004, Perugia, Italy. [23] Aloisio, G., Blasi, E., Cafaro, M., Epicoco, I.: The GRB library: Grid Computing with Globus in C., Proceedings HPCN Europe 2001, Amsterdam, Netherlands, Lecture Notes in Computer Science, SpringerVerlag, 2110 (2001), 133–140. [24] G. Aloisio, M. Cafaro, I. Epicoco, S. Fiore, D. Lezzi, M. Mirto and S. Mocavero: iGrid, a Novel Grid Information Service, to appear in the Proceedings of European Grid Conference (EGC 2005), Springer Verlag, 2005. [25] I., Foster, C., Kesselman: Globus: A Metacomputing Infrastructure Toolkit, Intl J. Supercomputer Applications 11(2) (1997), 115–128. [26] Globus GRAM. Site address: http://www-unix.globus.org/toolkit/docs/3.2/gram/key/index.html. [27] S. Tuecke, Grid Security Infrastructure (GSI) Roadmap. Site address: www.gridforum.org/security/ggf1_2001-03/drafts/draft-ggf-gsi-roadmap-02.pdf. [28] GridFTP Protocol. Site address: http://www-fp.mcs.anl.gov/dsl/GridFTP-Protocol-RFC-Draft.pdf. [29] Van Engelen, R.A., Gallivan, K.A.: The gSOAP Toolkit for Web Services and Peer-To-Peer Computing Networks. Proceedings of IEEE CCGrid Conference, May 2002, Berlin, 128–135. [30] Aloisio, G., Cafaro, M., Lezzi, D., Van Engelen, R.A.: Secure Web Services with Globus GSI and gSOAP. Proceedings of Euro-Par 2003, 26th - 29th August 2003, Klagenfurt, Austria, Lecture Notes in Computer Science, Springer-Verlag, 2790 (2003), 421–426. [31] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific datasets, Journal of Network and Computer Applications 23 (2001), 187–200. [32] DIME Specification. Site address: http://msdn.microsoft.com/library/en-us/dnglobspec/html/draftnielsen-dime-02.txt. [33] G. Aloisio, M. Cafaro, S. Fiore, M. 
Mirto: The GRelC Project: Towards GRID-DBMS, Proceedings of Parallel and Distributed Computing and Networks (PDCN) IASTED, Inns-bruck (Austria) February 17-19 (2004). Site address: http://gandalf.unile.it. [34] Altschul, Stephen F., Gish Warren, Webb Miller, Eugene W. Myers, and David J. Lipman: Basic local alignment search tool, J. Mol. Biol. 215 (1990), 403–410. [35] Roger A. Sayle and E. J. Milner-White: RasMol: Biomolecular graphics for all, Trends in Biochemical Science (TIBS) 20(9) (1995), 374. Site address: http://www.umass.edu/microbio/rasmol/. [36] A. Sheth and J. Larson: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys 22(3) (1990), 183–236. [37] M. Lenzerini: Data Integration: A Theoretical Perspective, In Proceedings of the 21st ACM SIGMODSIGACT-SIGART symposium of Principles of database systems (PODS) (2002), 233–246. ACM Press. [38] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto: Bioinformatics Data Access Service in the ProGenGrid System, Proceedings of the First International Workshop on Grid Computing and its Application to Data
126
G. Aloisio et al. / ProGenGrid: A Grid-Enabled Platform for Bioinformatics
Analysis (GADA 2004), OTM Workshop 2004 3292 (2004), 211–221. [39] Daml.org. Daml+oil language. Site address: http://www.daml.org/2001/03/reference.html. [40] The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology, Nature Genet. 25 (2000), 25–29. [41] H.P. Bivens. Grid Workflow. Grid Computing Environments Working Group Document, 2001. Site address: http://dps.uibk.ac.at/uploads/101/draft-bivens-grid-workflow.pdf. [42] R. Wolski, N. Spring and J. Hayes: The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing, Future Generation Computing Systems 15 (5/6) (1999), 757– 768. [43] T. J. A. Ewing and I.D. Kuntz: Critical Evaluation of Search Algorithms for Automated Molecular Docking and Database Screening, J. of Computational Chem. 18 (9) (1996), 1175–1189. Site address: http://dock.compbio.ucsf.edu/. [44] AutoMS. Site address: http://dock.compbio.ucsf.edu/dock4/html/Manual.23.html#33642. [45] Sphgen. Site address: http://dock.compbio.ucsf.edu/dock4/html/Manual.20.html#17338. [46] GAMESS. Site address: http://www.msg.ameslab.gov/GAMESS/GAMESS.html. [47] AUTODOCK. Site address: http://www.scripps.edu/pub/olson-web/doc/autodock/. [48] The WS-Resource Framework. Site address: http://www.globus.org/wsrf/.
A Grid Architecture for Medical Applications
Anca BUCUR a, René KOOTSTRA a and Robert G. BELLEMAN b
a Philips Research, Prof. Holstlaan 4, 5656 AA Eindhoven, the Netherlands
{anca.bucu,rene.kootstra}@philips.com
b Universiteit van Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, the Netherlands
[email protected]
Abstract. Grid technology can provide medical organisations with powerful tools through which they can gain coordinated access to computational resources that were hitherto inaccessible to them. This paper discusses how several classes of medical applications could benefit from the use of Grid technology. We concentrate on applications that were put forward by partners in the Dutch VL-e project. After describing the difficulties related to the realization of such applications without making use of the Grid, we describe an architecture that allows the applications to use Grid resources. We demonstrate how this architecture can be integrated into existing systems to provide flexible and transparent access to Grid services and show performance results of a test case.
Keywords. Grid-computing, adaptive architecture, compute-intensive medical applications, fiber tracking, decomposition, speedup
Introduction
The “Virtual Laboratory for e-Science” (VL-e)1 project was initiated after the successful “Grid-enabled Virtual Laboratory Amsterdam” (VLAM-G) project [1]. In the VLAM-G project, an experimentation environment was built that allows scientists to construct experiments that are composed of geographically distributed computational resources as if they are part of the same organisation. The VL-e project will address challenging problems, including the manipulation of large scientific datasets, computationally demanding data analysis, access to remote scientific instruments, collaboration, and data dissemination across multiple organizations. The methods, techniques, and tools developed within the VL-e project are targeted for use in many scientific and industrial applications. The project will develop the infrastructure needed to support these and other related e-Science applications, with the aim to scale up and validate the developed methodology. The VL-e philosophy is that any Problem Solving Environment (PSE) based on the Grid-enabled Virtual Laboratory will be able to perform complex and multi-disciplinary experiments, while taking advantage of distributed resources and generic methodologies and tools. In the VL-e project, several PSEs will be developed in parallel, each applied to a different scientific area. One of these areas is Medical Diagnosis and Imaging, which is the focus of this research.
1 http://www.vl-e.nl/.
This paper discusses how several classes of medical applications can benefit from parallel computing through the use of Grid technology and describes an adaptive Grid-based architecture suitable for all these application classes. In Section 1 we introduce three computationally challenging medical applications relevant in the context of the VL-e project and perform an algorithmic analysis on each of them. In Section 2 we discuss three decomposition patterns that can be used to parallelize the applications described in Section 1. Section 3 proposes a flexible and generic architecture suitable for applications that fit the three decomposition patterns, and in Section 4 we describe a first case study of a medical application. We perform experiments in a Grid environment with a parallel version of the application and assess its performance and scalability.
1. Analysis of Medical Applications
The partners in the VL-e project have selected a number of computationally challenging applications. In particular, the clinical use of these applications is hampered by insufficient computational power. In this section we describe these applications and assess whether the underlying algorithms are suitable for parallelization.
1.1. White Matter Fiber Tractography
Recent advances in Magnetic Resonance (MR) research have opened up opportunities for gathering functional information about the brain. In addition, the development of Diffusion Tensor Imaging (DTI) offers the possibility to go beyond anatomical imaging and study tissue structure at a microscopic level in vivo [2]. The method uses a medical imaging technique known as Diffusion Weighted Magnetic Resonance Imaging (DWMRI) to measure the restricted movement of water molecules along the direction of fibers. From these measurements, a tensor is constructed that describes diffusion in multiple directions. An important application of DTI is fiber tracking (FT). This application uses the anisotropic diffusion of water molecules in the brain to visualize the white-matter tracts and the connecting pathways between brain structures. Combined with functional MRI, the information about white-matter tracts reveals important information about neuro-cognitive networks and may improve the understanding of brain function. Several studies have demonstrated the feasibility of reasonably fast FT in the human brain [3]. Other research concentrates on improving the accuracy, the robustness and the throughput of the FT [4]. There are several clinical applications where FT is relevant, such as psychiatry [5], surgical planning or stroke detection.
Algorithmic Analysis: There are various solutions to FT, but the common feature is that, starting from various points, white matter fibers have to be tracked in the entire data domain. The number of detected fibers (and therefore the accuracy of the algorithm) grows with the number of considered starting points. For areas with a high concentration of fibers, too many detected fibers may lead to an indistinguishable image, which makes selection necessary. Since choosing fewer starting points would not be a good option (it would decrease the accuracy), the selection is in general performed after the fibers are generated, by specifying a number of regions of interest that the fibers have to cross. The execution time of the application depends on the number of starting points, the algorithm, and the size of the data set, and can amount to many hours. FT
would become clinically relevant if the throughput was increased without decreasing the accuracy of the result. This can be achieved by parallelization. In order to pay off, the parallel solution has to be scalable and the amount of communication among the processors performing the algorithm has to be kept to a minimum. Fibers are tracked in the entire domain, so directly decomposing the data domain among the participating processors would not be viable due to the high need for communication and synchronization among processors. The starting points however can be distributed among processors with little extra synchronization and communication. Therefore, this problem is suited for computational decomposition, meaning that each processor that takes part in the computation receives the entire data domain, but the computation domain (i.e., the starting points) is divided among processors.
1.2. Functional Bowel Imaging and MR Virtual Colonoscopy
It takes many years before colon polyps become cancerous. Therefore, periodical screening can help in preventing colon cancer. However, current methods to detect colon polyps are very intrusive and are unsuitable for screening. Alternatives exist that use Computed Tomography (CT) to image the colon; the images can then be used to perform a virtual colonoscopy through scientific visualization techniques [6]. However, the ionizing radiation used in CT also makes this alternative unsuitable for screening purposes. Another alternative is to optimize the imaging strategies to exploit the higher signal-to-noise ratio of 3 Tesla MRI. In video mode, the MR scanner can generate real-time images. In combination with high-resolution static anatomical images, the data will be processed and then viewed in an interactive 3D representation.
Algorithmic Analysis: In these applications the volume reconstruction for visualization is performed using a combination of image processing and isosurface extraction algorithms which construct an isosurface from a 3D field of values. The idea of the algorithm is to divide the three-dimensional geometrical space of the problem into a grid of cubes. For each cube that crosses the surface, the part of the surface contained by the cube is approximated by the appropriate set of polygons. The union of all polygons approximates the surface and the precision of the approximation is proportional to the grid resolution. The basic approach is to traverse all cells in the space to find the approximation of the surface in each cell. For good accuracy of the result, the data set needs to be quite large, which can lead to long execution times. This problem can be parallelized by splitting the data domain into a number of sub-domains which are then distributed among the available processors. This method is called domain decomposition; each processor independently performs the algorithm on its sub-domain. At the end, each processor contributes to the final result with the surface generated in its own sub-domain.
1.3. Computer Aided Diagnosis and Surgical Planning in Cardiovascular Disease
Vascular disorders in general fall into two categories: Stenosis, a constriction or narrowing of the artery by the build-up over time of fat, cholesterol and other substances in the vascular wall, and aneurysm, a ballooning-out of the wall of an artery, vein or the heart due to weakening of the wall. A vascular disorder can be detected by several imaging techniques such as X-ray angiography, MRI or CT.
A surgeon may decide on different treatments in different circumstances, but all these treatments aim to improve the blood flow of the affected area. The purpose of vascular reconstruction is to redirect and augment blood flow, or perhaps repair a weakened or aneurysmal vessel through a surgical procedure. Pre-operative surgical planning would allow the a priori evaluation of different procedures under various physiologic states, such as rest and exercise, thereby increasing the chances of a positive outcome for the patient [7,8]. Algorithmic Analysis: The current approach for the treatment of patients with vascular disease is to first image the diseased area through angiography, then to plan the treatment, possibly followed by surgical intervention. We envision a simulated vascular reconstruction environment to provide decision support information to a vascular surgeon during treatment planning. To achieve this, our proposed system consists of the following: First, the vascular geometry of the patient is acquired through volumetric scanning methods (such as CTA or MRA). This scan needs to be of sufficient resolution and contrast to allow accurate isolation of the vascular morphology from the scan, including the diseased area. Next, image segmentation is used to isolate the vascular morphology from the scan. The resulting morphology is used to construct a computational data structure that can be used in a blood flow simulator, which is used to simulate the physical properties of blood flow through the vascular geometry. Properties that are of interest include pressure, flow velocity and wall shear stress. The results of this simulation are presented to the surgeon through scientific visualization methods. Based on the visualization, the surgeon will be able to propose a viable treatment. The surgeon simulates the treatment by interactively altering the computational data structure used in the flow simulation. Based on this new data structure, the flow simulator calculates a new flow solution and presents the results to the surgeon. The system just described consists of several tightly interfaced software components [7]. Each of these components has its own unique computational requirements. Therefore, they can execute independently from each other in parallel. This decomposition method is called functional decomposition.
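This pipelined organisation can be sketched in a few lines. The following is a minimal illustration, not the VL-e system itself, using Python's multiprocessing module; the three stage functions are placeholders standing in for the actual segmentation, flow-simulation and visualization components.

```python
from multiprocessing import Process, Queue

def segment(inbox, outbox):
    """Stage 1: image segmentation (placeholder)."""
    for scan in iter(inbox.get, None):
        outbox.put(("mesh", scan))
    outbox.put(None)                       # propagate the end-of-stream marker

def simulate_flow(inbox, outbox):
    """Stage 2: blood flow simulation on the segmented geometry (placeholder)."""
    for mesh in iter(inbox.get, None):
        outbox.put(("flow", mesh))
    outbox.put(None)

def visualize(inbox):
    """Stage 3: scientific visualization of the flow solution (placeholder)."""
    for flow in iter(inbox.get, None):
        print("rendering result for", flow[1][1])

if __name__ == "__main__":
    q1, q2, q3 = Queue(), Queue(), Queue()
    stages = [Process(target=segment, args=(q1, q2)),
              Process(target=simulate_flow, args=(q2, q3)),
              Process(target=visualize, args=(q3,))]
    for p in stages:
        p.start()
    for scan in ["scan_001", "scan_002", "scan_003"]:   # successive data sets
        q1.put(scan)                                     # enter the pipeline
    q1.put(None)
    for p in stages:
        p.join()
```

Each stage runs on its own processor and moves on to the next data set as soon as it has passed its result downstream, which is exactly the pipelined behaviour described above.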
2. Decomposition Patterns
As described in Section 1, the applications that we have presented fit into three classes of decomposition patterns which allow them to exploit parallelism. Their algorithms exhibit a significant degree of spatial locality in the way they access memory, as well as temporal locality in the sequence of operations that are performed. It may be expected that the problem sizes will increase over time, requiring increasingly powerful computational resources. Furthermore, with the availability of increasing computational power and wide access to (geographically) distributed resources it can be expected that new applications will emerge and that some of the existing ones will gain importance. In this context, our goal is to provide a general architecture that is suitable for a wide range of applications. Besides improving the performance of each application, the architecture should be scalable relative to the data volume, and should allow changes in the computational algorithm with a minimum of changes in the algorithmic structure and no change at all in the architecture itself. By studying the common and differentiating features of the application classes that we selected, we are able to design an adaptive Grid architecture which can be applied to applications fitting at least one of the decomposition patterns.
Figure 1. Decomposition patterns; (a) domain decomposition, (b) computational decomposition, (c) functional decomposition.
2.1. Domain Decomposition
With this decomposition pattern, the data domain of the application is split into disjoint partitions among the participating processors. Each processor performs the same algorithm on its own partition of data, preferably with a minimum amount of communication or synchronization (see Figure 1(a)). When there is no communication among processors we speak of pure domain decomposition. Examples of algorithms in this group are the image processing and isosurface extraction algorithms used for the purpose of volume reconstruction as described in the previous section. When the processors need to exchange data during the execution of the application we speak of domain decomposition with data exchange. The communication may occur at a few isolated instances or may have an iterative nature. In this class of applications are various image processing, scientific visualization and computational simulation algorithms.
2.2. Computational Decomposition
With computational decomposition, each processor performs the same set of computations on a disjoint part of the domain but needs access to the entire data set (see Figure 1(b)). The computational domain of the application is split among the processors, while the data domain is shared. This paradigm applies to the FT application described in the previous section. This application will be described in more detail in Section 4.
2.3. Functional Decomposition
For this decomposition pattern it is characteristic that several algorithms are performed on several data sets in fixed succession. It is in fact a specialization of activities among processors: each processor is responsible for the execution of one algorithm, then the data set is passed to another processor for performing the next algorithm (see Figure 1(c)). The current processor then moves on to the next data set, resulting in a pipelined execution. An example of an application that fits this paradigm is the vascular reconstruction application described in Section 1; here image processing algorithms provide input to a flow simulation algorithm, which in turn provides input to a scientific visualization algorithm.
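As a concrete instance of pure domain decomposition (Section 2.1) applied to isosurface extraction, the following is a minimal sketch assuming NumPy, scikit-image's marching_cubes and a local multiprocessing pool standing in for Grid nodes; it is not the GAMA implementation, and the placeholder volume is synthetic.

```python
import numpy as np
from multiprocessing import Pool
from skimage.measure import marching_cubes

def extract_patch(args):
    """Isosurface extraction on one sub-domain; no communication with other workers."""
    sub_volume, z_offset, level = args
    verts, faces, normals, values = marching_cubes(sub_volume, level=level)
    verts[:, 0] += z_offset            # move vertices back into global coordinates
    return verts, faces

def parallel_isosurface(volume, level, n_workers=4):
    # Split along the first axis; one extra slice of overlap avoids gaps
    # between neighbouring sub-domains.
    bounds = np.linspace(0, volume.shape[0], n_workers + 1, dtype=int)
    tasks = [(volume[lo:min(hi + 1, volume.shape[0])], lo, level)
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    with Pool(n_workers) as pool:
        patches = pool.map(extract_patch, tasks)
    return patches                     # each worker contributes its own surface piece

if __name__ == "__main__":
    vol = np.random.rand(128, 256, 256).astype(np.float32)   # placeholder CT/MR volume
    pieces = parallel_isosurface(vol, level=0.5)
    print(sum(len(v) for v, _ in pieces), "vertices extracted")
```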
Figure 2. General architecture for solving compute-intensive medical applications using Grid technology.
3. GAMA: Grid Architecture for Medical Applications
The three decomposition patterns described in the previous section can all be modeled within a general framework. Figure 2 depicts our adaptive architecture designed to simultaneously support several applications fitting at least one of the decomposition patterns. As this figure shows, we chose a client-server architecture; the server can simultaneously provide different sets of services for each of the application clients. Our primary intention is to make this framework minimally invasive, in the sense that the influence on the end-user is as small as possible. Instead of being enforced, the Grid-based solution will either be offered as an option to the user, or the client application will automatically choose whether to use external Grid resources or not. In the event that insufficient resources are available, the applications should automatically fall back to their local version. Currently, the applications (situated in the hospital) run on Windows-based workstations. At the other end, Grid technology is centered around Globus, a software interface that provides the Grid “middleware” [9]. Globus is based on the Unix operating system. Therefore, in order to enable such applications to use the Grid for their execution, the compute-intensive part of the application has to be removed from the rest of the application and placed in the Grid environment.
Figure 3. Applying the GAMA architecture to the fiber tracking application.
To provide an interface from the Windows environment to a Grid infrastructure that is designed around Globus, a Linux machine called the “Grid Access Point” (GAP) receives the requests from the hospital side, which we call the client side, and allocates the processors on the Grid, passing on the requests and returning the results to the client. The GAP may use Globus for submitting the requests to the Grid nodes, thereby exploiting the security and execution facilities offered by it. Globus is entirely Unix based, so it could not be used if we chose to connect the client side directly to the Grid nodes. The overall performance of the application benefits from keeping the client side as little involved in the computational algorithm as possible and from restricting the communication with the rest of the system to job-submission requests only. This is because in most cases the client side will be connected to the rest of the system via slow network links. For high throughput, the GAP should be connected to the Grid infrastructure via a fast network.
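The role of the GAP can be pictured with a deliberately simplified relay sketch. Everything below is an assumption for illustration only: the plain-socket protocol, the resource contact string and the use of the globus-job-run command line are stand-ins for the actual GAMA interfaces.

```python
import socket
import subprocess

GRID_NODE = "node.example.org"   # placeholder Grid resource contact

def serve(port=9000):
    """Accept a job request from the Windows client and relay it to the Grid."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", port))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        request = conn.recv(4096).decode().strip()   # e.g. "fibretrack input.dat 32"
        try:
            # globus-job-run is used here only as an example of a Globus
            # submission command; the exact tool and options depend on the
            # toolkit version installed on the GAP.
            result = subprocess.run(
                ["globus-job-run", GRID_NODE] + request.split(),
                capture_output=True, text=True, timeout=3600).stdout
        except Exception as exc:
            result = f"ERROR: {exc}"     # client can then fall back to its local version
        conn.sendall(result.encode())
        conn.close()

if __name__ == "__main__":
    serve()
```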
4. Test Case: Fiber Tracking As a first case study, we apply the GAMA architecture to the FT application (see Figure 3). The Grid-enabled FT application should gain performance by distributing its computational part, the FT algorithm, across Grid computational resources. We developed a parallel version of the FT algorithm and assessed its performance for several sets of application parameters and different numbers of processors. The purpose of this experiment is to check whether this type of application can benefit from our Grid-based architecture.
4.1. The Environment
The sequential FT application was built with the Philips Research Imaging Development Environment (PRIDE) on a Windows NT-based machine using the Interactive Data Language (IDL)2. The sequential FT application is based on a prototype application running on a single Windows workstation. It was modified to generate data sets, parameters and results. We ported this application to Unix and parallelized it. Our experiments with the parallel version of FT were performed on the second-generation Distributed ASCI Supercomputer (DAS-2)3. The DAS-2 is a wide-area computer system located at five Dutch universities. It consists of 200 nodes split into five clusters, one with 72 nodes, the other four with 32 nodes each. Programs are started on the DAS-2 using the PBS batch queuing system, which reserves the requested number of nodes for the duration of a program run. We submitted the jobs to the DAS-2 using Globus and used MPI to implement parallelism. The outputs of the executions of the sequential FT application on the single workstation and of the parallel FT application on the DAS-2 system were compared in order to verify the correctness of the distributed solution.
4.2. Results
In this section we present results of experiments performed with our parallelized version of FT. With full volume fiber tracking (FVFT), the starting points are evenly distributed in the entire domain. Compared to other solutions, e.g. placing starting points only in the regions of interest (ROIs), FVFT has higher computational needs but also higher accuracy, detecting a larger number of fibers and also detecting splitting and crossing fibers. When comparing the performance of the sequential FT application to the distributed FT, we scaled the values to take into account the difference in CPU speed between the DAS-2 nodes and the workstation running the sequential version. However, these two applications run on different architectures and the comparison is only relevant as an indication of the potential performance gain through parallelization. Our first experiments show that tracking long fibers takes noticeably longer than tracking short fibers, or checking areas with no fibers. It is also the case that fibers are in general grouped in large bundles. Since in our solution jobs are rigid (i.e. all tasks start and end at the same time), the longest task determines the execution time. This implies that simply splitting the computational domain into a number of sub-domains equal to the number of processors is not an efficient solution: processors receiving parts of the domain with many long fibers perform a large amount of work, while processors receiving parts of the domain with no fibers spend most of the time waiting. As an alternative, we designed a solution which splits the domain on one of the axes into slices of width equal to the size of the voxel. These slices are then distributed among the processors using Round Robin, which yields a better workload balance. The fraction of the algorithm that tracks a fiber from a starting point is inherently sequential and limits the speedup. For identical data sets, we compare two cases differentiated by the step size for tracking the fibers.
2 http://www.rsinc.com/idl/.
3 http://www.cs.vu.nl/das2.
Figure 4. The speedup of the FVFT algorithm for large step (a) and small step (b) for tracking fibers.
We first studied the scalability of the application for a large step size. In this case the computation time for a single fiber is short compared to the total execution time. The execution time of the sequential FT algorithm was the equivalent of 440 s, while the parallel FT algorithm took 1065 s to complete on a single processor. We compared the execution time for one, two and four ROIs. The results showed that for FVFT the number of ROIs has almost no influence on performance. We ran the application on up to 64 processors and concluded that the speedup is almost linear for up to 32 processors (see Figure 4(a)). The minimum execution time is limited by the slice of starting points requiring the longest time to compute and by the initialization time, communication time, and time required to store the results to disk. For more than 32 processors the performance improvement is very small: the computation time is still reduced by further splitting the computational domain (the limit of the inherently sequential part of the algorithm is not reached yet), but the execution time is increasingly dominated by the initialization and the communication time. The time for storing the results is constant (only one of the processors writes to the disk), while the communication time increases proportionally with the number of processors (from 0.12 s for 2 processors to 5.7 s for 64 processors). For 64 processors, about 20% of the execution time is spent in communication. For the same data set, we performed experiments with a very small step size. This change significantly increased the execution time to the equivalent of more than 18 hours for the sequential version of the algorithm. Figure 4(b) shows the speedup obtained in this case. Similarly to the previous case, the scalability is very good for up to 32 processors. The communication time is not influenced by the increase in computation time but only by the number of processors performing the algorithm, so it has similar values to the previous case. Executing the application on more than 32 processors does not seem to pay off. More investigation is needed to identify the reasons for this limitation in performance and to discern whether it would similarly affect other applications fitting computational decomposition. Since we are aiming at an adaptive framework, any solution to further improve the performance should be independent of the algorithmic details of the application, so that it could suit other applications in the same decomposition class.
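The round-robin distribution of seed-point slices described above can be expressed compactly with MPI. The following is a minimal mpi4py sketch rather than the PRIDE/IDL code used in the experiments; the tensors.npy file and the tracking kernel are hypothetical placeholders.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Every processor receives the entire data domain (computational decomposition);
# only the slices of seed points are partitioned.
tensors = np.load("tensors.npy") if rank == 0 else None   # hypothetical DTI volume
tensors = comm.bcast(tensors, root=0)

def track_fibres_in_slice(vol, s):
    """Placeholder for the fibre-tracking kernel applied to one slice of seeds."""
    return []

n_slices = tensors.shape[0]
local_fibres = []
# Round-robin assignment of one-voxel-wide slices balances fibre-rich
# and empty regions across the processors.
for s in range(rank, n_slices, size):
    local_fibres.extend(track_fibres_in_slice(tensors, s))

# Rigid job: all ranks finish, then rank 0 gathers and writes the results once.
gathered = comm.gather(local_fibres, root=0)
if rank == 0:
    fibres = [f for part in gathered for f in part]
    print(len(fibres), "fibres tracked")
```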
5. Conclusions and Future Work The fiber tracking test case described in the previous section illustrates an example of one of the three decomposition paradigms described in Section 2. Our future work includes the connection of the parallelized FT application through the GAP to its Windows interface, and the extension of the GAMA architecture to applications fitting the other two decomposition paradigms that we have presented. Other applications with different decomposition characteristics will also be investigated in the future, including (but not limited to) image processing, image registration, scientific visualization and computer graphics algorithms. If GAMA were to be used in a hospital environment today, the communication overhead from transferring images to the computing back-end may be significant enough to kill the performance gain obtained from parallelization. GAMA benefits from an architecture where data is stored “closer” (in terms of time required to communicate) to the computing back-end so that communication overhead is minimized. Such a situation could occur when hospitals store medical data on remote storage resources; having the data close to the computational resources would decrease the communication overhead that now occurs in GAMA. One such situation is currently under investigation and explores the possibility of adapting SDSC’s SRB4 so that it functions as a PACS server.
Acknowledgements Part of this work was carried out in the context of the Virtual Laboratory for e-Science project (http://www.vl-e.nl/). This project is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ). The sequential FT application that constitutes the basis for our parallel FVFT was provided by Philips Medical Systems.
4 The SDSC Storage Resource Broker (SRB). http://www.npaci.edu/dice/srb/.
References
[1] H. Afsarmanesh, R.G. Belleman, A.S.Z. Belloum, A. Benabdelkader, J.F.J. van den Brand, G.B. Eijkel, A. Frenkel, C. Garita, D.L. Groep, R.M.A. Heeren, Z.W. Hendrikse, L.O. Hertzberger, J.A. Kaandorp, E.C. Kaletas, V. Korkhov, C.T.A.M. de Laat, P.M.A. Sloot, D. Vasunin, A. Visser, and H.H. Yakali. VLAM-G: A grid-based virtual laboratory. Scientific Programming Journal, 10(2):173–181, 2002.
[2] S. Mori and P.C. van Zijl. Fiber tracking: principles and strategies – a technical review. NMR Biomed, 15(7-8):468–480, 2002.
[3] D. Xu, S. Mori, M. Solaiyappan, P.C. van Zijl, and C. Davatzikos. A framework for callosal fiber distribution analysis. Neuroimage, 17(3):1131–1143, 2002.
[4] N. Kang, J. Zhang, and E.S. Carlson. Fiber tracking by simulating diffusion process with diffusion kernels in human brain with DT-MRI data. Technical Report 428-05, Dept. of Comp. Science, Univ. of Kentucky, 2005.
[5] J. Zhang, L.J. Richards, P. Yarowski, P.C. van Zijl, H. Huang, and S. Mori. Three dimensional anatomical characterization of the developing mouse brain by diffusion tensor microimaging. Neuroimage, 20(3):1639–1648, 2003.
[6] F.M. Vos, R.E. van Gelder, I.W.O. Serlie, J. Florie, C.Y. Nio, A.S. Glas, F.H. Post, R. Truyen, F.A. Gerritsen, and J. Stoker. Three-dimensional display modes for CT colonography: conventional 3D virtual colonoscopy versus unfolded cube projection. Radiology, 228:878–885, 2003.
[7] R.G. Belleman and P.M.A. Sloot. Simulated vascular reconstruction in a virtual operating theatre. In H.U. Lemke, M.W. Vannier, K. Inamura, A.G. Farman, and K. Doi, editors, 15th International Congress and Exhibition, Computer Assisted Radiology and Surgery (CARS 2001), number 1230 in Excerpta Medica, International Congress Series, pages 938–944, Amsterdam, the Netherlands, June 2001. Elsevier Science B.V. ISBN 0-444-50866-X.
[8] Joy P. Ku, Mary T. Draney, Frank R. Arko, W. Anthony Lee, Frandics P. Chan, Norbert J. Pelc, Christopher K. Zarins, and Charles A. Taylor. In vivo validation of numerical prediction of blood flow in arterial bypass grafts. Annals of Biomedical Engineering, 30:743–752, 2002.
[9] Ian Foster and Carl Kesselman. Globus: A metacomputing infrastructure toolkit. Intl. J. Supercomputer Applications, 11(2):115–128, 1997.
Applications of GRID in Clinical Neurophysiology and Electrical Impedance Tomography of Brain Function
J. FRITSCHY a, L. HORESH a, D. HOLDER a and R. BAYFORD a,b
a Department of Medical Physics & Bioengineering, University College London
b School of Health and Social Sciences, Middlesex University
[email protected]
Abstract. The computational requirements in Neurophysiology are increasing with the development of new analysis methods. The resources the GRID has to offer are ideally suited for this complex processing. A practical implementation of the GRID, Condor, has been assessed using a local cluster of 920 PCs. The reduction in processing time was assessed for spike recognition of the Electroencephalogram (EEG) in epilepsy using wavelets and for the computationally demanding task of non-linear image reconstruction with Electrical Impedance Tomography (EIT). Processing times were decreased by 25 and 40 times respectively. This represents a substantial improvement in processing time, but it is still suboptimal due to factors such as shared access to resources and a lack of checkpoints, so that interrupted jobs had to be restarted. Future work will be to use these methods in non-linear EIT image reconstruction of brain function and in methods for automated EEG analysis, if possible with further optimized GRID middleware.
Keywords. Electrical Impedance Tomography, GRID, Condor and Matlab
1. Introduction
Clinical Neurophysiology is the clinical discipline of investigation of the nervous system by electronic methods. The principal methods are EEG (Electroencephalography), the recording of voltages on the scalp produced by the brain; EMG (Electromyography), the recording of voltages produced by muscles or nerves, usually in the limbs, with skin or intramuscular electrodes; and Evoked Potentials, the recording of voltages produced on the scalp by evoked physiological stimulation. The clinical practice of these methods has remained largely unchanged for half a century; although PCs are now widely used to record digitized data, they do little processing and function as data recorders. However, this is likely to change in the near future. New methods of data analysis, for example chaos theory analysis to predict the likelihood of epileptic seizures [1], require substantial computing resources. In our research group at University College London (UCL), we are working on new methods such as these that will enable the power of modern computers to improve the information that can be obtained from these investigations. Many of them can be performed on a local PC, but some may take hours, days or even weeks on a well-specified PC (512 MB of memory, Pentium 4, 3 GHz), while the data are required more urgently for diagnosis. We have therefore been investigating the use of the GRID to perform this processing.
Figure 1. EIT equipment: electrode box, acquisition system and portable computer.
The concept is to develop user-friendly software which could be used in an acute clinical setting, such as a Casualty department or investigative outpatient clinic. The doctor or technician would acquire the data and transparently send it for processing, which would be performed in real time, or at least over a few minutes, at remote resources over the GRID.
One potential application lies in the automated analysis of EEG records for epileptic activity. The particular application is in the investigation of subjects with severe epilepsy with a view to performing curative surgery, in which the abnormal part of the brain, which causes the epilepsy, is removed. The position of the abnormal part of the brain must be ascertained first. One of the methods is to admit the subject to hospital for several days onto a “telemetry” ward where EEG and video are recorded continuously. After several epileptic seizures are captured, the traces are analysed to indicate the likely region of origin of the epileptic activity. It is possible for epileptic activity to occur in a subject without the occurrence of a physical seizure. The entire multi-day EEG recording is therefore analyzed in case there are attacks or other activity not noticed by the ward staff or patient. This is usually done manually by technicians and is very time consuming. Software methods [2] have therefore been developed to automate this, so that only suspicious parts of the days of recordings are inspected. As the majority of epileptic activity seen in an EEG comprises a negative spike-shaped voltage deflection lasting less than 70 msec, this is termed “automated spike analysis”.
Another application is the new method of Electrical Impedance Tomography (EIT) [3]. This is a recently developed medical imaging method in which tomographic “slice” images are rapidly produced using electrodes placed around the body. The equipment (Figure 1) comprises a box of electronics about the size of a video recorder, attached to a PC. It is safe, portable and inexpensive and has unique potential for recording brain images continuously at the bedside or acutely in casualty departments where it is not practicable to use conventional large brain scanners such as MRI or X-ray CT.
In EIT, multiple measurements of electrical impedance are made using four electrodes for each individual measurement. Typically, several hundred samples are collected in a fraction of a second, using different combinations of 16 to 64 ECG-type electrodes applied to the body part of interest. These are transformed into a tomographic image using a reconstruction algorithm [3]. In the early development of EIT, this could be achieved by a relatively simple algorithm, which employed back-projection [4,5] or a matrix sensitivity method in which a single matrix multiplication was performed [6–8]. These were not computationally demanding because they employed the assumption that voltages recorded on the exterior were linearly related to changes in impedance properties in the subject. More recently, non-linear algorithms have been developed, which reflect the true non-linear relation between external voltages and internal impedance properties. In general, these require multiple iterations for solution. Our group has recently implemented such a non-linear algorithm [9], which has the additional demand that a fine finite element mesh of the head is employed. This has become necessary because a potentially important application lies in using EIT to image acute stroke subjects urgently on arrival in casualty departments. It has recently been shown that brain damage can be minimized by the use of thrombolytic (clot-dissolving) drugs, but these must be administered within 3 hours and neuroimaging must be performed first, in order to exclude a brain hemorrhage, as the thrombolytic drugs cause these to extend. Linear reconstruction algorithms have been shown to work well in images where there is a change over time so that data can be normalized, but this is not the case for stroke, where no prior and posterior data sequences are available. Producing images from data not normalized over time is much more demanding, and we have shown that a non-linear method is needed [10]. Whereas previous linear solutions with simple head models could be reconstructed on a PC in a few seconds [11], the current non-linear method would take about 100 hours on a well-specified PC for each image. In some applications of EIT, such as imaging epileptic seizures, many images need to be acquired, so this poses an unacceptable delay.
1.2. Purpose and Experimental Design
The purpose of this paper is to assess whether a GRID resource [12–14], the Condor platform [15], which distributes computational tasks over heterogeneous computing resources, offers a significant improvement in this timing bottleneck. For this work, a cluster of 920 PCs available in different departments at UCL was used. For this study, we present two examples: wavelet analysis of epileptic activity in the EEG and a non-linear EIT reconstruction algorithm. Both were written in the proprietary Matlab software; this posed a practical problem, as specific Matlab libraries had to be installed on the remote PC cluster before use. Apart from the quantitative analysis, we evaluated the advantages and disadvantages of all the practicalities related to this work.
1.3. Explanation of Condor GRID Middleware
1.3.1. Physical UCL-Condor Architecture
The Condor architecture can have many different configurations but there are three essential roles (Figure 2): the user’s machine, the central submitter and the executing nodes.
Figure 2. Typical Condor physical architecture.
The submitter machine is a node in the cluster from which the input data is sent to the nodes. The executing nodes are computers on the cluster that are configured for running jobs. Once the files are transferred from the user’s machine to the central submitter (step 1 in Figure 2), the execution is launched and the Condor system sends the jobs to different nodes on the cluster (step 2). Once the jobs are finished, the central submitter retrieves them (step 3) and finally the results are transferred to the user’s machine (step 4).
1.3.2. Logical Condor Architecture
The three upper logical components (Figure 3), Application, Application Agent and Customer Agent, are on the client side, in the central submitter, and the four lower components (Owner Agent, Remote Execution Agent, Local Resource Agent and Resource) are on the resource side. In the middle, between the two sides, the Matchmaker marries up the components on either side. On the resource side, the information about available resources is constantly updated in the four corresponding components, which have different levels of abstraction. At the end of this chain of information is the Owner Agent, which contains the knowledge regarding the number of available nodes and their technical characteristics, such as memory, operating system and processor speed. It regularly transmits this information to the Matchmaker (action 1 in Figure 3). Once the client sends tasks to the pool, the Customer Agent sends the appropriate information, such as the number of tasks and the amount of memory and operating system needed, to the Matchmaker (action 1 in Figure 3). The Matchmaker then makes a decision as to which tasks will be executed, selects the resources, and allows direct communication between them (action 2). Afterwards, a chain reaction is started, which allows the communication between peers on different sides of the Matchmaker. At the end (action 6) the user of the application contacts the resource.
1.3.3. Relevant Concepts about Condor Related to Our Work
In this study, Condor performance was assessed against a well-specified PC under two different conditions: full execution and satisfactory execution.
Figure 3. Key logical objects in Condor architecture.
Full execution is defined as the case in which all the jobs sent to the Condor pool are completed. Satisfactory execution is when 70% of the jobs are completed. Since interpolation techniques can be applied to the algorithm that assembles all the finalized tasks, successful EIT reconstructions can be achieved with 70% of the information.
2. Methods
2.1. UCL-Condor Pool Specifications
The UCL-Condor pool has 920 nodes (Pentium 3, 1 GHz CPU, 256 MB RAM and 750 MB of free disk space; operating system: Windows 2000). These machines are distributed around the UCL campus and are used by the students.
2.2. Matlab® Libraries
The tests presented in this work included executable files coded in Matlab [16]. Unfortunately, none of the nodes in the pool had Matlab installed. To be able to run a Matlab executable file on a machine without Matlab installed, some DLL libraries had to be present in a local directory. Matlab’s executable file (mglinstaller.exe) deploys these libraries onto the local machine, and the directory into which they were deployed had to be added to the PATH variable of the machine [16].
2.3. First Test
In this test, a set of 100 tasks, each of which normally took 74 minutes on a well-specified PC (an Intel Pentium 4, 3 GHz, 512 MB RAM), was sent to the Condor pool. Each task was a Matlab® algorithm which performed wavelet analysis [17,18] on EEG data for epileptic spike detection. The algorithm implemented a Daubechies 4 wavelet analysis with a window of 1024 samples. It was run on a single channel of EEG, lasting 22 min and sampled at 200 Hz, acquired from a patient with epilepsy. To run this algorithm on the executing nodes, a batch file executed five steps: 1) deploy the necessary Matlab libraries on the node (mglinstaller.exe), 2) set up the PATH variable, 3) run the desired executable file, 4) retrieve the processed data and, finally, 5) clean up and restore the node to its initial condition.
2.4. Second Test
This was a reconstruction of multi-frequency EIT tank data using a non-linear absolute imaging reconstruction method. A set of 300 processes, each normally taking 2.5 hours on the above well-specified PC, was sent to the Condor pool. The reconstruction algorithm is a regularized search-direction non-linear Polak-Ribière Conjugate Gradients solution for the production of EIT images of the head [9]. It employed a forward solution in which the head is modeled as a finite element mesh (Figure 4) of 25000 elements with three layers, representing the scalp, skull and brain [19]. The conductivity values in this mesh were iteratively calculated in order to minimize the cost function against the boundary voltages measured with 32 electrodes in a saline-filled tank, using a non-linear conjugate gradients method. 300 electrode combinations were used and the computation was stopped at 25 iterations.
Figure 4. Mesh of the brain (left) and the head (right) showing the position of the electrodes.
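For the first test (Section 2.3), the per-task computation can be illustrated as follows. This is a minimal sketch assuming NumPy and the PyWavelets package rather than the Matlab implementation actually distributed over the pool, and the simple threshold rule for flagging candidate spikes is a hypothetical stand-in for the real detection criteria.

```python
import numpy as np
import pywt

def detect_spike_windows(eeg, fs=200, win=1024, level=4, k=5.0):
    """Flag windows whose Daubechies-4 detail coefficients contain
    unusually large excursions (a crude stand-in for spike detection)."""
    flagged = []
    for start in range(0, len(eeg) - win + 1, win):
        segment = eeg[start:start + win]
        # Multilevel discrete wavelet transform with the Daubechies 4 wavelet.
        coeffs = pywt.wavedec(segment, 'db4', level=level)
        # The detail coefficients carry the sharp, short-lived deflections
        # typical of epileptic spikes.
        details = np.concatenate(coeffs[1:])
        if np.max(np.abs(details)) > k * np.std(details):
            flagged.append(start / fs)      # onset time of the window, in seconds
    return flagged

if __name__ == "__main__":
    fs = 200
    t = np.arange(0, 22 * 60, 1.0 / fs)     # 22 minutes at 200 Hz
    eeg = np.random.randn(t.size)           # placeholder for one EEG channel
    eeg[int(600 * fs)] += 25.0              # synthetic spike at t = 600 s
    print(detect_spike_windows(eeg, fs))
```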
3. Results
In the first test, EEG analysis, while the execution of the 100 files (74 minutes each) would have required 7400 minutes (5.1 days) on a well-specified PC, Condor finished a full execution in 297 minutes. This was therefore 4 times the execution time for a single file, but still 25 times faster than a serial execution on a well-specified PC. In the second test, EIT image reconstruction, the 300 tasks would have required approximately 750 hours (31.2 days) on a well-specified PC, whereas Condor completed a full execution in 1128 minutes (18.8 hours), which is 39.9 times faster. A satisfactory execution was completed in 346 minutes, which is 130 times faster than on a well-specified PC.
4. Discussion
Overall, use of the Condor pool substantially reduced the time taken to process these applications. Although the speed-up was between 25 and 130 times, Condor’s performance was slower than expected. The limitation in speed was probably mainly due to:
a) Load of the pool. Since the pool was used simultaneously by other jobs, in practice this clearly diluted the processing time available compared to individual processing of one job by one PC.
b) Absence of checkpoints. The Condor procedure lacked checkpoints, so that if processor sharing interrupted a job before completion, it had to be restarted. This occurred in about 5% of jobs. Although this figure is low, it has a disproportionately large effect on total processing time because of the parallel execution.
c) Job allocation. Condor normally allocates jobs randomly, irrespective of size, so that demanding ones may be allocated to a loaded machine.
Implementation proved surprisingly time-consuming. There were four technical issues:
a) We had to obtain an account on the central submitter and a security certificate.
b) We were operating under Matlab, which was not installed on any available cluster. We therefore had to obtain permission and establish a procedure to deploy Matlab libraries dynamically and delete them once the jobs were finished.
c) Usually, the prototyping stage is carried out inside the Matlab environment. Condor required independent executable files, so time had to be spent in producing these with all appropriate libraries and variables inserted.
d) Details of job execution, for the procedures used here, were not straightforward, as some continual updating of output files was needed for the iterative procedures in the EIT reconstruction. Although the Condor Job Description Language is easy to implement for simple strategies, this adaptation proved time consuming.
The support for Condor issues, through the condor-users list [20], was excellent, and the documentation is complete and has details and examples. A Matlab wrapper has recently become available, which should allow us to submit Condor jobs from inside the Matlab environment without paying attention to the pre-submission tasks [21]. Since the pre-submission tasks, such as compilation of the Matlab file, its transfer to the central submitter and preparation of the special submission file, are time consuming and prone to errors, it is likely that the use of the Matlab wrapper will confer considerable improvement in processing time. Another planned technical improvement is that Condor permits the user to implement special criteria in job submission, such as looking for nodes with specific memory and processor speed and a low probability of being diverted during execution.
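To give a flavour of such submission criteria, the sketch below generates a Condor submit description and hands it to condor_submit. The executable and file names are hypothetical, and the requirements and rank expressions use example ClassAd attributes (Memory, OpSys, KFlops) that should be checked against what the pool actually advertises.

```python
import subprocess
from pathlib import Path

# Illustrative submit description for a batch of 300 EIT reconstruction tasks.
SUBMIT = """\
universe                = vanilla
executable              = eit_reconstruction.exe
arguments               = $(Process)
transfer_input_files    = mglinstaller.exe, job_$(Process).mat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
requirements            = (Memory >= 256) && (OpSys == "WINNT50")
rank                    = KFlops
output                  = out.$(Process)
error                   = err.$(Process)
log                     = tasks.log
queue 300
"""

def submit_tasks():
    Path("tasks.sub").write_text(SUBMIT)
    # condor_submit must be available on the central submitter.
    subprocess.run(["condor_submit", "tasks.sub"], check=True)

if __name__ == "__main__":
    submit_tasks()
```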
The tasks employed in this initial study were relatively simple but this important improvement in the processing power
will allow us to test more complicated meshes that had been replaced by simpler ones because of the processing time bottleneck. We plan to use non-linear methods, both for reconstruction of EIT images of brain function in stroke and epilepsy, and for automated analysis of the EEG in epilepsy and other neurological conditions. We anticipate that use of the GRID will greatly enhance this work, as it will give us the opportunity to test more sophisticated and powerful analysis algorithms.
Acknowledgements The authors wish to thank Clovis Chapman, the UCL-Condor pool’s administrator for providing support with the submission of the Matlab® code over the pool and Dr. Andrea Romsauerova for the EIT data. This work was supported by the BBSRC under an e-science grant.
References
[1] Litt B, Echauz J. Prediction of epileptic seizures. Lancet Neurol. 2002 May;1(1):22-30.
[2] Qu H. et al. A patient-specific algorithm for the detection of seizure onset in long-term EEG monitoring: possible use as a warning device. IEEE Trans Biomed Eng. 1997 Feb;44(2):115-22.
[3] Holder D. Electrical Impedance Tomography: Methods, history and applications. Institute of Physics Publishing: London, 2004. ISBN: 0750309520.
[4] Barber D., Brown B., Freeston I. Imaging Spatial Distributions of Resistivity Using Applied Potential Tomography (1983). Electron. Lett. 19: 933–935.
[5] Barber D. Image Reconstruction in Applied Potential Tomography, Electrical Impedance Tomography (1990). Internal Report, Department of Medical Physics and Clinical Engineering, University of Sheffield.
[6] Geselowitz D. An Application of Electrocardiographic Lead Theory to Impedance Plethysmography (1971). IEEE Trans. Biomed. Eng. 18: 38–41.
[7] Kotre C. A Sensitivity Coefficient Method for the Reconstruction of Electrical Impedance Tomograms (1989). Physiological Measurements 10: 275–281.
[8] Kotre C. EIT Image Reconstruction Using Sensitivity Weighted Filtered Back Projection (1994). Physiological Measurements 15 (Suppl. 2A): 125–136.
[9] Horesh et al. Beyond the Linear Domain: The Way Forward in MFEIT Image Reconstruction of the Human Head. Proceedings ICEBI XII, Gdansk, 2004, pages 499–502.
[10] Yerworth R. et al. Robustness of Linear and Nonlinear Reconstruction Algorithms for Brain EIT: Non-linear – Is It Worth the Effort? Proceedings ICEBI XII, Gdansk, 2004, pages 683–686.
[11] Tidswell T. Three-Dimensional Electrical Impedance Tomography of Human Brain Activity. NeuroImage, Volume 13, Issue 2, February 2001.
[12] Foster I. “What is the GRID?” GridToday, July 22, 2002: Vol. 1, No. 6.
[13] Foster I., Kesselman C. et al. “The Anatomy of the Grid.” Globus project, technical papers. http://www.globus.org/research/papers/anatomy.pdf.
[14] Foster I. et al. “The Physiology of the Grid.” Globus Project, technical papers. http://www.globus.org/research/papers/ogsa.pdf.
[15] Condor Project. http://www.cs.wisc.edu/condor/.
[16] Matlab site. www.mathworks.com. “Building Stand-Alone Applications”.
[17] Aboufadel E., Schlicker S. Discovering wavelets. ISBN: 0471331937.
[18] Burrus S. et al. Introduction to wavelets and wavelet transforms. ISBN: 0134896009.
[19] Liston et al. A multi-shell algorithm to reconstruct EIT images of brain function. Physiological Measurements 23(1) (2002): 105–119.
[20] Condor users mailing list. http://www.cs.wisc.edu/condor/mail-lists/.
[21] Eres M. “User Deployment of Grid Toolkits to Engineers”. All Hands Meeting, Nottingham, September 2004.
Parametric Optimization of a Model-Based Segmentation Algorithm for Cardiac MR Image Analysis: A Grid-Computing Approach S. ORDAS a,1, H.C. VAN ASSEN b, J. PUENTE c, B.P.F. LELIEVELDT b and A.F. FRANGI a a Computational Imaging Laboratory, Universitat Pompeu Fabra, Barcelona, Spain b Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands c GridSystems, S.A., Mallorca, Spain Abstract. In this work we present a Grid-based optimization approach performed on a set of parameters that affects both the geometric and grey-level appearance properties of a three-dimensional model-based algorithm for cardiac MRI segmentation. The search for optimal values was assessed by a Monte Carlo procedure using computational Grid technology. A series of segmentation runs were conducted on an evaluation database comprising 30 studies at two phases of the cardiac cycle (60 datasets), using three shape models constructed by different methods. For each of these model-patient combinations, six parameters were optimized in two steps: those which affect the grey-level properties of the algorithm first and those relating to the geometrical properties, secondly. Two post-processing tasks (one for each stage) collected and processed (in total) more than 70000 retrieved result files. Qualitative and quantitative validation of the fitting results indicates that the segmentation performance was greatly improved with the tuning. Based on the experienced benefits with the use of our middleware, and foreseeing the advent of large-scale tests and applications in cardiovascular imaging, we strongly believe that the use of Grid computing technology in medical image analysis constitutes a real necessity. Keywords. Parametric optimization, Grid computing, Statistical model-based segmentation, MRI analysis
Introduction
In the last few years, many model-based approaches for image segmentation have contributed to the rapidly evolving field of medical image analysis. The rationale behind these methods is to analyze the image in a top-down fashion: a generic template model of the structure of interest is subsequently instantiated and deformed to accommodate the clues provided by image information. For a cardiac application, it is possible to
1 Correspondence to: Sebastián Ordás, Computational Imaging Laboratory, Department of Technology, Universitat Pompeu Fabra, Passeig de Circumval.lacio 8, E08003 Barcelona, Spain. e-mail: [email protected].
learn the shape statistics of a heart from a population of healthy and/or diseased hearts, and construct a compact and specific anatomical model. Statistical model-based approaches work in this way, and are able to provide constraints that allow for efficiently handling situations with substantial sparse or missing information. Statistical models of shape (ASM) [1] and appearance (AAM) [2] variability are two model-driven segmentation schemes very popular in medical image analysis. In building statistical models, a set of segmentations of the shape of interest is required, as well as a set of corresponding landmarks defined over them. An ASM comprises a shape and an appearance model. The former primarily holds information about the shape and its allowed variations in a Point Distribution Model (PDM), determined by a Principal Component Analysis (PCA). The latter is responsible of learning grey-level patterns from the training set image data, which are to be compared against those identified in a new (unknown) image during the fitting stage. The algorithm therefore consists of an iterative process in which the appearance model looks into the image for new candidate positions to deform the shape model, and the shape model applies statistical geometric constraints in order to keep the deformation process always within legal statistical limits. ASMs and AAMs are currently becoming a popular topic of research towards three-dimensional (3D) cardiac image segmentation. First approaches like [3,4] and [5,6] have evidenced an encouraging performance in segmenting the left ventricle in 3D datasets of MR/US and MR/CT, respectively. In the work that we present here, we describe the methodology employed to find the optimal set of parameters for the 3D-ASM segmentation algorithm of van Assen et al. [6] that uses a fuzzy inference system in the appearance model. Moreover, by running the segmentation tests with three different shape models, we aimed to explore to what extent the use of our method for automatically building shape models [7,8] does really improve the segmentation performance of a model-based fitting approach. The way our Grid middleware is designed to work allowed for quite an easy and general methodology for setting up, running, and post-processing the results of the parametric optimization. After presenting the general ideas behind the algorithm, we provide some comments on our experience with the use of Grid computing, and conclude the paper with a discussion. 1. Active Shape Modeling 1.1. Shape Model
Figure 1. Some plausible instances of the shape model. The shapes are generated by randomly setting the shape parameters within the limits of $\pm 3\,\mathrm{SD}$ ($\mathrm{SD} = \sqrt{\lambda_i}$) from their mean values in the training set.

Consider a set $X = \{x_i;\ i = 1 \ldots n\}$ of $n$ shapes. Each shape is described by the concatenation of $m$ 3D landmarks $p_j = \{p_{1j}, p_{2j}, p_{3j}\},\ j = 1 \ldots m$, obtained from a surface triangulation. $X$ is thus a distribution in a $3m$-dimensional space. This representation allows approximating any shape by using the following linear model

$x = \bar{x} + \Phi b$   (1)

where $\bar{x}$ is the average landmark vector, $b$ is the shape parameter vector of the model, and $\Phi$ is a matrix whose columns are the principal components of the covariance matrix $S = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$. The principal components of $S$ are the eigenvectors $\phi_i$ with corresponding eigenvalues $\lambda_i$ (sorted so that $\lambda_i \geq \lambda_{i+1}$). If $\Phi$ contains only the first $t < \min\{m, n\}$ eigenvectors corresponding to the largest non-zero eigenvalues, we can approximate any shape of the training set $X$ using Equation 1, where $\Phi = (\phi_1\ \phi_2\ \ldots\ \phi_t)$ and $b$ is a $t$-dimensional vector given by $b = \Phi^T(x - \bar{x})$. Assuming that the cloud of landmark vectors follows a multi-dimensional Gaussian distribution, the variance of the $i$-th parameter $b_i$ across the training set is given by $\lambda_i$. The samples are normalized with respect to a reference coordinate frame, eliminating differences across objects due to rotation, translation and size. Once the shape samples are aligned, the remaining differences are solely shape related. By varying the model parameters $b_i$ in Equation 1, different instances of the shape class under analysis will result. Finally, by applying limits to the variation of $b_i$,

$|b_i| \leq \beta \sqrt{\lambda_i}$   (2)
with β usually less than 3, it can be enforced that a generated shape is similar to the shapes contained in the training set. See Figure 1 for some examples of valid shapes generated by the model. 1.2. Appearance Model In order to deform the shape model, candidate points are collected from the image data using a decision scheme based on the Takagi-Sugeno Fuzzy Inference System (FIS) [9]. For a complete description of this decision scheme, we refer to [6]. The Fuzzy C-Means (FCM) within the FIS yields three membership distributions for the tissues: air, myocardium and blood. In the inference part of the FIS, crisp values per
Figure 2. Tissue classification threshold levels for the FCM.
pixel are derived based on the fuzzy membership distributions. This is performed using two kinds of thresholds (see Figure 2): 1. Gray level threshold: first, a threshold is placed in the membership distributions, marking a part that is attributed to a tissue class irrespective of its membership value for that class. All pixels with a gray value below the threshold are assigned to the air class. The position of the threshold is set at a preset proportion between the tissue class centers of the air and myocardium classes resulting from the FCM. 2. Membership degree thresholds: the gray level threshold above divides the membership distributions from the FCM into two parts, assigning tissue labels to the lower part. The remaining part is classified using membership threshold levels: pixels with a gray value in this part are assigned the tissue label of the class with the highest membership value at that particular gray value, provided this membership value is above the threshold for that tissue. Pixels whose highest membership value does not reach the corresponding tissue membership threshold are left unclassified. 1.3. Using Sectors in FCM In the appearance model, the FCM fits a number of (Gaussian) gray value distributions to a combined histogram of image patches extracted from the studied dataset. To ensure a large enough population of gray values for a stable application of FCM, all patches from all intersections of the model mesh with the image slices are collected. However, in many datasets, intensity inhomogeneity was observed within tissues. These effects can severely hamper the classification of the tissues based on sampled gray value histograms. To prevent this, the shape model was organized in sectors, by assignment of labels. Thus, different sectors can be assigned different rule sets in the FIS, and (combinations of) different sectors can be clustered in separate FCM operations. Since intensity inhomogeneity within one sector is limited, its effects on the fuzzy clustering outcome are diminished. In total, seven sectors were defined on the two surfaces of the model (see Figure 3). After grouping sectors with approximately the same level of inhomogeneity, five groups remained, leading to five different FCM operations to cover the whole cardiac shape. Table 1 shows the combinations of sectors with respect to FCM operations.

Table 1. Combinations of sectors in FCM operations.

FCM operation | Sectors | Anatomical location
0 | 1,2,3 | Epicardial lateral wall
1 | 4 | Epicardial RV wall
2 | 5 | Epicardial apex
3 | 6 | Endocardial lateral wall
4 | 7 | Endocardial septum
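To make the two-stage decision scheme of Section 1.2 concrete, the sketch below clusters the pooled gray values of one sector with a small, self-contained fuzzy c-means routine and then applies the gray-level and membership-degree thresholds. It is only an illustration of the idea, not the authors' FIS implementation: the Takagi-Sugeno rule sets and the per-sector grouping of Table 1 are omitted, the 0.5 proportion used to place the gray-level threshold between the air and myocardium centers is an assumption, and the default membership thresholds simply echo the ED optima of Table 2.

```python
import numpy as np

def fuzzy_cmeans_1d(x, c=3, m=2.0, n_iter=100, eps=1e-10):
    """Minimal 1-D fuzzy c-means: returns cluster centres (sorted ascending,
    i.e. air < myocardium < blood) and the membership matrix u of shape (c, N)."""
    rng = np.random.default_rng(0)
    centres = rng.choice(x, size=c, replace=False).astype(float)
    for _ in range(n_iter):
        d = np.abs(x[None, :] - centres[:, None]) + eps              # (c, N) distances
        u = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
        centres = (u ** m @ x) / np.sum(u ** m, axis=1)              # weighted means
    order = np.argsort(centres)
    return centres[order], u[order]

def classify_pixels(x, centres, u, grey_prop=0.5, mem_thr=(0.5, 0.05, 0.2)):
    """Two-stage decision: a gray-level threshold between the air and myocardium
    centres, then per-tissue membership thresholds (air, myocardium, blood).
    Returns labels: 0 = air, 1 = myocardium, 2 = blood, -1 = unclassified."""
    labels = np.full(x.shape, -1, dtype=int)
    grey_thr = centres[0] + grey_prop * (centres[1] - centres[0])
    labels[x < grey_thr] = 0                                         # below threshold -> air
    rest = x >= grey_thr
    best = np.argmax(u[:, rest], axis=0)                             # most likely tissue
    high_enough = u[:, rest].max(axis=0) >= np.asarray(mem_thr)[best]
    labels[rest] = np.where(high_enough, best, -1)                   # else unclassified
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    grey = np.concatenate([rng.normal(mu, 10, 500) for mu in (30, 110, 200)])
    centres, u = fuzzy_cmeans_1d(grey)
    counts = np.bincount(classify_pixels(grey, centres, u) + 1, minlength=4)
    print("unclassified/air/myocardium/blood:", counts)
```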
2. Matching Procedure In summary, the 3D-ASM matching can be described as follows. The mean shape of the geometrical model is placed in the target dataset with a given initial scale and pose. The image planes intersect the model's sub-part surface meshes (two in our case), yielding stacks of 2D contours. These contours are composed of the intersections of the image planes with individual mesh triangles. Candidate displacements (update vectors) are propagated to the nodes in the mesh (see Figure 3). To facilitate through-plane motion in the model, they are projected on the normals of the model mesh, which also have a component perpendicular to the imaging planes. The current model state is aligned to the proposed model state resulting from the model update information using the method of Besl and McKay [10] for 3D point clouds, effectively eliminating scaling, translation and rotation differences. The residual shape differences are projected on the shape model subspace, yielding a model parameter vector. Thus, the proposed shape is approximated by the shape model using Equation 1, within statistical limits set beforehand by Equation 2. The steps above are repeated either for a fixed number of iterations, or until convergence is achieved.
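The statistical core of this loop, Equation 1 together with the parameter clamp of Equation 2, is compact enough to show in a few lines. The NumPy sketch below, with invented variable names and synthetic data, builds a PDM from aligned landmark vectors and constrains a proposed shape to the model subspace; it illustrates the shape-constraint step only and is not the authors' 3D-ASM code.

```python
import numpy as np

def build_pdm(X, t):
    """Build a Point Distribution Model from aligned shape vectors.
    X: (n_shapes, 3m) array of concatenated 3D landmarks."""
    x_mean = X.mean(axis=0)
    # PCA via SVD of the centred data; eigenvalues of the covariance matrix
    _, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    eigvals = (s ** 2) / (X.shape[0] - 1)           # lambda_i, sorted descending
    Phi = Vt[:t].T                                  # first t principal components
    return x_mean, Phi, eigvals[:t]

def constrain_shape(x_prop, x_mean, Phi, eigvals, beta=3.0):
    """Project a proposed shape onto the model subspace and clamp b (Eq. 2)."""
    b = Phi.T @ (x_prop - x_mean)                   # b = Phi^T (x - x_mean)
    limit = beta * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)                   # |b_i| <= beta * sqrt(lambda_i)
    return x_mean + Phi @ b                         # Eq. 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 3 * 50))               # 30 synthetic shapes, 50 landmarks
    x_mean, Phi, lam = build_pdm(X, t=10)
    proposal = X[0] + rng.normal(scale=0.5, size=X.shape[1])
    x_fit = constrain_shape(proposal, x_mean, Phi, lam, beta=2.0)
    print(x_fit.shape)
```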
3. Parametric Optimization 3.1. Parameters Related to the Shape Model For every intersection of the model mesh with an image plane, a model update is computed as explained before. For every single update, the possibility exists that a non-optimal or even an erroneous position results. To diminish the effects of erroneous updates, the update itself, which acts as a force on a model surface, is smeared out over a small region of the model surface around the source location. Thus, faulty or less reliable updates can be compensated for by a number of neighboring updates. The contribution of a single model update to other updates is weighted with a Gaussian kernel, with the actual weight depending on the geodesic distance along the mesh edges to the source
Figure 3. Shape model sectors (left): (1-3) epicardial lateral wall, (4) epicardial RV wall, (5) epicardial apex, (6) endocardial lateral wall, (7) endocardial RV wall. Update propagation from a single source (right), with the extent limited by a Gaussian kernel.
Table 2. Appearance model parameters, ranges and optimal values.

Parameter | Lower | Upper | Step Size | ED Optimal | ES Optimal
Appearance model parameters:
Blood | 0.1 | 0.5 | 0.1 | 0.2 | 0.4
Myocardium | 0.05 | 0.3 | 0.05 | 0.05 | 0.05
Air | 0.3 | 0.7 | 0.1 | 0.5 | 0.5
Shape model parameters:
Sigma, σ | 3 | 9 | 1 | 6 | 4
Extent, χ | 1 | 5 | 1 | 2 | 4
Beta, β | 1 | 3 | 1 | 2 | 1
location of the update. To limit the extent of the propagation of the updates, propagation is stopped either after a fixed number of edges, or when the weight attributed to a particular update is below a preset threshold (see Figure 3, right). The actual values for the standard deviation of the kernel (sigma, σ), the propagation level limit (extent, χ), and the number of standard deviations (beta, β) that each shape model parameter is allowed to vary (Equation 2), are the three shape-related parameters to optimize (see Table 2). 3.2. Parameters Related to the Appearance Model The three membership thresholds (horizontal in Figure 2) mentioned in Section 1.2 (for air, myocardium, and blood) constitute the three appearance-related parameters to optimize (t1, t2, t3). Their tuning ranges are specified in Table 2.
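The update-propagation scheme of Section 3.1 can be sketched as a bounded walk over the mesh edges, with each update contributing to its neighbours through a Gaussian of the accumulated edge-path (approximate geodesic) distance. The snippet below is a hedged illustration: the mesh representation, the weight cut-off value and the normalisation by the accumulated weight are our own choices, and the defaults sigma = 6 and extent = 2 simply echo the ED optima of Table 2.

```python
import numpy as np

def propagate_updates(updates, vertices, edges, sigma=6.0, extent=2, w_min=1e-3):
    """Smear per-node update vectors over a triangle mesh with a Gaussian kernel
    of along-edge distance.  updates, vertices: (n, 3) arrays; edges: (i, j) pairs."""
    n = len(vertices)
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        d = float(np.linalg.norm(vertices[i] - vertices[j]))
        nbrs[i].append((j, d))
        nbrs[j].append((i, d))

    acc = np.zeros_like(updates, dtype=float)
    wsum = np.zeros(n)
    for src in range(n):
        if not np.any(updates[src]):
            continue
        # Bounded walk: shortest along-edge distance within `extent` edges of the source.
        dist = {src: 0.0}
        frontier = [src]
        for _ in range(extent):
            nxt = []
            for node in frontier:
                for nb, d in nbrs[node]:
                    nd = dist[node] + d
                    if nb not in dist or nd < dist[nb]:
                        dist[nb] = nd
                        nxt.append(nb)
            frontier = nxt
        # Accumulate the Gaussian-weighted contribution of this update.
        for node, d in dist.items():
            w = float(np.exp(-d * d / (2.0 * sigma * sigma)))
            if w >= w_min:                       # drop negligible contributions
                acc[node] += w * updates[src]
                wsum[node] += w

    out = updates.astype(float).copy()
    mask = wsum > 0
    out[mask] = acc[mask] / wsum[mask, None]     # weighted average of received updates
    return out

if __name__ == "__main__":
    verts = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], float)
    edges = [(0, 1), (1, 2), (2, 3)]
    upd = np.zeros((4, 3))
    upd[0] = [0, 0, 1]
    upd[3] = [0, 0, -1]
    print(propagate_updates(upd, verts, edges, sigma=1.0, extent=2))
```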
Figure 4. 3D-ASM segmentation. Search for candidate points at the initialization (left) and convergence (right) stages.
3.3. Fixed Settings The 3D-ASM was set to run for a fixed number of iterations (N=100) using 60 modes of variation (more than 95% of the total variance of the three shape models tested). The shape resulting from the final iteration was always taken for assessment of its point-to-surface (P2S) error with respect to its manually segmented counterpart. The optimal settings for the algorithm were chosen based on the unsigned P2S distance measures averaged over the complete evaluation database. 3.4. Evaluation Dataset The dataset used for the segmentation tests comprised 30 subjects. Fifteen were short axis scans of healthy volunteers acquired at the Leiden University Medical Center (Leiden, The Netherlands) using the balanced FFE protocol on a Philips Gyroscan NT Intera 1.5 T MR scanner (Philips Medical Systems, Best, The Netherlands). The slice thickness was 8 mm, with a slice gap of 2 mm. The in-plane pixel resolution was 1.36 × 1.36 mm². The other fifteen studies corresponded to patients from CETIR Sant Jordi Cardiovascular MR Centre (Barcelona, Spain) acquired using a General Electric Signa CV/i 1.5 T scanner (General Electric, Milwaukee, USA) with the FIESTA scan protocol. The slice thickness was 8–10 mm with an in-plane pixel resolution of 1.56 × 1.56 mm². Selected patients suffered from common cardiac diseases: myocardial infarction (10), hypertrophy (2), and pericarditis (3). Expert segmentations were manually drawn along the epicardial (LV-epi) and endocardial (LV-endo) borders of the left ventricle (LV), at two phases of the cardiac cycle: End Systole (ES) and End Diastole (ED). Manual segmentations usually have an average intra- and inter-observer variability in the range of 1–2 mm. Nevertheless, they generally constitute the gold standard for assessing the achieved accuracy. 4. Grid Computing Approach Our Grid middleware platform is InnerGrid Nitya (GridSystems, Mallorca, Spain). An important aspect of this framework is that no special knowledge about Grid computing or particular hardware is required. This allows for easily setting up the system
Figure 5. 3D-ASM segmentation. Initial (left) and final (right) states of the LV-epi subpart. The triangulated shape corresponds to the fitted shape and the surfaced shape, to the manual one.
on a federation of off-the-shelf computers. The deployment of applications or services is usually straightforward and only requires some programming skills if it is intended to include the Grid in a more complicated workflow comprising both distributed and non-distributed parts. The main components of the system are the Server, which is the coordinator of the Grid; the Agents, which are the pieces of software installed on all the machines federated in the Grid; and the Desktop Portal, a web graphical interface to the Grid and the applications running on it. The topology of the middleware is therefore that of a Server connected to a large group of Agents that can be compiled for several platforms. When a service is requested, the Server looks at the information it receives from the Agents and decides which machines are able and idle enough to do the job related to that service. The Agents automatically receive all the elements they need, suitable for each operating system, and perform the requested task (which can be a part of or the whole service). This includes any files, executables or parameters that might be needed. The Agents take care of managing all those components and perform their corresponding tasks. After everything is completed, the Server gathers the results and processes them. All this is hidden from the user and controlled by the Grid middleware. The Grid service runs on a 45-node dual Xeon (2.8 GHz CPU, 2 GB RAM) cluster, under Linux Red Hat 9, and is accessible through a web interface (Grid desktop). As a whole, the cluster represents more than 103 Gflops of computing power. As each node provides two Agents to the Grid, a total of 90 Agents were available. While all the nodes used belong to the same computer cluster, the addition of nodes from other clusters is straightforward. For setting up the experiments, 60 datasets (30 patients at ED and ES) and 6 shape models (three construction methods and a different shape model for ED and ES) were uploaded to the server. The tuning of the parameters that affect the appearance model and the beta parameter was run first. The remaining parameters were fixed at χ = 3 and σ = 6 (values expected beforehand to be not far from optimal). The process produced approximately 64800 results and lasted 1.5 days. The optimal set of parameters of this first step was used in a second run for tuning the other two parameters that affect the shape model. This second run lasted 0.9 days and produced 5670 result files. In both runs, the post-processing was done on a local machine but could have been performed on the Grid server itself, as a post-processing task. Each of the collected result files was in fact a compressed folder (.tgz) comprising log files (with the segmentation errors) and a series of intermediate results and shape model instances, intended for event tracking and visualization.
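The first optimization step is essentially an exhaustive sweep of the Table 2 ranges over every dataset/shape-model combination, packaged as independent Grid tasks. The sketch below only generates such a task list with itertools; the task dictionary format is invented for illustration (InnerGrid's actual service definition is not shown in the paper), and because the exact pairing of datasets with phase-specific shape models is not spelled out, the number of tasks it produces will not necessarily match the roughly 64,800 results reported above.

```python
import itertools
import numpy as np

def grid(lower, upper, step):
    """Inclusive list of parameter values between lower and upper."""
    return np.round(np.arange(lower, upper + step / 2, step), 3).tolist()

# Tuning ranges from Table 2: appearance thresholds plus beta in step 1,
# with sigma and extent held fixed at values expected to be near-optimal.
STEP1 = {
    "t_blood":      grid(0.1, 0.5, 0.1),
    "t_myocardium": grid(0.05, 0.3, 0.05),
    "t_air":        grid(0.3, 0.7, 0.1),
    "beta":         grid(1, 3, 1),
}
FIXED_STEP1 = {"sigma": 6, "extent": 3}

def make_tasks(datasets, shape_models, param_grid, fixed):
    """Yield one task description per (dataset, shape model, parameter combination)."""
    names = sorted(param_grid)
    for ds, model, values in itertools.product(
            datasets, shape_models,
            itertools.product(*(param_grid[k] for k in names))):
        params = dict(zip(names, values), **fixed)
        yield {"dataset": ds, "shape_model": model, "params": params}

if __name__ == "__main__":
    datasets = [f"study{i:02d}_{ph}" for i in range(30) for ph in ("ED", "ES")]
    models = ["model_A", "model_B", "model_C"]        # hypothetical model names
    tasks = list(make_tasks(datasets, models, STEP1, FIXED_STEP1))
    print(len(tasks), "segmentation runs generated for step 1")
```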
Figure 6. 3D-ASM segmentation. Initial (left) and final (right) states of the LV-endo subpart. The triangulated shape corresponds to the fitted shape and the surfaced shape, to the manual one.
5. Quantitative Assessment The performance assessment analysis was performed using the model state (i.e. position, orientation, scale and shape) from the last matching iteration of the fitting algorithm. Errors were measured by computing the mean unsigned P2S distances from regularly sampled points on the manually segmented contours to the surfaces of the fitted shapes. Two patient datasets were discarded from the assessment because their automatic segmentations were not comparable to the quality of the rest (for all models). The uncorrected field inhomogeneity in one case and a severe pericarditis in the other confounded the algorithm for any combination of the tuned parameters. Table 3 presents the automatic segmentation results, distinguishing between the shape model subparts (LV-endo and LV-epi). Values correspond to the shape model with the best performance (in fact, the autolandmarked shape model presented in [7]). The percentage of improvement achieved by the optimization with respect to the previously achieved segmentation accuracy using ad hoc settings is also indicated in brackets. The third row corresponds to the error that the shape model would have for reconstructing the shapes (either ED or ES) if their exact positions were known. The automatic results are well within clinically acceptable margins. Figure 4 shows an example of the initial and final model states in a single slice, while Figures 5 and 6 show the same for the LV-epi and LV-endo subparts, respectively. The manual shape is built from the manual contours and rendered as a surface. Figure 7 shows a typical result of the achieved segmentation.
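The error measure can be reproduced approximately with a few lines of NumPy/SciPy: the fitted mesh surface is densely sampled with uniformly distributed barycentric points and a KD-tree returns, for each manually drawn contour point, the distance to the nearest sampled surface point. This is a sketch of the metric only, an approximation whose accuracy depends on the sampling density, and not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def sample_surface(vertices, faces, n_per_tri=20, seed=0):
    """Randomly sample points on a triangle mesh (uniform barycentric sampling)."""
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                   # (n_faces, 3, 3)
    r1 = np.sqrt(rng.random((len(faces), n_per_tri, 1)))
    r2 = rng.random((len(faces), n_per_tri, 1))
    pts = ((1 - r1) * tri[:, None, 0]
           + r1 * (1 - r2) * tri[:, None, 1]
           + r1 * r2 * tri[:, None, 2])
    return pts.reshape(-1, 3)

def mean_p2s(contour_points, vertices, faces):
    """Mean unsigned point-to-surface distance from manual contour points to a
    (densely sampled) fitted mesh surface."""
    surface_points = sample_surface(vertices, faces)
    distances, _ = cKDTree(surface_points).query(contour_points)
    return float(distances.mean())

if __name__ == "__main__":
    verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
    faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
    contour = np.array([[0.1, 0.1, 0.1], [0.5, 0.5, 0.5]])
    print("mean P2S:", mean_p2s(contour, verts, faces))
```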
6. Conclusions The presented work serves as an application example of Grid computing in a medical image analysis task: specifically, the exhaustive search for the optimal set of parameters of a three-dimensional model-based algorithm for cardiac MRI segmentation. The improvement in segmentation accuracy achieved with respect to the use of parameters expected beforehand to be not far from optimal was substantial (22–27.7% and 13–16.6% for ED and ES, respectively). From our experience with the middleware, we can
Figure 7. View across slices of the fitted shape.
say that the set-up for distributing such a large number of tasks was a matter of minutes. The system took advantage of each resource depending on its usage, without interfering with end users. When one of the computers in the Grid was not available, its tasks were automatically reassigned. Finally, the system collected the local output from all the units and made it available for download. A curious fact worth mentioning is that the whole procedure was carried out from The Netherlands using the Grid desktop (web portal), while the computer cluster was in Spain. In conclusion, we believe that scientific progress and derived clinical applications could be critically endangered by the lack of computational power or the poor scalability and high cost of traditional computational approaches. Grid technology solutions are quite valuable as they considerably shorten execution times of exhaustive searches and large-scale image processing, effectively enabling the sharing of computing resources between institutions. In the very active and evolving field of medical image analysis they have become a real necessity. References [1] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995. [2] T.F. Cootes, G.J. Edwards, and C.J. Taylor, "Active appearance models," Proc. European Conf. on Computer Vision, vol. 2, pp. 484–498, 1998. [3] S.C. Mitchell, J.G. Bosch, B.P.F. Lelieveldt, R.J. van der Geest, J.H.C. Reiber, and M. Sonka, "3D active appearance models: Segmentation of cardiac MR and ultrasound images," IEEE Trans Med Imaging, vol. 21, no. 9, pp. 1167–1179, 2002. [4] M.B. Stegmann, Generative Interpretation of Medical Images, Ph.D. thesis, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Apr. 2004. [5] H.C. van Assen, M.G. Danilouchkine, F. Behloul, H.J. Lamb, R.J. van der Geest, J.H.C. Reiber, and B.P.F. Lelieveldt, "Cardiac LV segmentation using a 3D active shape model driven by fuzzy inference," Montreal, CA, Nov. 2003, vol. 2876 of Lect Notes Comp Science, pp. 533–540, Springer Verlag. [6] H.C. van Assen, M.G. Danilouchkine, M.S. Dirksen, J.H.C. Reiber, and B.P.F. Lelieveldt, "A 3D-ASM driven by fuzzy inference: Application to cardiac CT and MR," IEEE Trans Med Imaging, 2004, submitted. [7] A.F. Frangi, D. Rueckert, J.A. Schnabel, and W.J. Niessen, "Automatic construction of multiple-object three-dimensional statistical shape models: Application to cardiac modeling," IEEE Trans Med Imaging, vol. 21, no. 9, pp. 1151–1166, 2002. [8] S. Ordas, L. Boisrobert, M. Bossa, M. Laucelli, M. Huguet, S. Olmos, and A.F. Frangi, "Grid-enabled automatic construction of a two-chamber cardiac PDM from a large database of dynamic 3D shapes," in IEEE International Symposium on Biomedical Imaging, 2004, pp. 416–419.
[9] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modeling and control," IEEE Transactions on Systems, Man and Cybernetics, vol. 15, no. 1, pp. 116–132, 1985. [10] P.J. Besl and N.D. McKay, "A method for registration of 3-D shapes," IEEE Trans Pattern Anal Machine Intell, vol. 14, no. 2, pp. 239–256, Feb. 1992.
The MAGIC-5 Project: Medical Applications on a Grid Infrastructure Connection Ivan DE MITRI 1 Dipartimento di Fisica, Università di Lecce, Italy Istituto Nazionale di Fisica Nucleare, Sezione di Lecce, Italy On behalf of the MAGIC-5 Collaboration Abstract. The main purpose of the MAGIC-5 collaboration is the development of Computer Aided Detection (CAD) software for Medical Applications on distributed databases by means of a GRID Infrastructure Connection. A prototype of the system, based on the AliEn GRID Services, is already available, with a central Server running common services and several clients connecting to it. It has already been successfully used for applications in mammography together with a specific CAD developed within the collaboration. Applications to the case of malignant nodule detection in lung CT scans are now being implemented, while the GRID services are also being applied to PET image analysis aimed at early diagnosis of Alzheimer disease. One of the future prospects of our project is the migration from AliEn to the EGEE/gLite middleware, which is likely to become a European standard and will certainly provide more sophisticated tools with respect to the present AliEn functionality. In this work the status of the project and its future prospects will be given, with particular attention to the data management and processing aspects. Medical applications carried out by the collaboration will also be described, together with the analysis of the results obtained so far. Keywords. Grid, Virtual Organization, Medical Image Processing, CAD system
1. Introduction It has been shown that screening programs are of paramount importance for early cancer diagnosis in asymptomatic subjects and the consequent mortality reduction. The development of Computer Aided Detection (CAD) systems would significantly improve the prospects for such screenings, by working either as a second reader to support the physician's diagnosis or as a first reader to select images with the highest cancer probability. The amount of data generated by such periodical examinations is so large that it cannot be managed by a single computing center. As an example, let us consider a mammographic screening program to be carried out in Italy: it should check a target sample 1 Correspondence to: Ivan De Mitri, Dipartimento di Fisica, Università di Lecce, Via per Arnesano, 73100 Lecce, Italy. Tel.: +39 832 297443; Fax: +39 832 325128; E-mail:
[email protected].
of about 6.8 million women in the 49–69 age range, thus implying 3.4 million mammographic exams per year. For an average data size of 60 MB/exam, the amount of raw data would be of the order of 200 TB/year. On a European scale, the data source would be comparable to one of the next generation High Energy Physics (HEP) experiments (1–2 PB/year). In addition, the image collection in a large scale screening program intrinsically creates a distributed database, involving both hospitals, where the data are recorded, and diagnostic centers, where the radiologists should be able to query and analyze all the images and the related data. This amount of data grows with time and a full transfer over the network would be large enough to saturate the available connections. On the other hand, making the whole database available, regardless of the data distribution, would provide several advantages. For example, a new CAD system could be trained on a much larger data sample, with an improvement of its performance in terms of both sensitivity and specificity. The CAD algorithms could be used as real time selectors of images with high cancer probability, with a remarkable reduction of the delay between acquisition and diagnosis. Moreover, the data associated with the images, also known as metadata, would be available to select the proper input for epidemiology studies or for the training of young radiologists. This framework requires huge distributed computing efforts, as in the case of the HEP experiments, e.g. the CERN/LHC collaborations. The best way to tackle these demands is to use GRID technologies to manage distributed databases and to allow real time remote diagnosis. This approach would provide access to the full database from everywhere, thus making large-scale screening programs possible. The MAGIC-5 project perfectly fits in this framework, as it aims at developing medical applications that make use of GRID Services, starting from a data model similar to that adopted by the ALICE collaboration [1]. In particular, the project is an evolution of a former activity – GP-CALMA [2] – that was mainly devoted to large scale mammographic screening. MAGIC-5 includes those aspects together with new efforts to develop a CAD system for the detection of malignant nodules in lung CT scans and the management of the related distributed database. A new application devoted to Alzheimer disease is now being implemented in the MAGIC-5 GRID infrastructure. From this point of view, the collaboration can be seen as one or more Virtual Organizations (VO), with common services (Data and Metadata Catalogue, Job Scheduler, Information System) and a number of distributed nodes providing computing and storage resources. There are three main differences with respect to the model applied to the HEP experiments, which can be summarized as follows: 1. some of the use cases require interactivity; 2. the network conditions do not allow the transfer of large amounts of data; 3. the local nodes (the hospitals) do not agree on the transfer of raw data to other nodes. According to these constraints, the MAGIC-5 approach to the implementation of a prototype is based on two basic tools: AliEn [3] for the management of the common services, and PROOF [4] for the interactive analysis of remote data without transfer.
A future prospect of our project is the migration from the AliEn middleware to the EGEE/gLite middleware which is likely to become a European standard and will certainly provide more sophisticated tools with respect to the present AliEn functionality. Data management and processing will be discussed in Section 2, while medical applications will be shown in the rest of the paper together with the description of the im-
Figure 1. A Screenshot from the GPCALMA AliEn WEB Portal. The acronym GPCALMA (Grid Platforms for Computer Aided Library for MAmmography) refers to the MAGIC-5 parent project. The site can be navigated through the left side frame. General Information about the AliEn project, the installation and configuration guides, the status of the VO Services can be accessed. The list of the core services is shown on the main frame, together with their status.
Figure 2. Simplified sketch of the screening use case (see text).
plemented CAD algorithms and their performance. Finally, the present status and the future plans of the project will be given in Section 4.
2. Data Management and Processing A dedicated AliEn Server for the MAGIC-5 Collaboration has been configured [2], with a central Server running common services and several clients connected to it. Figure 1 shows a screenshot from the WEB Portal.
The images can be acquired in any hospital belonging to the Collaboration: the data are stored on local resources and registered to a common service, known as the Data Catalogue, together with other related information, the metadata, required to select and access them at any future time. The result of a query can be used as input for the analysis through the various CAD systems, which are executed on nodes that are usually remote to the user, thanks to the ROOT/PROOF facility. A selection of the cancer candidates can be quickly performed and only images with high cancer probability would be transferred to the diagnostic sites and interactively analyzed by the radiologists. This approach avoids data transfers for images with a negative CAD response and allows an almost real time diagnosis for the images with high cancer probability. In order to make the images available to a remote Diagnostic Center, a mechanism able to identify the data corresponding to the exam in a site-independent way is used: the images are selected by means of a set of requirements on the attached metadata and identified through a Logical Name which is independent of the physical name on the local hard drive where they are stored. AliEn [3] implements these features in its Data Catalogue Services, run by the Server: data are registered making use of a hierarchical namespace for their Logical Names and the system keeps track of their association with the actual names of the physical files. In addition, it is possible to attach metadata to each level of the hierarchical namespace. The Data Catalogue can be browsed from the AliEn command line as well as from the Web portal. The metadata associated with the images can be divided into several categories: patient and exam identification data, results of the CAD algorithm analysis, radiologist's diagnosis, histological diagnosis, and so on. Both tele-diagnosis and tele-training require interactivity in order to be fully exploited. The PROOF (Parallel ROOt Facility) system [4] provides the functionality required to run interactive parallel processes on a distributed cluster of computers. A dedicated cluster of several PCs was configured and the remote analysis of digitized mammograms without data transfer was recently run. The basic idea is that, whenever a list of input Logical Names is selected, it generates a list of physical names, one per image, consisting of the node name corresponding to the Storage Element where the image is located, and the physical path on its file system. This information is used to dynamically generate a C++ script driving the execution of the CAD algorithm, which is sent to the remote node. Its output is a list of positions and probabilities corresponding to the image regions identified as pathological by the CAD algorithm. Based on that, it is possible to decide whether the image retrieval is required for immediate analysis or not.
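As a toy illustration of this workflow — logical names in a hierarchical namespace, metadata attached to them, and a mapping to the physical storage location — consider the following sketch. It deliberately does not reproduce the AliEn Data Catalogue API; class, method and metadata names are invented, and the selection-before-transfer step is reduced to a simple metadata query.

```python
class DataCatalogue:
    """Toy stand-in for a grid data catalogue: logical file names (LFNs) in a
    hierarchical namespace, each carrying metadata and a physical location."""

    def __init__(self):
        self._entries = {}          # LFN -> {"pfn": (node, path), "meta": {...}}

    def register(self, lfn, storage_node, physical_path, **metadata):
        self._entries[lfn] = {"pfn": (storage_node, physical_path), "meta": metadata}

    def query(self, prefix="/", **conditions):
        """Return LFNs under a namespace prefix whose metadata match all conditions."""
        return [lfn for lfn, e in self._entries.items()
                if lfn.startswith(prefix)
                and all(e["meta"].get(k) == v for k, v in conditions.items())]

    def locate(self, lfn):
        """Resolve a logical name to its (storage node, physical path) pair."""
        return self._entries[lfn]["pfn"]


if __name__ == "__main__":
    cat = DataCatalogue()
    cat.register("/hospitals/siteA/mammo/2004/p001_l_cc.img",
                 "se01.siteA", "/data/mammo/p001_l_cc.img",
                 modality="mammography", cad_result="suspect")
    # Select only the suspect images for transfer to a diagnostic centre.
    for lfn in cat.query("/hospitals/", modality="mammography", cad_result="suspect"):
        node, path = cat.locate(lfn)
        print(f"retrieve {lfn} from {node}:{path}")
```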
3. Medical Applications The medical applications of the MAGIC-5 Project cover two main fields: 1. breast cancer detection in mammographic images; 2. nodule detection in lung Computed Tomography (CT) images. While the analysis of mammographic images started some years ago, the detection of malignant nodules in CT scans represents a new activity of the collaboration. In the following sections a brief review of the CAD systems developed so far for both applications will be given. Moreover, a GRID implementation for the diagnosis of Alzheimer disease (AD) will be mentioned as a neuro-application which is going to be implemented in a GRID environment by the collaboration.
Table 1. Composition of the MAGIC-5 mammographic image database.

images with ML | images with MC | healthy images
1153 | 287 | 2322
3.1. Mammographic CAD Systems A database of mammographic images was acquired in the hospitals belonging to the collaboration. Pathological images have been diagnosed by experienced radiologists and confirmed by histological exam; they contain a full description of the pathology including radiological diagnosis, histological data, type and location. This information provides the ground truth against which the CAD results are compared. Images with no sign of pathology were considered as healthy and included in the database after a follow up of three years. The images were digitized by means of a Linomed CCD scanner with 85 μm pitch and 12 bits per pixel. Each image is thus described by 2657 × 2067 pixels with G = 2¹² = 4096 grey level tones. Two different kinds of structures can mark the presence of a breast neoplasia: massive lesions (ML) and microcalcification clusters (MC). Massive lesions are rather large (diameter of the order of centimeters) objects with very different shapes, showing up with faint contrast (see Figure 3). Microcalcification clusters consist of groups of rather small (approximately from 0.1 to 1.0 mm in diameter) but very bright objects (see Figure 3). The database composition is reported in Table 1. Different CAD systems have been developed for ML and MC detection. In both cases, the algorithms consist of three main steps: 1. segmentation: to perform an efficient detection in a reasonable amount of time, a reduction of the image size is required, without missing any pathology; to this purpose, the portions of the mammogram having the highest probability of containing the pathology are selected, with a demand for efficiency as close as possible to 100%. An example of the segmentation processing is shown in Figure 3 for both massive lesion and microcalcification cluster detection. For a useful reference, the original images are also reported; 2. feature extraction: the portions of the mammogram extracted by the segmentation step are characterized by proper sets of features; 3. classification: the selected regions are used as inputs to a supervised two-layered feed-forward neural network whose output provides a degree of suspiciousness for the corresponding region. A detailed description of the CAD algorithms for massive lesion and microcalcification cluster detection is given in [5] and [6], respectively. Here we report in Table 2 the results obtained, in terms of sensitivity (fraction of correctly detected pathologies with respect to the total number diagnosed by the radiologist) and false positives per image (number of misclassified healthy ROIs per image), together with the total number of analyzed images for both cases. 3.1.1. The CAD Station The hardware for the CAD station consists of a PC running Linux connected to a planar scanner and to a high resolution monitor. The station allows human
Figure 3. Massive Lesion segmentation is shown in the upper figure: the original image (left), the image without the ROI (middle) and the extracted ROI (right). Microcalcification detection is shown in the lower figure: the original image (left), and the image with the CAD results being superimposed (right).
Table 2. Performances of the MAGIC-5 mammographic CAD systems.

 | sensitivity | FP/image | analyzed images
ML CAD | 80% | 3 | 3475
MC CAD | 96% | 0.3 | 278
or automatic analysis of the digital mammogram, which can be acquired directly from the scanner or from a file. The software configuration for local mode use requires the installation of ROOT [4] and GPCALMA, which can be downloaded in the form of source code from the respective CVS servers. A Graphic User Interface (GUI) has been developed (see Figure 4) to drive the execution of three basic functionalities related to the Data Catalogue: 1. registration of a new patient, based on the generation of a unique identifier, which could be easily replaced by the Social Security identification code; 2. registration of a new exam associated with an existing patient;
Figure 4. The Graphic User Interface. Three menus allow browsing of the Patient, Image and CAD diagnosis levels. On the left mammogram, the CAD results for microcalcifications and masses are shown as red squares and green circles, respectively, together with the radiologist's diagnosis (blue circle).
3. query to the Data Catalogue to retrieve all physical file names of the exams related to a patient and, if required, analyze them. The images are displayed according to the standard format required by the radiologists: for each image, it is possible to insert or modify diagnosis and annotations, and to manually select the portion of the mammogram corresponding to the radiologist's indication. An interactive procedure allows a number of operations such as zooming, windowing, gray level and contrast selection, image inversion, and luminosity tuning. The human analysis produces a diagnosis of the breast lesions in terms of kind, localization on the image, average dimensions and, if present, histological type. The automatic CAD procedure finds the Regions of Interest (ROIs) of the image with a probability of containing a pathological area larger than a pre-defined threshold value. 3.2. Malignant Nodule Detection in CT Scans The detection of malignant nodules in lung Computed Tomography (CT) images represents the newest activity of the MAGIC-5 Collaboration. Two initial steps for the development of a CAD system have been implemented: 1. the automated extraction of the pulmonary parenchyma; 2. the detection of nodule candidates based on a dot-enhancement filter. The first step aims at removing from the CT image all pixels located outside the chest. It is based on a combination of image processing techniques to identify the pulmonary parenchyma, such as threshold-based segmentation, morphological operators, border detection, border thinning, border reconstruction, and region filling. Figure 5 displays an example of a CT scan analyzed through the above mentioned techniques: the detected borders, drawn with a white line, contain the pulmonary parenchyma which is given as input to the nodule detection filter. In particular, the figure shows a screenshot of a tool developed within the collaboration to perform and optimize lung segmentation with different algorithms.
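A minimal version of the parenchyma-extraction step can be sketched with standard scipy.ndimage operations: threshold the air-like voxels, discard the air connected to the image border, keep the two largest remaining components and fill their internal holes. The -400 HU threshold and the structuring element are assumptions for illustration, and the border tracing, thinning and reconstruction stages used by the actual MAGIC-5 tool are not reproduced here.

```python
import numpy as np
from scipy import ndimage

def extract_lung_mask(ct_slice, air_threshold=-400):
    """Rough lung parenchyma mask for a single CT slice in Hounsfield units."""
    air = ct_slice < air_threshold                          # air and lung voxels
    # Remove the background air that touches the image border.
    labels, _ = ndimage.label(air)
    border = np.unique(np.concatenate(
        [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]))
    lungs = air & ~np.isin(labels, border)
    # Keep the largest connected components (ideally the two lungs).
    labels, n = ndimage.label(lungs)
    if n == 0:
        return lungs
    sizes = ndimage.sum(lungs, labels, index=range(1, n + 1))
    keep = 1 + np.argsort(sizes)[-2:]
    lungs = np.isin(labels, keep)
    # Smooth the contour and fill internal structures (vessels, nodule candidates).
    lungs = ndimage.binary_closing(lungs, structure=np.ones((5, 5)))
    return ndimage.binary_fill_holes(lungs)

if __name__ == "__main__":
    slice_hu = np.full((128, 128), 40.0)                    # soft tissue background
    slice_hu[30:90, 20:55] = -800                           # synthetic left lung
    slice_hu[30:90, 70:105] = -800                          # synthetic right lung
    slice_hu[55:60, 30:35] = 20                             # a small "nodule" inside
    mask = extract_lung_mask(slice_hu)
    print("lung pixels:", int(mask.sum()))
```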
Figure 5. Screenshot of the user interface of a software tool developed within the collaboration to perform and optimize lung segmentation with different algorithms. The result on a sample CT image is also shown.
The nodule detection step relies on the application of a dot-enhancement filter [7] to the 3D matrix of voxel data. A simple threshold-based peak-detection algorithm is then applied to the filter output. The above mentioned processing steps have been applied to sets of lung multi-slice CT scans acquired at high (standard setting: 120 kV, 120 mA) and low (screening setting: 120 kV, 20 mA) dose and with different slice thicknesses (1 mm and 5 mm). Each scan consists of a sequence of about 300 slices (for the 1 mm slice thickness series) stored in the DICOM (Digital Imaging and COmmunications in Medicine) format. Preliminary results show that the filter identifies spherical nodules, thus being acceptable as a pre-processing stage for nodule detection. Moreover, the filter is effective for low-dose scans, which is desirable in view of screening applications. Figure 6 displays an example of a correct nodule detection. Yet, the algorithm is also sensitive to some saddle-like configurations of blood vessels, thus generating a high number of false positive peaks. Further processing stages are under study to eliminate this drawback. 3.3. A GRID Implementation for the Alzheimer Disease Diagnosis Alzheimer disease (AD) is the leading cause of dementia in elderly people. Clinically, AD is characterized by a progressive loss of cognitive abilities and memory. One of the most widely used tools for the analysis of medical imaging volumes for neurological applications is the SPM (Statistical Parametric Mapping) software, which has been developed by the Institute of Neurology at University College London. SPM provides a number of functionalities related to image processing and statistical analysis, such as segmentation, co-registration, normalization, parameter estimation, sta-
Figure 6. An example of a correct nodule detection through the application of a dot-enhancement filter [7] to the 3D matrix of voxel data, and threshold-based algorithm for peak-detection.
tistical mapping (see Figure 7). The quantitative comparison, through the SPM software, of PET images from suspected AD patients with those included in a database of normal cases provides powerful support for an early diagnosis of AD. To this purpose, the use of an integrated GRID environment for the remote and distributed processing of PET images at a large scale is strongly desirable. This application is now being implemented in the MAGIC-5 GRID infrastructure.
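As a deliberately simplified stand-in for what SPM does statistically, the comparison against a normal database can be illustrated as a voxel-wise z-score map, assuming all volumes have already been spatially normalised to a common template and globally intensity-normalised; regions with strongly negative z-scores would flag the hypometabolism typical of AD. This sketch is ours and does not reproduce SPM's general linear model machinery.

```python
import numpy as np

def hypometabolism_zmap(patient, normals, eps=1e-6):
    """Voxel-wise z-score of a patient PET volume against normal controls.
    patient: (X, Y, Z) array; normals: (N, X, Y, Z) array, all assumed to be
    spatially normalised to the same template and intensity-normalised."""
    mean = normals.mean(axis=0)
    std = normals.std(axis=0, ddof=1) + eps
    return (patient - mean) / std

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normals = rng.normal(100, 10, size=(20, 8, 8, 8))
    patient = rng.normal(100, 10, size=(8, 8, 8))
    patient[2:5, 2:5, 2:5] -= 30                  # simulated hypometabolic region
    z = hypometabolism_zmap(patient, normals)
    print("voxels with z < -2:", int((z < -2).sum()))
```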
4. Present Status and Future Plans The GRID approach to the analysis of distributed medical data is very promising. The AliEn Server managing the VO services has been installed and configured, and some AliEn Clients are in use. The remote analysis of mammographic images works successfully, thanks to the PROOF facility. Presently, all blocks required for the implementation of the tele-diagnosis and screening use cases are integrated into a prototype system. The ROOT functionality has improved the GUI, which is now considered satisfactory by the radiologists involved in the project, due to the possibility of manipulating the image and the associated metadata. The MAGIC-5 GRID philosophy relies on the principle that the images are collected in the hospitals and analyzed by means of the CAD systems; only the images with a high probability of carrying a pathology are moved over the network to the diagnostic centres, where the physicians can analyze them, almost in real time, by taking advantage of the CAD selection.
Figure 7. Simplified sketch of the data processing flow for the Alzheimer disease diagnosis (see text).
A future prospect of our project is the migration from the AliEn middleware to the EGEE/gLite middleware which is likely to become a European standard and will certainly provide more sophisticated tools with respect to the present AliEn functionality. The medical applications are continuously under development. Both new algorithms (pulmonary CAD) and improvements of the existing ones (mammographic CADs) are under study. At the same time, part of the future work will be focused on the collection of a CT image database (at present, a limited number of scans is already available) and the implementation of the VO related to the PET image analysis for the early AD diagnosis.
References [1] http://alice.cern.ch. [2] U. Bottigli et al., GPCALMA: a tool for mammography with a GRID connected distributed database, Proc. of the Seventh Mexican Symp. on Medical Physics 2003, vol. 682/1, p. 67; also e-preprint physics/0410084. [3] http://alien.cern.ch. [4] http://root.cern.ch. [5] F. Fauci et al., Mammogram Segmentation by Contour Searching and Massive Lesion Classification with Neural Network, Proc. IEEE Medical Imaging Conference, October 16-22, 2004, Rome, Italy. [6] C.S. Cheran et al., Detection and Classification of Microcalcification Clusters in Digital Mammograms, Proc. IEEE Medical Imaging Conference, October 16-22, 2004, Rome, Italy. [7] M. Aoyama, Q. Li, S. Katsuragawa, F. Li, S. Sone, K. Doi, Computerized scheme for determination of the likelihood measure of malignancy for pulmonary nodules on low-dose CT images, Medical Physics 30 (3), 387-441, 2003.
NEOBASE: Databasing the Neocortical Microcircuit Asif Jan MUHAMMAD 1 and Henry MARKRAM Brain Mind Institute, Ecole Polytechnique Fédérale De Lausanne (EPFL), Switzerland
Abstract. Mammals adapt to a rapidly changing world because of the sophisticated perceptual and cognitive functions enabled by the neocortex. The neocortex, which has expanded to constitute nearly 80% of the human brain, seems to have arisen from repeated duplication of a stereotypical template of neurons and synaptic circuits, with subtle specializations in different brain regions and species. Determining the design and function of this microcircuitry is therefore of paramount importance to understanding normal and abnormal higher brain function. Recent advances in recording synaptically-coupled neurons have allowed rapid dissection of the neocortical microcircuitry, yielding a massive amount of quantitative anatomical, electrical and gene expression data on the neurons and the synaptic circuits that connect them. Due to the availability of these data, it has now become imperative to database the neurons of the microcircuit and their synaptic connections. The NEOBASE project aims to archive the neocortical microcircuit data in a manner that facilitates development of advanced data mining applications, statistical and bioinformatics analysis tools, custom microcircuit builders, and visualization and simulation applications. The database architecture is based on ROOT, a software environment that allows the construction of an object-oriented database with numerous relational capabilities. The proposed architecture allows construction of a database that closely mimics the architecture of the real microcircuit, which facilitates the interface with virtually any application, allows for data format evolution, and aims for full interoperability with other databases. NEOBASE will provide an important resource and research tool for studying the microcircuit basis of normal and abnormal neocortical function. The database will be available to local as well as remote users using Grid based tools and technologies. Keywords. Databases, Data Management, Neocortical Microcircuit, Distributed Applications
Introduction The neocortical microcircuit is unique in that neocortical neurons are arranged in layers (layers I-VI) that connect to different cortical and sub-cortical regions (Jones, 1984; White, 1989). In the horizontal dimension, the neocortex is functionally parcellated into collaborative groups of neurons, commonly thought of as functional columns (Mountcastle, 1957; Hubel & Wiesel, 1962). In rodents, a neocortical column of about 0.3 mm in diameter contains roughly 7500 neurons (100 neurons in Layer I; 2150 in Layer II/III; 1500 in Layer IV; 1250 in Layer V and 2500 in Layer VI). Most neocorti-
Correspondence to: Brain Mind Institute, Ecole Polytechnique Fédérale De Lausanne (EPFL), Switzerland; Email:
[email protected].
cal neurons (70-80%) are excitatory pyramidal neurons (Peters & Jones, 1984; White, 1989; DeFelipe & Farinas, 1992), which have relatively stereotyped anatomical, physiological and molecular properties (DeFelipe & Farinas, 1992; Toledo-Rodriguez et al., 2003). The remaining 20–30% of neocortical neurons are interneurons, mostly inhibitory interneurons, which have extremely diverse morphological, physiological and molecular characteristics (Houser et al., 1984; Peters & Jones, 1984; White, 1989; DeFelipe, 1993; Cauli et al., 1997; Kawaguchi & Kubota, 1997; DeFelipe, 2002; Toledo-Rodriguez et al., 2003). Our laboratory has recorded from over 1000 neocortical neurons; we have reconstructed over 500 of these neurons and over 200 synaptic connections between specific neurons. We have also studied the expression of 50 genes in over 200 neurons. We are now accumulating data on the microcircuit at an even greater pace and estimate that we will have recorded from over 3000 neurons by 2008, which will provide a realistic sample of the different types of neurons and all major pathways in the neocortical microcircuit. We have also established industrial scale 3D computer reconstruction facilities which will yield 3D models and detailed morphometric data on virtually all the neurons. We have also developed a gene expression protocol with femtogram sensitivity for mRNA, which enables single cell DNA microarray analysis, and we have obtained routine access to a high throughput Affymetrix DNA microarray facility. We therefore aim to obtain the expression profile for all major cell types in the neocortical microcircuit within the next 3 years. In addition, progressively more labs are obtaining quantitative data on the neocortical microcircuitry that will add considerably to the volume and diversity of the data. It has now become imperative to database the microcircuit components and their connections. A unique style of database is however required to allow databasing of microcircuit data not only for archiving, but also for Neuroinformatics research, for building custom versions of the microcircuit, for visualization of individual neurons, small groups of cells or the entire microcircuit, and for simulating the microcircuit at various levels of detail. A microcircuit database needs to be specific to the type of microcircuit studied because of the specific arrangement of and connections between neurons. We had earlier constructed a prototype database (Markram et al., 2003) to expose and work through the important issues of databasing microcircuit information. The knowledge and experience from the prototype were utilized for designing and building a futuristic platform, called NEOBASE, for storing the microcircuit data. NEOBASE will organize the microcircuit data for optimal storage, knowledge sharing, analysis, visualization and future simulations.
1. Data Model for Neurons and Synapses Neurons are characterized in terms of their morphological, physiological and gene expression profiles, and synaptic connections are characterized in terms of their morphological and physiological profiles. Neuron morphology profiles are obtained from a detailed morphometric breakdown of 3D-reconstructed neurons (m-Profiles), neuron physiology profiles are obtained from detailed measurements of the electrophysiological responses to a series of stimulus protocols (e-Profiles), and neuron gene expression profiles are obtained from single cell RT-PCR and DNA microarray data (g-Profiles). Synaptic connections are characterized by the identity of the pre- and postsynaptic neurons (sn-Profile), the axonal and dendritic location of light microscopically identified
putative synapses (sm-Profile), and the physiology of synaptic connections obtained from recording postsynaptic responses to a series of stimulation protocols applied to the presynaptic neuron (se-Profile). A detailed treatment of the various Neuron and Synapse profiles is given in (Markram et al., 2003); these profiles are briefly described in the following paragraphs. 1.1. Neuron Profiles The Neuron data is classified into Morphology, Electrophysiology and Gene Expression profiles. The following paragraphs describe the various Neuron profiles. 1.1.1. Morphology Data (m-Profile) To obtain the m-Profile for a neuron, the neuron is fully 3D-computer reconstructed and converted into Neurolucida format (Glaser and Glaser, 1990). This 3D model is then uploaded into the database and a MATLAB-based tool automatically performs an extensive morphometric analysis on the model neuron and enters the m-Profile into the database. The m-Profile is a vector of more than 200 values that represent various aspects of the geometry of the neuron. Examples of m-Profile data include TreeLengthMean (mean of lengths of segments with the same order in each tree), IndivTreeLengthMean (mean of segment lengths in a tree), and XY_Angle (angle between the projection of a segment on the XY plane and the X axis). 1.1.2. Electrophysiology Data (e-Profile) The electrophysiological profile of neurons is obtained by applying a series of different waveforms of current injection into the soma of a neuron, during intracellular or whole-cell patch-clamp recording. The responses to these pulses are measured to obtain a spectrum of electrophysiological parameters (EPs). Examples of EPs include action potential active properties (such as action potential waveform, after action potential, and discharge parameters) as well as passive properties (such as input resistance, membrane rectification, and membrane time constants). 1.1.3. Gene Expression Data (g-Profile) The genetic profile is obtained from single cell multiplex RT-PCR studies (non-quantitative and quantitative) and single cell DNA microarray analyses. The PCR studies that have been carried out so far enabled non-quantitative detection of the expression vs non-expression of 50 genes (Toledo-Rodriguez et al., 2004). The g-Profile is divided into functionally characterized groups of genes such as calcium binding proteins, neuropeptides, neurotransmitter enzymes, structural proteins and more (Cauli et al., 1997; Toledo-Rodriguez et al., 2004). 1.2. Synaptic Profiles Multineuron recording allows simultaneous characterization of the anatomy and physiology of synaptic connections between identified pre- and postsynaptic neurons. The data collected about synapses is organized in the following profiles: 1.2.1. Morphology Data (sm-Profile) The anatomy of a synaptic connection is described by the sm-Profile. This profile contains information about the numbers of putative synapses, their location on the axonal
and dendritic arbors of the pre- and postsynaptic neuron, respectively (also referred to as Synaptic Innervation Patterns), and the axonal and dendritic geometric and electrotonic distances of each of the putative synapses (see (Markram et al., 1997; Wang et al., 2002)). Examples of sm-Profile parameters include Axonal Branch Order (number of branch points between the bouton forming the synapse and the soma of the source neuron), Dendritic Branch Order (the location of the synapse along the dendritic arbor according to the branching frequency of the dendritic tree), and Geometrical Distance (the distance along the dendrite from the synaptic location to the postsynaptic soma). 1.2.2. Electrophysiology Data (se-Profile) The electrophysiological properties of synapses are characterized in terms of their biophysical, quantal and dynamic properties (Markram et al., 1997; Gupta et al., 2000). The biophysical properties focus on the amplitudes, latencies, rise and decay times of PSPs and/or PSCs; synaptic conductances; synaptic charge transfer, etc. The quantal parameters include estimates of quantal size, probability of release and number of functional release sites. The dynamic properties include the time constants governing the rates of recovery from synaptic depression (D) and facilitation (F) as well as the absolute and effective utilization of synaptic efficacy parameters. 1.2.3. Pharmacological Data (sp-Profile) The pharmacological properties are described in terms of the responses to various blockers, agonists and antagonists. Examples of commonly used chemicals are bicuculline (GABA-a antagonist), APV (NMDA receptor antagonist), CNQX (AMPA receptor antagonist), CGP 35348 (GABA-b antagonist), NMDA (NMDA receptor agonist), and diazepam (GABA-a facilitator). The sp-Profile contains information describing the sensitivity of the synaptic connection to the different chemicals, and at which concentrations. 1.3. Additional Profiles In addition to storing data about the various Neuron and Synapse profiles discussed above, NEOBASE will also record Neuron Models and Canonical Data arranged in the following profiles: 1.3.1. Model Data (mod-Profile) The database will allow for depositing NEURON (Hines, 1994) models of each neuron. The NEURON model will include active properties by inclusion of ion channel constellations and parameters. Electrical properties of the neurons could be used in target functions to derive the optimal parameter settings to reproduce the electrical behavior of the neuron. Possible ion channel constellations could also be constrained by the ion channel genes that are found to be expressed by the different neurons. This section will therefore contain a complete model neuron for download, and the mod-Profile will contain the values for the parameters of the model. 1.3.2. Canonical Data (x-Profile) As the database becomes heavily populated, the mean statistical properties of each neuron will be used to build canonical neurons for each type of neuron and each type of synaptic connection. The canonical data section will contain all information as for indi-
individual neurons, including images, traces and all the neuronal and synaptic profiles. The section could also contain various degrees of simplification of the neuronal axonal and dendritic arborizations. mod-Profiles of canonical neurons will also be generated. The following sections illustrate the schematic layout of NEOBASE, describe the various system components, and present the current status of the project.
2. Implementation Details
The following paragraphs describe the implementation details of the NEOBASE project.

2.1. Database Platform
Databasing the microcircuit data requires a highly flexible platform that permits the construction of an architecture which allows the data to be mapped, analyzed and visualized as a circuit. One such highly flexible platform is provided by the ROOT system (Brun and Rademakers, 1996). ROOT is a highly evolved open source software environment built over many years at the European Organization for Nuclear Research (CERN), Switzerland. ROOT enables the construction of an object-oriented database with limitless relational capabilities, making it an ideal platform for databasing the microcircuit. ROOT consists of a large number of elementary "classes" (currently over 300) contributed by different researchers, and allows for the construction of new custom classes as required by different applications and programs. A class is the blueprint for a specific behavior and for a list of attributes, but it has no existence on its own² – analogous to an empty form that needs to be filled in. Once information is entered into the class, the class will behave accordingly and will contain the attributes entered. Entering data into a class is referred to as object instantiation, i.e. an object corresponding to the data values is created; it captures all the attributes described by the class and provides the implementation of the behavior needed to complete the desired task. Objects in an object-oriented programming language can be arbitrarily complex, as they contain not only a set of attributes but also a set of operations to perform under various conditions. Additionally, they form relationships with other objects according to specified rules. A collection of these objects and their interrelationships can be used to construct an object-oriented database which closely mimics the biological architecture. The capability of objects to relate to each other is an important feature that will allow dynamic 3D reconstruction of thousands of neurons, each contacting a specific part of a particular neuron. Furthermore, higher-level classes will be designed to contain multiple objects, which can be used to group, for example, neurons in different layers of the neocortex. The following figure provides the schematic layout of the NEOBASE system. As seen in the figure, NEOBASE consists of three layers: the database layer, the web portal and web services layer, and the data access layer. The subsequent paragraphs describe the architectural as well as implementation details of these layers.
² In software terminology, classes that describe the behavior of some commonly applicable operations are sometimes declared as "static" classes; in that case the class itself provides the support for all operations and there are no objects belonging to the class.
Figure 1. Schematic layout of the NEOBASE system.
2.2. Database Layer
In NEOBASE, we will model each elementary entity of the microcircuit, such as a specific gene, ion channel or receptor, as a class, called, for example, GeneX, IonChannelX, ReceptorX or NeuroTransmitterX. These classes will allow the storage of multidimensional data. For example, the IonChannelX class will allow the storage of the gene object that makes the ion channel (constructed with GeneX), the amino acid sequence of the ion channel, information about the density and distribution of the ion channel, information about the ion channel kinetics and even a computer model of the ion channel. All genes, ion channels and receptors will be grouped into higher-level classes called Genes, IonChannels, Receptors and Neurotransmitters respectively. Genes will be used to database all the genes found in a particular neuron, and this collection of genes will be stored as an object. A neuron will be databased using a high-level class called NeuronX. NeuronX will contain the objects belonging to the Genes, IonChannels, Receptors, Neurotransmitters, MorphologyX, ElectricalX and ConnectionX classes. NeuronX can even contain an entire NEURON model of the neuron generated with NeuronModX. When the different classes are filled in for a particular neuron, the neuron will be intuitively composed of a specific profile of objects at various levels and various dimensions. Section 3 of this paper briefly describes the current version of the database schema.
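As an illustration of how such classes might look on the ROOT platform, the following sketch defines a few of the entities named above in ROOT-style C++. The class and member names follow the naming used in the text, but the code is hypothetical rather than actual NEOBASE source; dictionary generation (ClassImp, rootcint) and most attributes are omitted.

   #include "TObject.h"
   #include "TObjArray.h"
   #include "TString.h"

   class GeneX : public TObject {
   public:
      TString name;            // gene symbol
      Bool_t  expressed;       // detected in single-cell RT-PCR
      ClassDef(GeneX, 1)       // ROOT I/O support, schema version 1
   };

   class IonChannelX : public TObject {
   public:
      GeneX    gene;           // the gene that makes the channel
      TString  aminoAcidSeq;   // amino acid sequence of the channel
      Double_t density;        // density/distribution summary value
      ClassDef(IonChannelX, 1)
   };

   class NeuronX : public TObject {
   public:
      TObjArray genes;         // collection of GeneX objects (g-Profile)
      TObjArray ionChannels;   // collection of IonChannelX objects
      TObjArray connections;   // links to other neurons via connection objects
      ClassDef(NeuronX, 1)
   };

Because a NeuronX object can hold collections of, and references to, other objects, populating such classes yields the web of interrelated objects from which the circuit can be reconstructed.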
2.2.1. Database Schema Evolution
The ROOT system supports the notion of schema evolution, i.e. the description of the object classes may change over the course of time. This allows the system to query and use objects described using a combination of old and new class definitions. The concept extends beyond reading simple objects and supports more complex situations, such as objects with multiple levels of inheritance, as well. The schema evolution is achieved in the following manner. With every object instance, the description of the class is stored in dictionary objects. The dictionary contains, amongst other parameters, the version number of every class in the hierarchy. When a version of an object is read, its class definition is also read. This definition is then compared to the in-memory definition of the same class. If there is a mismatch between the in-memory version of the class and the persistent version, the persistent definition is mapped to the in-memory definition, and the ROOT system takes care of changing the order of data members, deleting old members, adding new members, and so on. Thus, it will always be possible to read otherwise inconsistent versions of the same objects and differentiate their usage based on their version information and associated class definition.
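In ROOT this versioning is driven by the class version number recorded in the ClassDef macro. The following fragment continues the hypothetical NeuronX sketch above (includes as before) and shows how a later release of the class might look:

   class NeuronX : public TObject {
   public:
      TObjArray genes;
      TObjArray ionChannels;
      Double_t  somaDepth;     // data member added in a later release
      // was: ClassDef(NeuronX, 1)
      ClassDef(NeuronX, 2)     // version number incremented with the schema change
   };

Files written with version 1 of the class remain readable: the stored dictionary is compared with the in-memory definition, the new member is left at its constructed default, and reordered or removed members are mapped automatically.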
2.2.2. Data Querying, Update and Retrieval
Support for simple as well as complex search, update and data retrieval queries is one of the important characteristics of any database system. Additionally, the database system should assist users in discovering the database metadata properties in order to construct meaningful single- and multi-attribute queries. The NEOBASE system will provide support for discovering the metadata attribute information of the various objects in the database, allow the execution of multi-attribute search queries, and provide efficient mechanisms for data update and retrieval. Special consideration has been given to the efficient execution of search operations, in terms of reducing storage requirements and improving query response time. The ROOT framework provides, through ROOT trees, an optimized search strategy for querying large numbers of complex objects. For example, suppose we have a large number of neuron objects, say one million, and we want to find neurons whose anatomical type is Pyramidal Cell (PC), that belong to layer 2, and whose total dendritic and axonal tree length is more than 300 micrometers. Loading each neuron object into memory and checking its attributes against these values would have huge memory and processing requirements and would not be very efficient. Executing such a query in ROOT results in loading only the individual attributes of the objects (as opposed to the whole objects) and comparing them against the search criteria. Thus the operation has a very low memory footprint and scales to huge amounts of object data.
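A minimal ROOT sketch of such a query is shown below. The file, tree and branch names (neobase.root, neurons, cellType, layer, totalTreeLength) are hypothetical and simply mirror the example above; in particular, cellType is assumed to be an integer code in which 1 stands for a pyramidal cell.

   #include <cstdio>
   #include "TFile.h"
   #include "TTree.h"
   #include "TEventList.h"

   void selectPyramidalCells() {
      TFile f("neobase.root");                       // hypothetical file name
      TTree *neurons = (TTree*) f.Get("neurons");    // one entry per neuron
      // Only the branches named in the selection are read from disk, so the
      // query touches individual attributes rather than whole neuron objects.
      neurons->Draw(">>sel", "cellType==1 && layer==2 && totalTreeLength>300");
      TEventList *sel = (TEventList*) gDirectory->Get("sel");
      printf("%d matching neurons\n", sel->GetN());
   }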
2.3. Service Layer
The main objective of the NEOBASE project is to make the data about neurons and synaptic connections available for the benefit of the larger research community. The data will be freely accessible to interested users using standard Internet technologies such as the World Wide Web. In addition to providing data querying and browsing facilities to individual scientists, we aim to accommodate a large number of third-party application programs, such as programs for constructing 3D representations of neocortical microcircuits and tools for analyzing neuron and synapse data based on various models. The data in NEOBASE will be accessible to users via an easy-to-use and comprehensive web portal. Additionally, the database will also be accessible, via a web services layer, to third-party application tools for circuit building, visualization and simulation. The web portal will allow individual users to connect to and browse the microcircuit data via standard web browsers. The application programs, however, will require programmatic access to the database in order to extract the required information. It is also important to mention that, on the WWW, multi-dimensional presentation of data about neurons and neuron networks is limited by the facilities offered by HTML and related technologies. For example, it will not be possible³ on HTML-based pages to display neurons as 3D objects that can be rotated and zoomed, for various security and performance reasons. Using the web services layer, on the other hand, advanced database browsing and visualization environments will be able to extract the data about neurons and synapses in its native format and provide a rich, interactive, multi-dimensional representation. Various database-related services will be implemented as part of the current project. These services will be accessible via the web portal and the web services. They include, but are not limited to, database querying and browsing, data upload and download tools, and data analysis and processing programs.

2.4. Data Access Layer
The data in NEOBASE will be accessed by individual scientists as well as by application programs for circuit reconstruction, visualization and simulation, among many others. The data access layer will provide the functionality to interact with and query the data contained in NEOBASE. The application programs are, however, responsible for interpreting the data for their own usage. The system will be accessible to web browsers as well as application programs, e.g. for circuit building and visualization; the latter will use the web services layer for accessing the database.
3. Current Status
A prototype version of the database has been implemented. The database consists of a set of elementary classes corresponding to Neurons, Synapses and their different profiles. The data in the older version of the database (Markram et al., 2003) is being migrated to the new format. Figure 2 shows the fundamental classes used for representing the neocortical microcircuit. The following paragraph provides a very brief description of the classes shown in the figure. The Neuron class describes the general as well as the profile-specific properties of a microcircuit neuron. Similarly, the Synapse class represents the properties of a connection between two neurons. Both the Neuron and Synapse classes contain data members corresponding to the Attribute and Profile classes. The Attribute class describes a property of the neurons and synapses. These properties, in turn, may have values corresponding to the Value class. The Value class is extended into StringValue, BooleanValue, FloatValue, IntegerValue and CompositeValue in order to describe different data types. Similarly, the Profile class is subclassed into MorphologyProfile, ElectrophysiologyProfile and GeneExpressionProfile (and the corresponding synapse profile classes) in order to represent the different neuron and synapse profiles described in Section 1 of this paper. The current version of the database consists of classes facilitating the storage of neuron and synapse data only.
³ Various technologies such as Java applets allow arbitrarily complex code to be executed on client machines; however, these technologies are not widely used, for various security and performance reasons.
Figure 2. Database scheme for the NEOBASE project.
The strategy adopted in the NEOBASE project is to validate the database design by migrating the existing neuron and synapse data stored in the previous version of the database platform (Markram et al., 2003). Subsequently, the framework will be extended to include classes for genes, ion channels, receptors and neurotransmitters as well.
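To make the schema of Figure 2 concrete, the following sketch outlines the class hierarchy in ROOT-style C++. The names follow the figure, but the code is illustrative only and omits dictionary generation, memory management and most data members.

   #include "TObject.h"
   #include "TObjArray.h"
   #include "TString.h"

   class Value : public TObject {              // base class for attribute values
   public:
      virtual ~Value() {}
      ClassDef(Value, 1)
   };
   class FloatValue : public Value {
   public:
      Double_t value;
      ClassDef(FloatValue, 1)
   };
   class StringValue : public Value {
   public:
      TString value;
      ClassDef(StringValue, 1)
   };

   class Attribute : public TObject {
   public:
      TString name;                            // e.g. "TreeLengthMean"
      Value  *value;                           // polymorphic: Float, String, Boolean, ...
      ClassDef(Attribute, 1)
   };

   class Profile : public TObject {
   public:
      TObjArray attributes;                    // the Attribute objects making up the profile
      ClassDef(Profile, 1)
   };
   class MorphologyProfile : public Profile {
   public:
      ClassDef(MorphologyProfile, 1)
   };

   class Neuron : public TObject {
   public:
      TObjArray attributes;                    // general properties
      TObjArray profiles;                      // m-, e- and g-Profiles
      ClassDef(Neuron, 1)
   };

A Synapse class is structured analogously, holding the sm-, se- and sp-Profile objects of a connection between two Neuron objects.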
4. Conclusion
The neocortex subserves the most sophisticated perceptual and cognitive functions of mammals and occupies nearly 80% of the human brain. The microcircuitry of the neocortex lies at the heart of this immense computational power, and deriving the blueprint of the neocortical design will be essential to fully understanding neocortical information processing. The microcircuit is immensely complex, with tens of thousands of neurons making up a functional unit, depending on the species and brain region. There are a large number of different types of neurons, and each is intricately mapped onto a specific fraction of its neighbors using millions of synaptic connections, each with a
characteristic anatomy and physiology. It is therefore simply impossible to fully understand the neocortical microcircuit without a highly organized database. A progressively larger number of research labs are now generating quantitative data that could greatly accelerate the quest to reconstruct the neocortical microcircuit. It is important to mention that, while most of the data in the database will consist of rat somatosensory data obtained from brain slices, the task of dissecting the microcircuit in one species, age group and brain region is an immense one and will provide the foundation for selected and meaningful comparative studies in vivo, in other species and in other brain regions. NEOBASE will therefore expand, iterate and converge towards progressively more accurate descriptions of the neocortical microcircuitry in many species and brain regions. Finally, recent studies strongly implicate microcircuit deficits due to migration and differentiation abnormalities, such as deficits in specific types of interneurons, in a large number of neurological and psychiatric disorders, and being able to eventually simulate neocortical microcircuit behavior may be key to predicting the impact of such deficits on neocortical function.
Acknowledgments
NEOBASE builds on the earlier work reported in (Markram et al., 2003). We thank Rene Brun and Fons Rademakers at the European Organization for Nuclear Research (CERN) for helping us to design an object-oriented schema for the microcircuit database. We also thank Maria Toledo-Rodriguez and Gilad Silberberg at the Brain Mind Institute (EPFL) for their help in the conceptualization and initiation of the project.
References
Cauli B, Audinat E, Lambolez B, Angulo MC, Ropert N, Tsuzuki K, Hestrin S, Rossier J (1997) Molecular and physiological diversity of cortical non pyramidal cells. J Neurosci 17:3894-3906.
DeFelipe, J. & Farinas, I. (1992). The pyramidal neuron of the cerebral cortex: morphological and chemical characteristics of the synaptic inputs. Prog Neurobiol 39, 563-607.
DeFelipe, J. (2002). Cortical interneurons: from Cajal to 2001. Prog Brain Res 136, 215-238.
DeFelipe, J. (1993). Neocortical neuronal diversity: chemical heterogeneity revealed by colocalization studies of classic neurotransmitters, neuropeptides, calcium-binding proteins, and cell surface molecules. Cereb Cortex 3, 273-289.
Glaser JR, Glaser EM (1990) Neuron imaging with Neurolucida - a PC-based system for image combining microscopy. Comput Med Imaging Graph 14:307-317.
Gupta A, Wang Y, Markram H (2000) Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science 287:273-278.
Hines, M. (1994). The NEURON simulation program. In: Neural Network Simulation Environments, pp. 147-163.
Houser, C. R., Vaughn, J. E., Hendry, S. H. C., Jones, E. G. & Peters, A. (1984). GABA neurons in the cerebral cortex. In Cerebral cortex: functional properties of cortical cells, vol. 2. ed. Jones, E. G. & Peters, A., pp. 63-90. Plenum Press, New York.
Hubel, D. H. & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J Physiol 160, 106-154.
Jones, E. G. (1984). Laminar distribution of cortical efferent cells. In Cellular Components of the Cerebral Cortex, vol. 1. ed. Peters, A. & Jones, E. G., pp. 521-554. Plenum Press, New York.
Kawaguchi, Y. & Kubota, Y. (1997). GABAergic cell subtypes and their synaptic connections in rat frontal cortex. Cereb Cortex 7, 476-486.
Markram, H., Lubke, J., Frotscher, M., Roth, A. & Sakmann, B. (1997). Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. J Physiol 500, 409-440.
Markram H, Xiaozhong L, Silberberg G, Toledo-Rodriguez M and Gupta A (2003) The Neocortical Microcircuit Database (NMDB). In: Databasing the Brain: From Data to Knowledge; Koslow SH and Subramaniam S (Eds), Wiley Press.
Mountcastle, V. B. (1957). Modality and topographic properties of single neurons in a cat's somatosensory cortex. Journal of Neurophysiology 20, 408-434.
Peters, A. & Jones, E. G., ed. (1984). Cellular Components of the Cerebral Cortex, vol. 1. Plenum Press, New York.
Peters, A. & Jones, E. G. (1984). Classification of cortical neurons. In Cellular Components of the Cerebral Cortex, vol. 1. ed. Peters, A. & Jones, E. G., pp. 107-122. Plenum Press, New York.
Rene Brun and Fons Rademakers (1996), ROOT - An Object Oriented Data Analysis Framework, Proceedings AIHENP'96 Workshop, Lausanne, Sep. 1996, Nucl. Inst. & Meth. in Phys. Res. A 389 (1997) 81-86. See also http://root.cern.ch/.
Toledo-Rodriguez M, Blumenfeld B, Wu C, Luo J, Attali B, Goodman P, Markram H (2004) Correlation Maps Allow Neuronal Electrical Properties to be Predicted from Single-cell Gene Expression Profiles in Rat Neocortex. Cereb Cortex.
Toledo-Rodriguez, M., Gupta, A., Wang, Y., Wu, C. Z. & Markram, H. (2003). Neocortex: basic neuron types. In The Handbook of Brain Theory and Neural Networks. ed. Arbib, M. A. The MIT Press, Cambridge, Massachusetts.
Wang Y, Gupta A, Toledo-Rodriguez M, Wu CZ, Markram H (2002) Anatomical, physiological, molecular and circuit properties of nest basket cells in the developing somatosensory cortex. Cereb Cortex 12:395-410.
White, E. L. (1989). Cortical Circuits. Synaptic Organization of the Cerebral Cortex. Birkhauser, Boston.
Part 4
Ethical, Legal, Social and Security Issues
From Grid to Healthgrid T. Solomonides et al. (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
ARTEMIS: Towards a Secure Interoperability Infrastructure for Healthcare Information Systems
Mike BONIFACE¹ and Paul WILKEN
IT Innovation Centre, University of Southampton, 2 Venture Road, Chilworth Science Park, Southampton SO16 7NP, UK
Abstract. The ARTEMIS project is developing a semantic web service based P2P interoperability infrastructure for healthcare information systems. The strict legislative framework in which these systems are deployed means that the interoperability of security and privacy mechanisms is an important requirement in supporting communication of electronic healthcare records across organisation boundaries. In ARTEMIS, healthcare providers define semantically annotated security and privacy policies for web services based on organisational requirements. The ARTEMIS mediator uses these semantic web service descriptions to broker between organisational policies by reasoning over security and clinical concept ontologies. Keywords. Healthcare information systems, security, semantic interoperability, web services, P2P
1. Introduction
A typical healthcare provider will use many heterogeneous healthcare information systems to support the delivery of patient care, each designed to perform a specific function. Typically, these systems are standalone, developed by many different suppliers and incompatible with one another. The non-interoperability of IT systems represents the biggest single problem in transferring data securely between different parts of a healthcare system [22]. In the ARTEMIS project [1,9] we are developing a semantic web service based P2P interoperability infrastructure for healthcare information systems that will support new ways of providing health and social care. Healthcare providers join an ARTEMIS network to access medical web services that enable access to electronic healthcare records maintained by other healthcare organisations. We use ontologies, derived from existing healthcare standards, to describe the semantics of web service operations and data [4]. Using these semantic service descriptions we provide ARTEMIS mediation super peers that enable heterogeneous healthcare information systems to interoperate. Describing the functional characteristics of services is only part of the story. Resolving non-functional service requirements such as security and privacy is also essential for interoperation.
¹ Correspondence to: Mike Boniface, tel: +44 23 8076 0834, fax: +44 23 8076 0833, e-mail: [email protected].
Figure 1. Data privacy regulatory framework.
Healthcare information systems operate within a strict regulatory framework that is enforced to ensure the protection of personal data with regard to processing and that outlines the conditions and rules under which processing is allowed (see Figure 1). There are many such regulations at the European level [3,10], and additional legislation is implemented within member states [6]. According to EU Directive 95/46/EC, if a healthcare provider maintains personal data on its patients, the healthcare provider is identified as a data controller and is responsible for protecting that data against unauthorised use. Typically, a healthcare provider implements the legislation by authoring a security policy that mandates working practices and security technology requirements (key sizes, algorithms). If a healthcare provider wants to access personal data within another organisation, it is identified as a data processor. For communication to occur between data controller and data processor, consent must be obtained from the patient and a contract must exist between the two parties that defines conditions such as the type of data processing and how long the data can be stored by the data processor. After the out-of-band legislative conditions for data processing have been agreed, there are still technical challenges in terms of security and privacy mechanisms that need to be resolved before electronic healthcare records can be shared automatically between healthcare information systems. In most cases healthcare providers have different security policies that state a diverse set of security requirements and capabilities. Authentication and authorisation mechanisms for healthcare professionals may also differ. In this paper, we describe the ARTEMIS architecture and an approach for mediating between security and privacy policies using a combination of industry-supported web service standards and reasoning over semantic web service descriptions.
2. Web Service Standards and Interoperability
Web services have promised to provide a solution to complex interoperability problems through the use of open standards developed by organisations such as the W3C [25] and OASIS [20]. However, integrating heterogeneous systems based on standards does
not equate directly to interoperability. The first difficulty is that the standards themselves can be complex, interpreted in different ways, and implementations can provide different levels of compliance. In addition, the recent proliferation of sometimes competing web service standards for security and privacy, such as WS-Security, WS-SecurityPolicy, WS-Authorisation, WS-Privacy, WS-Trust and WS-SecureConversation, only increases the possibility of incompatible systems. In the healthcare sector, where numerous standards already exist, this problem is well known and initiatives such as IHE [12] dictate how complex standards such as HL7 [11] and DICOM [8] should be utilised when implementing hospital workflows. In the web service community, to ensure some level of interoperability, leading vendors formed a group called WS-Interoperability (WS-I) [27]. WS-I defines so-called "profiles" – constrained ways to use web service standards. For example, WS-I Basic Profile 1.0 [28] specifies that only certain transport protocols should be used (even though WSDL [26] can accommodate others), so that vendors do not have to implement all possible protocols in their frameworks. Even though many web service standards exist, only WS-I Basic Profile 1.0 (soon to be joined by Basic Security Profile 1.0 [29]) is widely supported by vendors' toolkits [13,14]. The second barrier to interoperability is that current standards have focused on syntactic issues, which in most cases still require human-readable specifications for service integration. To improve interoperability, semantics are needed to allow software to understand the meaning of data and a service's function, allowing improved service discovery, automated orchestration and mediation. The importance of semantics for interoperability is well documented [5]; however, semantic web technologies are not currently mainstream, with little adoption by leading vendors. The level of support for web service standards and semantic web technologies in existing industrial toolkits is a key constraint on the integration of healthcare information systems (HIS) with the ARTEMIS infrastructure. If the toolkits do not support the standards and technologies, the cost of joining could be prohibitive for HIS vendors. In ARTEMIS, we take a pragmatic approach by requiring only web services that conform to industrially implemented standards and by managing semantics within the middleware.

3. Web Service Descriptions
ARTEMIS middleware provides tools for authoring web services to provide access to existing healthcare information system services and electronic healthcare records. The discovery and advertisement of services by healthcare organisations are managed through a peer-to-peer network structure: each health information system is represented by an ARTEMIS Peer node that communicates directly with an 'ARTEMIS Mediator' or 'Superpeer'. The web service descriptions are held at the mediator using standard web service repositories; however, the service descriptions are annotated with semantics so that the service operation, the meaning of data and non-functional requirements can be understood by the infrastructure. Web services advertised on an ARTEMIS network are described using standard WSDL and WS-SecurityPolicy [30] annotated with semantics, as shown in Figure 2. The semantic descriptions are stored within the ARTEMIS middleware and are structured using the OWL-S [21] ontology augmented with medical data and security/privacy ontologies.
Services are functionally classified using the OWL-S service profile, which is extended with functional concepts derived from HL7 trigger events.
Figure 2. ARTEMIS web service descriptions.
The data semantics are represented using the OWL-S process model. The input and output process model parameters are associated with concepts from a clinical concept ontology (CCO). The CCO has been derived from the UMLS semantic network [18] and metathesaurus [19] to provide a rich set of terminology for describing medical data semantics. The use of HL7 and UMLS is not mandatory, as the ARTEMIS infrastructure can support arbitrary functional and clinical concept ontologies required by healthcare providers.

4. Security and Privacy Policy Mediation
A core requirement in ARTEMIS is for a very robust but highly flexible approach to security and privacy. The approach supported by ARTEMIS is to allow healthcare providers to codify their particular preferences and requirements for data security (confidentiality, integrity) and privacy (authorisation and anonymisation) in accordance with overarching organisational security policies. Healthcare providers exposing web services require requesters to conform to certain security requirements. For example, a security policy may state that the requester must be authenticated using X.509 certificates or SAML assertions and that data integrity must be verified using an SHA1-based digital signature algorithm. However, in practice requester and provider may have different security requirements and capabilities, and to achieve interoperability, brokering between security policies may be required. Figure 3 shows how the ARTEMIS infrastructure supports mediation between security policies.
Figure 3. Security policy brokering.
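Conceptually, the brokering shown in Figure 3 amounts to mapping the algorithm identifiers declared in each party's policy onto shared ontology concepts and then checking whether a compatible combination exists. The fragment below is a deliberately simplified, purely illustrative sketch of this idea; the names and data structures are hypothetical, and the actual ARTEMIS mediator reasons over OWL security ontologies rather than hard-coded tables.

   #include <iostream>
   #include <map>
   #include <set>
   #include <string>

   int main() {
      // Hypothetical mapping from standard algorithm URIs to ontology concepts.
      std::map<std::string, std::string> uriToConcept = {
         {"http://www.w3.org/2001/04/xmlenc#tripledes-cbc", "TripleDES-CBC"},
         {"http://www.w3.org/2000/09/xmldsig#rsa-sha1",     "RSA-SHA1"}
      };

      // Concepts required by the provider's policy and offered by the requester,
      // as extracted from their respective WS-SecurityPolicy documents.
      std::set<std::string> required = { "TripleDES-CBC" };
      std::set<std::string> offered  = {
         uriToConcept["http://www.w3.org/2001/04/xmlenc#tripledes-cbc"]
      };

      // The mediator checks whether the requester can satisfy the requirements.
      bool compatible = false;
      for (const std::string& c : required)
         if (offered.count(c)) compatible = true;

      std::cout << (compatible ? "policies compatible" : "mediation required")
                << std::endl;
      return 0;
   }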
Healthcare providers create standard WS-SecurityPolicy documents that define the security requirements and capabilities. The security policies contain references to standard identifiers for algorithms, such as http://www.w3.org/2000/09/xmldsig#rsa-sha1 for the RSA-SHA1 digital signature algorithm. These identifiers are mapped to concepts within security ontologies that enable reasoning about the various credential and security mechanisms at a semantic level. Ontologies already exist for this purpose [7]. When a web service is invoked, the ARTEMIS mediators act as brokers between requester and provider. For example, in Figure 3, healthcare organisation B may advertise a web service requiring Triple-DES encryption in cipher block chaining (CBC) mode. These security requirements are specified using WS-SecurityPolicy assertions, as shown below:
wsse:x509v3