Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6990
Abdelkader Hameurlain Josef Küng Roland Wagner Christian Böhm Johann Eder Claudia Plant (Eds.)
Transactions on Large-Scale Data- and Knowledge-Centered Systems IV
Special Issue on Database Systems for Biomedical Applications
Editors-in-Chief
Abdelkader Hameurlain, Paul Sabatier University, Institut de Recherche en Informatique de Toulouse (IRIT), 118, route de Narbonne, 31062 Toulouse Cedex, France
E-mail: [email protected]
Josef Küng, Roland Wagner, University of Linz, FAW, Altenbergerstraße 69, 4040 Linz, Austria
E-mail: {j.kueng,rrwagner}@faw.at

Guest Editors
Christian Böhm, Ludwig-Maximilians-Universität München, Lehr- und Forschungseinheit für Datenbanksysteme, Oettingenstraße 67, 80538 München, Germany
E-mail: [email protected]
Johann Eder, Alpen Adria Universität Klagenfurt, Institut für Informatik-Systeme, Universitätsstraße 65-67, 9020 Klagenfurt, Austria
E-mail: [email protected]
Claudia Plant, Florida State University, Department of Scientific Computing, 400 Dirac Science Library, Tallahassee, FL 32306-4120, USA
E-mail: [email protected]

ISSN 0302-9743 (LNCS)      e-ISSN 1611-3349 (LNCS)
ISSN 1869-1994 (TLDKS)
ISBN 978-3-642-23739-3     e-ISBN 978-3-642-23740-9
DOI 10.1007/978-3-642-23740-9
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011935603
CR Subject Classification (1998): H.2, I.2.4, H.3, H.4, J.1, H.2.8
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Life sciences, biology and medicine are highly dynamic and data-intensive disciplines. During recent years, the character of basic research and good practice has dramatically changed. Advanced data acquisition technology enables unique insights into complex biological processes and diseases. Gene sequencing and protein and metabolite profiling have revolutionized biology. Modern imaging and signal recording techniques like electrophysiology, mass spectrometry and magnetic resonance imaging allow researchers to study various aspects of biomedical systems at different levels of granularity, ranging from single cells to entire organisms. A large number of databases provide public access to literature, tools and experimental data. In principle, there has never been more information available to comprehensively answer high-level biomedical research questions, such as: how does the human brain distinguish different objects? Or how will a certain flu epidemic evolve? However, it is usually the case that only a tiny part of the information relevant to answering such a question is accessible to the biomedical expert. To avoid the situation of drowning in data but starving for knowledge, database technology needs to be closely integrated into biomedical research and clinical practice. This special issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems consists of nine contributions that demonstrate, by means of case studies, database solutions for various biomedical applications, including comparative genomics, neuroscience and the electronic health record. All papers underwent an extensive review process. Two of the papers are journal versions of the best papers of the ITBAM (International Conference on Information Technology in Bio- and Medical Applications) 2010 conference. The remaining papers, authored by leading experts in the field, were obtained through an open call for papers. The high-quality contributions of this special issue demonstrate that research in databases and research in biomedicine can achieve fruitful synergy. The contributions highlight some of the major challenges emerging from the biomedical applications that inspire and promote current database research: 1) Management, organization and integration of massive amounts of heterogeneous data. Philipp Rautenberg et al. present a data management system for electrophysiological neuroscience data. Based on the relational model, this database effectively organizes vast amounts of experimental and simulation data together with essential metadata and serves as an accessible platform for data and knowledge exchange among experts. Alessio Bechini and Riccardo Giannini propose a solution for organizing the vast amount of data produced by genotyping laboratories. A document management system enriched with ontology-guided semantic tagging provides flexible data organization and can be efficiently integrated into the workflow of a typical genotyping laboratory. João Zamite et al. present the MEDCollector, a system for multi-source epidemic data collection. The Web is a rich source of real-time epidemic data. However, the data is available in a wide range of formats and quality. Characterized by interoperability and modularity, MEDCollector enables data collection from multiple sources. Ontologies and workflows guarantee the consistency of the integrated information. 2) Bridging the semantic gap between high-level research questions and low-level data. Heri Ramampiaro and Chen Li focus on the special challenges of information retrieval in biomedical literature databases. Traditional information retrieval techniques suffer from the highly domain-specific and ambiguous terminology in biomedical texts. To meet the special requirements of biomedical documents, the BioTracer system suitably recombines and extends existing information retrieval technology and strongly involves user feedback. Idir A. Amarouche and colleagues propose an architecture for automatic composition of electronic health records from diverse data sources. Based on data as a service and an effective query rewriting approach, the system can handle high-level queries of medical practitioners. Paola Bonfante and colleagues introduce the BIOBITS system, which enables users to discover the evolutionary relationships between different publicly available genomes. Based on a modular database architecture, BIOBITS allows the user to flexibly analyze the data over different levels of abstraction, ranging from single genes to large sections of the genome. Andreas Dander et al. present KD3, a software suite for workflow-based exploration of biomedical data. The software assists the biomedical expert in all steps of the knowledge discovery process guided by a concrete biomedical research question. As an example workflow, the authors illustrate the discovery of biomarker candidates for liver disease from breath gas analysis. 3) Privacy and efficiency. Systems biology research involves complicated heterogeneous workflows with distributed databases, online tools and local software. Also for privacy reasons, remote services and databases allow only limited customization, which causes an overhead in computation time and network traffic. Hasan Jamil presents a novel collaborative model for data integration in life sciences respecting the special privacy requirements and preserving the autonomy of remote resources. Finally, Imen Bouraoui proposes a novel feature extraction technique for iris recognition. The human iris provides a high potential for reliable personal identification. The proposed approach is comparable in performance to the state-of-the-art in identification accuracy while being more efficient.
We thank Abdelkader Hameurlain, Josef Küng and Roland Wagner, the editors of TLDKS, for giving us the opportunity to serve as guest editors of this special issue. We further thank the reviewers for their effort and constructive suggestions. Special thanks also to Gabriela Wagner for supporting us with the organization. We hope that you enjoy the papers and perhaps find some inspiration for your interdisciplinary work in this exciting area.

June 2011

Christian Böhm, Johann Eder, Claudia Plant
Editorial Board
Hamideh Afsarmanesh — University of Amsterdam, The Netherlands
Francesco Buccafurri — Università Mediterranea di Reggio Calabria, Italy
Qiming Chen — HP-Lab, USA
Tommaso Di Noia — Politecnico di Bari, Italy
Georg Gottlob — Oxford University, UK
Anastasios Gounaris — Aristotle University of Thessaloniki, Greece
Theo Härder — Technical University of Kaiserslautern, Germany
Sanjay Kumar Madria — University of Missouri-Rolla, USA
Vladimir Marik — Technical University of Prague, Czech Republic
Dennis McLeod — University of Southern California, USA
Mukesh Mohania — IBM India, India
Tetsuya Murai — Hokkaido University, Japan
Gultekin Ozsoyoglu — Case Western Reserve University, USA
Oscar Pastor — Polytechnic University of Valencia, Spain
Torben Bach Pedersen — Aalborg University, Denmark
Günther Pernul — University of Regensburg, Germany
Colette Rolland — Université Paris 1 Panthéon Sorbonne, CRI, France
Makoto Takizawa — Seikei University, Tokyo, Japan
David Taniar — Monash University, Australia
Yannis Vassiliou — National Technical University of Athens, Greece
Yu Zheng — Microsoft Research Asia, China
Reviewers
Can Altinigneli — University of Munich, Germany
Klaus Hahn — HMGU Helmholtz Center Munich, Germany
Xiao He — University of Munich, Germany
Bettina Konte — University of Munich, Germany
Grigorios Loukides — Vanderbilt University, USA
Son T. Mai — University of Munich, Germany
Rosa Meo — University of Torino, Italy
Oscar Pastor Lopez — Univ. Politecnica de Valencia, Spain
Michael Plavinski — University of Munich, Germany
Andrew Zherdin — University of Munich, Germany
Table of Contents
Database Systems for Biomedical Applications

A Database System for Electrophysiological Data ......................................... 1
   Philipp L. Rautenberg, Andrey Sobolev, Andreas Herz, and Thomas Wachtler

Management of Genotyping-Related Documents by Integrated Use of Semantic Tagging ... 15
   Alessio Bechini and Riccardo Giannini

MEDCollector: Multisource Epidemic Data Collector ..................................... 40
   João Zamite, Fabrício A.B. Silva, Francisco Couto, and Mário J. Silva

Supporting BioMedical Information Retrieval: The BioTracer Approach .................. 73
   Heri Ramampiaro and Chen Li

Electronic Health Record Data-as-a-Services Composition Based on Query Rewriting .... 95
   Idir Amine Amarouche, Djamal Benslimane, Mahmoud Barhamgi, Michael Mrissa, and Zaia Alimazighi

A Modular Database Architecture Enabled to Comparative Sequence Analysis ............ 124
   Paola Bonfante, Francesca Cordero, Stefano Ghignone, Dino Ienco, Luisa Lanfranco, Giorgio Leonardi, Rosa Meo, Stefania Montani, Luca Roversi, and Alessia Visconti

[KD3] A Workflow-Based Application for Exploration of Biomedical Data Sets .......... 148
   Andreas Dander, Michael Handler, Michael Netzer, Bernhard Pfeifer, Michael Seger, and Christian Baumgartner

A Secured Collaborative Model for Data Integration in Life Sciences ................. 158
   Hasan Jamil

Flexible-ICA Algorithm for a Reliable Iris Recognition .............................. 188
   Imen Bouraoui, Salim Chitroub, and Ahmed Bouridane

Author Index ....................................................................... 209
A Database System for Electrophysiological Data

Philipp L. Rautenberg, Andrey Sobolev, Andreas V.M. Herz, and Thomas Wachtler
German Neuroinformatics Node, Department Biologie II, Ludwig-Maximilians-Universität München, Grosshaderner Str. 2, 82152 Planegg-Martinsried, Germany
{philipp.rautenberg,andrey.sobolev,andreas.herz,thomas.wachtler}@g-node.org
http://www.g-node.org/

Abstract. Scientific progress depends increasingly on collaborative efforts that involve exchange of data and re-analysis of previously recorded data. A major obstacle to fully exploiting the scientific potential of experimental data is the effort it takes to access both data and metadata for application of specific analysis methods, for exchange with collaborators, or for further analysis some time after the initial study was completed. To cope with these challenges and to make data analysis, re-analysis, and data sharing efficient, data together with metadata should be managed and accessed in a unified and reproducible way, so that the researcher can concentrate on the scientific questions rather than on problems of data management. We present a data management system for electrophysiological data based on well-established relational database technology and domain-specific data models, together with mechanisms to account for the heterogeneity of electrophysiological data. This approach provides interfaces to analysis tools and programming languages that are commonly used in neurophysiology. It thus will enable researchers to seamlessly integrate data access into their daily laboratory workflow and efficiently perform management and selection of data in a systematic and largely automatized fashion for data sharing and analysis.

Keywords: Electrophysiology, database.
1 Introduction

The human brain is one of the most complex biological systems. It contains more than 10^11 nerve cells and 10^15 synaptic connections, and its functional elements extend over more than ten orders of magnitude in space, from molecular pathways in individual synapses and neurons to the entire brain. Likewise, the dynamical processes that underlie brain function span orders of magnitude in time, from submillisecond molecular and cellular processes to the acquisition of life-long memories. The capabilities of the brain are based on the function of neurons, which transmit information by generating electrical signals, so-called action potentials or spikes, that stimulate other neurons to which they are connected. These networks perform highly nonlinear and recurrent processing, which severely limits any purely intuitive approach to understanding brain function.
1.1 Cellular and Systems Neurophysiology

Trying to understand the brain's functions requires investigation of the dynamical processes that underlie, shape, and regulate neural processing at a huge range of spatial and temporal levels. The experimental approaches to investigate neural functions reflect this variety. Non-invasive methods, such as electroencephalography or functional magnetic resonance imaging, can be used to investigate brain activity at the macroscopic level (Fig. 1A), but the spatial and temporal resolutions of these methods are too coarse to measure neural signals at the level of networks of neurons or even single neurons. To gain insight into the information processing in neurons and networks, electrophysiological techniques are indispensable (Fig. 1B). There is a large variety of experimental approaches in electrophysiology. Methods range from measuring the intracellular potential of a neuron in vitro with a single pipette electrode to the recording of both electrical potentials and spiking activity from the brain of a behaving animal with several hundreds of electrodes simultaneously. Progress in recording techniques and methodological approaches leads to ever-increasing data volumes and complexity of data. In addition, the increasing complexity of the experiments leads to an increasing amount of additional information (metadata) necessary to describe and analyze the data.

1.2 Electrophysiology Data
Variety and heterogeneity of neurophysiological data pose considerable challenges for the goal of developing common databases. In electrophysiological experiments, electrical signals from neurons are recorded as time-series data such
Fig. 1. Illustration of neuroscientific data. A: Collage of a horizontal section of a human brain and a network of neurons. B: Single neuron and sketch of a patch-clamp experiment recording currents across the membrane to illustrate different aspects of neuroscientific data. Electrical responses and properties like voltage traces or spike trains (1) can be recorded within living tissue. The membrane properties responsible for the electrical potential of a neuron can be described by electric circuits (2), geometrical properties (3), mathematical equations, and by implemented software (4).
as time-varying voltage signals. In spiking neurons, information is encoded in the sequence of action potentials fired, and therefore often the times of these events are extracted from the recorded signal, yielding a representation of the activity of the neuron as spike trains. Apart from these common basic types of data, experiments can vary in virtually all experimental aspects, such as the number and types of electrodes, the species and recording site, the preparation, the stimulation paradigm, and of course all the detail parameters of the recording and stimulation equipment. To meaningfully analyze a given data set, much of this information has to be known, and to ensure reproducibility, all of it should be recorded as metadata. Organizing metadata in itself is a challenge and exchanging metadata requires standardized mechanisms such as a common format (www.g-node.org/odml) [8]. Regardless of how metadata are organized in detail, data management can be efficient only if the metadata and the data are available together in a unified way, such that the information provided by the metadata can be taken into account when accessing the data. Providing metadata is therefore a crucial requirement for databases of experimental data.
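To make the role of such structured metadata more concrete, the following sketch describes a recording session as nested key-value sections, loosely in the spirit of the odML format mentioned above; all section names, properties, and values are invented for illustration and do not reproduce the actual odML terminology or API.

```python
# Illustrative only: a nested key-value description of a recording session,
# loosely inspired by odML (sections containing properties with values and
# units). All names and values below are made up for this example.
recording_metadata = {
    "Experiment": {
        "project": "auditory tuning",
        "experimenter": "A. Researcher",
        "date": "2011-03-15",
    },
    "Preparation": {
        "species": "Locusta migratoria",
        "preparation_type": "in vivo",
    },
    "Recording": {
        "electrode_count": 4,
        "sampling_rate": {"value": 20000.0, "unit": "Hz"},
    },
    "Stimulus": {
        "type": "acoustic pulse train",
        "duration": {"value": 2.0, "unit": "s"},
    },
}


def flatten(section, prefix=""):
    """Turn nested sections into flat 'path = value' pairs, e.g. for storage."""
    for key, value in section.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict) and "value" not in value:
            # a subsection: recurse into it
            yield from flatten(value, path)
        else:
            # a property (possibly with an attached unit)
            yield path, value


if __name__ == "__main__":
    for path, value in flatten(recording_metadata):
        print(path, "=", value)
```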
1.3 Data Repositories for Electrophysiology Data
Compared to other disciplines in the life sciences, such as Genomics and Proteomics [6, 14], the neurosciences lag behind regarding the use of databases for the organization and exchange of data. Only recently have attempts started to integrate neuroscience databases [1, 9], and here the focus is mostly on anatomical data. There are only few data repositories providing access to physiological data. One of the first attempts to build a database of neurophysiological data was the neurodatabase.org project [5], which was established in 2004, funded by the NIMH Human Brain Project and created by the Laboratory of Neuroinformatics at Weill Medical College of Cornell University. In this project, an elaborate data model and format as well as a query protocol for the exchange of neurophysiological data was developed, and the data are provided in this standard. The data sets available are typically data from publications and are available with an extensive amount of metadata. In 2008, the NSF initiated the funding program Collaborative Research in Computational Neuroscience (CRCNS), which included a project to set up a repository of neurophysiological data [15]. Several laboratories received funds to document sample datasets that were made available at the crcns.org web server. Format and documentation are different for each dataset, and for most datasets stimulation data are available. Data repositories like neurodatabase.org or crcns.org host data that have been uploaded specifically for the purpose of making the data available to the public. Typically, these data are from studies that have been published and they may be used for further investigations after they have served their primary purpose. Other projects are pursuing a different approach and provide data sharing infrastructure also for data that are not necessarily publicly available. The CARMEN project (www.carmen.org.uk), funded by the British Engineering and Physical Sciences Research Council, provides a platform for data analysis and data
exchange where the owner of the data can keep the data private, or can make the data available to selected users or the public. The platform is intended as a virtual neurophysiology laboratory that provides not only infrastructure for data exchange but also services for data analysis. Data files can be organized in a folder hierarchy and can be shared on a file-by-file basis. During file upload, the user has the option to enter metadata describing the experiment in which the data were recorded. This is done via web forms that provide fields corresponding to the minimal metadata that were proposed by the Carmen consortium [7]. The German Neuroinformatics Node is a project funded by the German Ministry of Education and Research (BMBF) that also has the goal to establish a platform for the sharing of neurophysiological data [10]. The strategy pursued is to not only set up a repository of data files, but to provide a framework that scientists can use to manage, access, and work with their data within their local workflow – in a manner that allows for data sharing with the lab, collaborators, or the public without further obstacles. A key requirement is the ability to store and access both the recorded data and the metadata (Fig. 2A) together so that all information necessary for data analysis, re-analysis, and sharing is available in a unified way. This integration of data and metadata has two benefits (Fig. 2B). First, data handling in the laboratory from recording to analysis becomes more efficient and reproducible. Second, data sharing requires no further effort because all the information is already available with the data.

1.4 Contributions
Here, we present a data management system for electrophysiological data that is based on relational database technology and domain-specific data models. We introduce mechanisms for data management that account for the heterogeneity of electrophysiological data. Furthermore, our approach provides interfaces to analysis tools and programming languages that are commonly used in neurophysiology. It thus will enable researchers to seamlessly integrate data access into their daily laboratory workflow and efficiently perform management and selection of data in a systematic and largely automatized fashion for data sharing and analysis. In the following, we first introduce the concept of using relational database technology for data management to integrate various stages of scientific work within the electrophysiological domain. Then we address technical and implementational aspects to explain the approach in further detail and provide an example and conclusions.
2 Concepts

To build a system that can be integrated with the scientist's workflow without barriers, different tools and aspects of scientific data analysis have to be taken into account: the programming language and the persistent data storage as general tools and infrastructure, but also objects and working mechanisms that help, e.g., to handle analysis workflows with the corresponding results, to access metadata,
to arrange data for simulations, or to exchange data and functionality in order to collaborate with other scientists. The single datum can be compared to a juggling ball amongst many that is thrown, caught, and processed around the entire scientific workbench. In order to allow the tools to handle data in the appropriate way, the data have to be structured appropriately. Therefore, we rely on the relational model [3]: many relational database management systems (RDBMSs) can ensure referential integrity and data consistency, and additionally the relationships are stored within the database and not as additional annotation or hidden within individual code. Furthermore, with SQL there exists an established standard for querying relational databases, and most significant programming languages provide APIs to access RDBMSs. Alternatives like NoSQL databases or object-oriented databases are more flexible or closer to programming. However, those approaches shift aspects of referential integrity or data consistency towards the application layer, whereas an RDBMS inherently provides for data integrity and consistency.

The Neuroscientist within the Loop. Scientists usually treat their experimental data and results as highly confidential and are reluctant to share them, especially before the publication of results. However, at a late point in the study it often turns out to be difficult to migrate or share the data of a project, as recorded data and corresponding metadata are distributed across lab notebooks, raw data files, variables and functions within the source code, and other sources. We propose to avoid this situation and ensure data integrity by working with a structured persistent data storage from the beginning. Therefore, the design of our concept accounts for the 'scientific workbench' and extends it with tools that facilitate the use of RDBMSs.

Integrating Data Access and Analysis Code. For the analysis of (experimental) data, it is important that scientists are able to re-use code that has been written previously. Therefore, we use an established standard as the interface to the database: RDBMSs account for that, and established APIs are available for several languages, e.g. Matlab, Python, C++, and R, which are core languages used in the area of neuroscience.

Procedures and Functions. Many procedures and functions are used to analyze data. Basic analyses like statistical descriptors or tests are used widely. Moving them into the database facilitates their application to data, their distribution to colleagues, and the automation of entire analysis workflows.

Data Models. Specific data structures correspond to specific areas within neuroscience (see Fig. 3, representing data of electrophysiological experiments). The specification of such data structures facilitates the deduction of relational schemata, or of source code that represents the corresponding objects. We call the specific data structure together with its canonical deduction rules a data model (DM). It consists of a set of objects and their relations with each other (see the definition of 'DM' later). A specific data model defines a set of core objects representing raw data that are commonly used within a specific area of neuroscience. For example, within
Fig. 2. Raw data, metadata, and its access. A: Raw data come together with metadata. For electrophysiological raw data, useful metadata give information, e.g., about experimental paradigms, experimental protocols, hardware devices, laboratories, and scientists. B: In order to be able to use raw data in an appropriate way, a database integrates and relates data with the corresponding metadata. Here, specific fields are identified to establish templates for corresponding schemata. These schemata can overlap, as illustrated for the two examples 'ephys' (a schema to represent data on the electrical properties of neurons) and 'morph' (a schema to represent morphological properties of a neuron). The database provides a standard interface which can be used by software and programming languages that are commonly used by scientists, like Matlab, Python, or OpenOffice, to import, export, analyze, visualize, and publish data.
Table 1. The data model 'neo_simple' has the two objects RecordingPoint and AnalogSignal. One RecordingPoint can have many AnalogSignals.

    object-name      attribute-name       attribute-type   constraint
    RecordingPoint   recordingpoint_key   integer          primary key
    AnalogSignal     analogsignal_key     integer          primary key
    AnalogSignal     recordingpoint_key   integer          foreign key

Table 2. An example of a configuration conf_neo_simple. In this example 'unit' is a data-attribute and, therefore, has to be unique: it is either a data-attribute of 'RecordingPoint' or a data-attribute of 'AnalogSignal', depending on the scenario the data describe (e.g. if the recording electrode determines the unit (e.g. 'mV'), 'unit(datum)' should be part of 'RecordingPoint').

    object-name      attribute-name    attribute-type
    RecordingPoint   scaling factor    float
    RecordingPoint   description       string
    AnalogSignal     frequency         float
    AnalogSignal     sampling points   integer
    AnalogSignal     unit              string
    AnalogSignal     signal            vector(float)
the area of electrophysiology, this set contains recorded voltage signals, spike trains, and electrodes. For each set of raw data, there will be information that describes the structure of the datasets in general and information that is specific to the individual data. To add lab-specific information about objects of a data model, scientists can assign 'data attributes' to these objects. As these data attributes will differ from lab to lab, we call them a 'configuration' of a data model 'xyz' (e.g. conf_xyz). Some attributes specify the structure via relationship constraints ('relationship attribute') while others describe data ('data attribute'). Here we provide a formal definition of a data model (DM ∈ DM) and the attribute-assignment ('configuration': conf_DM ∈ C_DM) of a data model (see Tables 1 and 2), with:

    O := {object-names: string}                                    (1)
    T := {types}                                                   (2)
    C := {relationship constraints}                                (3)
    A := {attribute-names: string} × T                             (4)
    DM := P(O × A × C)                                             (5)
    C_DM := { C ∈ P(O × A), DM ∈ DM |                              (6)
              ∀(o, a), (o', a') ∈ C : a = a' ⇒                     (7)
              o = o' ∧ ∃(o'', a'', c'') ∈ DM : o'' = o }           (8)
P.L. Rautenberg et al.
Metadata. In addition to structural metadata (provided by the data model) and descriptive metadata (content of data attributes about specific instances), we distinguish between internal and external metadata: internal metadata about an instance originate from the same data model as the instance itself. External metadata about an instance are metadata that originates from an instance of a distinct data model. To account for the diversity of metadata required to describe different datasets, further data models are needed to describe a set of objects representing external metadata (e.g. information about experiments, like ‘scientist’, ‘lab’, and ‘experimental paradigm’). In order to keep the data models independent, a ‘metadata’ instance is linked to a specific object by specifying a ‘gluing’-relationship (see below). Workflow. Accessing, analyzing, and storing data are essential actions in daily scientific work. Relationships between these actions, the applied parameters, and the resulting outcome can also be stored as relations and tuples within the relational model. Relations help to manage data and to trace actions of the workflow. Parts of data management is even be done automatically by the RDBS (depending on the schema), such as keeping data consistent. On the other hand, data integrates into an individual development environment with already existing code and scripts by using object relational mappings. Scientists stay flexible and develop their own structures/schemata, using at the same time wellproven systems that keep referential integrity and data consistiency.
3
Technical and Implementational Details
We implemented our solution using PostgreSQL 9.0 as RDBMS. Among the many different technical approaches for storing, retrieving, and managing of data, PostgreSQL is a powerful, open source object-relational database system. Its SQL implementation strongly conforms to the ANSI-SQL:2008 standard and, therefore, is widely compatible with other database systems. With version 9, PostgreSQL supports hot standby and replication which increases possibilities with respect to backup and performance. Further improvements important for neuroscientific applications have been introduced, such as the removal of limits of key and value sizes for ‘hstore’, which represents sets of (key, value) pairs within a single data field. This is useful in various scenarios like storing semi-structured metadata that are rarely examined. Furthermore, the support for ‘arrays’ as parameters and return values to other procedural languages has been enhanced. This is especially important as many neuroscientific analysis tools work with vectors, which can be mapped to PostgreSQL-arrays. PostgreSQL supports Python (PL/Python) and R (PL/R). These two programming languages bring the database closer to the neuroscientist: Python has become a de facto standard in several areas for which most neuroscientific research simulators like NEURON, or NEST provide interfaces [2]. It provides also powerful packages for data analysis (e.g.: numpy, scipy) and plotting (e.g.
A Database System for Electrophysiological Data
9
matplotlib). R is used widely for statistical computing and graphics [13]. Together with these and other programming languages the RDBMS can offer a wide range of functionality that the scientist needs and to fine-tune the interface to the individual workspace. From the software development point of view it is also very interesting that PostgreSQL supports multiple schemata which separate database objects like tables, views, sequences, and functions into different name-spaces. This enables scientists to divide the database into logically distinct parts while using constant connection to the server. Furthermore, schemata add another organization level to handle rules and permission, which facilitates the management of many users accessing one database for collaboration, or global data access. Practically, each user has an individual schema for each project with an administrator-role for this schema. In order to keep the neuroscientist within the loop and allow for smooth codecooperation between scientists, we suggest some conventions for schema structures and data models. This also lowers entry barriers for new features, which will appear immediately familiar if they adhere to the same conventions. Those conventions are being developed and extended according to the requirements of the community. For example, all names of DB-objects that belong to a specific data model start with the name of the model followed by ‘ ’ (e. g.: ‘neo ’, see example, later), or primary keys consists only of a single column of integers. 3.1 Data Model with Data and Internal Metadata To define a data model of core objects within the area of electrophysiological raw data, we adopt the approach developed in the neo project (neo: Neural Ensemble Object, see http://neuralensemble.org/trac/neo). This project provides common class names and concepts for dealing with electrophysiological (experimental and/or simulated) data. One aim is to provide a basis for interoperability and integration of projects like OpenElectrophy [4], NeuroTools (http://neuralensemble.org/trac/NeuroTools), G-Node [10], and other projects with similar goals [12]. The central data objects in the area of electrophysiology are the time series, such as voltage traces recorded from neurons. neo represents time series by the AnalogSignal object. But the AnalogSignal is useless without its corresponding metadata. neo provides this internal metadata in two canonical ways: each instance of an object (like AnalogSignal) is embedded into a hierarchical structure which provides the necessary implicit metadata in order to relate this instance with other instances correctly. Those other related instances (e.g. an instance of RecordingPoint) provide explicit metadata together with their data attributes (e.g. ‘scaling factor’, or ‘polarity’, ‘recording unit’) . The design of this hierarchical structure reflects a typical setup of an experimental paradigm: one or more electrodes (RecordingPoint) measure simultaneously for a specific period of time (Segment) the activity (SpikeTrain, AnalogSignal) of neurons (SingleUnit). To identify a specific AnalogSignal, the corresponding RecordingPoint and Segment could serve as space-time coordinates, thus using metadata to select raw data.
10
P.L. Rautenberg et al.
(
(
)
0
' *
+
,,
/
/
-
.
!" #$ # %%%
)
.
!" #$ # %%%
&$
0
!" #$ # %%%
*
-
''
+
'''
''
&
#
!"
#
%$ !
$%
Fig. 3. Integrating specific data models into the workflow of neuroscientists. Neuroscientists use different open source tools. Here we illustrate the data model neo and how different developments within the neuroscientific community can be supported and integrated into the neuroscientific workflow. neo.core describes a non-cyclic hierarchical relation between classes. The hierarchy reflects the structure of a typical experimental paradigm: in an experimental session (Block), during a trial (Segment), one or more electrodes (RecordingPoint) measure simultaneously the voltage traces of neurons (AnalogSignals). By analyzing these AnalogSignals of RecordingPoints, Spikes are detected and assigned to SingleUnits (corresponding to the neuron causing the SpikeTrain). Events reflect a point in time (Epoch: a time period) where, e.g., a stimulus was presented. We can map data (grey) to different file formats but also to relational databases. Neuroscientists analyze data using their custom software (neuroscientific SCRIPTS). To keep the individual workbench as independent as possible, specific analysis tools can be integrated into the database. For example, in a PostgreSQL instance entire simulators like pyNEURON [11] or analysis packages like 'NeuroTools' (both based on Python: grey stripes) can be used, because PostgreSQL supports Python as a procedural language.
Fig. 4. 'Gluing' different data models on the backend side. The schema at the bottom left reflects the data model neo, whereas the schema at the top right reflects the data model 'attr', which stores data about controlled variables of the experimental protocol. Both models stay unchanged to keep software dependencies valid. They are related by an additional relation which is 'gluing' them together. With this extension at the backend and a corresponding extension at the script level, scientists can analyze neo data taking the corresponding attr data into account without changing the core structure of the individual data models.
This illustrates how a data model can reflect data acquisition. For data analysis, the neuroscientific community has developed packages (e.g. 'NeuroTools') that provide tools to analyze these data. In Fig. 3 we illustrate the structure for integrating these packages, neo, and the data storage.

3.2 Extending the Model
Our approach is highly modular: data models are kept independent, and a way is provided to relate different data models with each other. This supports a decentralized development that is at the same time easily extensible for individual requirements. Different data models are combined by 'glues', which do not modify the structure of the data models but supplement them with structure about the relationship between the data. Fig. 4 exemplifies this concept on the level of a database schema: the data model 'attr' stores data about the attributes of an experimental paradigm and is glued to the neo data model.
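A minimal sketch of such a 'glue' at the schema level is given below: a separate relation references both data models by their keys, so neither the neo tables nor the attr tables need to be changed. All table and column names are assumptions chosen for illustration.

```python
# Sketch of 'gluing' two independent data models at the backend: the glue
# relation only references existing keys, so the neo_* and attr_* tables stay
# untouched. All table and column names are illustrative assumptions.
GLUE_DDL = """
CREATE TABLE glue_analogsignal_attr (
    analogsignal_key integer REFERENCES neo_analogsignal (analogsignal_key),
    attribute_key    integer REFERENCES attr_attribute (attribute_key),
    PRIMARY KEY (analogsignal_key, attribute_key)
);
"""

# Analysis code can then select raw data by experimental attributes without
# changing the structure of either data model:
JOIN_QUERY = """
SELECT s.analogsignal_key, s.signal
FROM neo_analogsignal AS s
JOIN glue_analogsignal_attr USING (analogsignal_key)
JOIN attr_attribute AS a USING (attribute_key)
WHERE a.name = %s AND a.value = %s;
"""
```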
3.3 Application and Examples
Through the SQL interface the data are accessible from various applications. This enables the researcher to integrate data storage and data management with the analysis tools used in the laboratory (Fig. 5). Since not only the recorded data
Fig. 5. Accessing the data from different applications. Top: Data visualization and analysis tools can use the database interface to access both data and metadata. The figure shows different applications (the web-based G-Node data platform, OpenElectrophy, and OpenOffice) accessing and/or plotting the same data, a recorded voltage signal, from the same database. Bottom: Using a specified data model makes it possible to write adapted modules that use structural and descriptive metadata in order to collect all the information needed, e.g., for plotting the data.
but also all the metadata are accessible in this unified way, all information about the data is available. This provides high efficiency for the entire data analysis workflow, from first inspection of the recorded data, to adequate selection of data according to experimental parameters, and even to the step of writing the publication, for which the necessary information about the experiment is readily available.
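The following sketch illustrates this kind of unified access from an analysis script: a signal is selected via its metadata and plotted with matplotlib. Database, table, and column names are again hypothetical and only indicate the general pattern.

```python
# Sketch: selecting a signal by its metadata and plotting it. Database,
# table, and column names are hypothetical and only indicate the pattern.
import numpy as np
import matplotlib.pyplot as plt
import psycopg2

conn = psycopg2.connect(dbname="ephys_lab")
with conn.cursor() as cur:
    cur.execute(
        "SELECT a.signal, a.frequency "
        "FROM neo_analogsignal AS a "
        "JOIN neo_recordingpoint USING (recordingpoint_key) "
        "WHERE neo_recordingpoint.description = %s LIMIT 1",
        ("intracellular, cell 7",),
    )
    row = cur.fetchone()
conn.close()

if row is not None:
    signal, frequency = row
    samples = np.asarray(signal)
    time_s = np.arange(len(samples)) / frequency   # sample times in seconds
    plt.plot(time_s, samples)
    plt.xlabel("time [s]")
    plt.ylabel("amplitude")
    plt.show()
```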
4 Conclusion and Outlook

The neurosciences are faced with increasing amounts and complexity of experimental data. To cope with these challenges and to make data analysis, re-analysis, and data sharing efficient, data together with metadata should be managed and accessed in a unified and reproducible way, so that the researcher can concentrate on the scientific questions rather than on problems of data management. We have presented a solution to this challenge that builds on database technology to provide access to all information necessary to meaningfully analyze electrophysiological data already at the local neuroscientific workbench. We built a system that can be integrated into the scientist's workflow without barriers and that keeps the scientist within the loop. As a key technology we use an RDBMS, more precisely a PostgreSQL server. It ensures referential integrity and data consistency, and additionally the relationships are stored within the database. At the same time it allows the use of functions and triggers written in Python or R, which are core languages within the neuroscientific community. We illustrated the application of our approach with the data model neo, which was developed by the community (http://www.neuralensemble.org/) for dealing with neurophysiological data. In addition, we demonstrated how neuroscientists can extend it in a modular way. This approach will enable researchers to seamlessly integrate data access into the laboratory workflow, exchange data with collaborators, and efficiently perform management and selection of data in a systematic and largely automatized fashion for data sharing and analysis.

Acknowledgments. We thank Holger Blasum, Felix Franke, Christian Garbers, Nina Hubig, Oliver Janitza, Christian Kellner, Philipp Meier, Roman Moucek, Raphael Ritz, Michael Schmuker, Adrian Stoewer, Michael Stransky and Alvaro Tejero Cantero for discussions and helpful criticism, the Neural Ensemble group for discussions on the neo objects, and the open source community for providing great software. Supported by BMBF Grant 01GQ0801.
References
1. Amari, S.-I., Beltrame, F., Bjaalie, J.G., Dalkara, T., De Schutter, E., Egan, G.F., Goddard, N.H., Gonzalez, C., Grillner, S., Herz, A., Hoffmann, K.-P., Jaaskelainen, I., Koslow, S.H., Lee, S.-Y., Matthiessen, L., Miller, P.L., Da Silva, F.M., Novak, M., Ravindranath, V., Ritz, R., Ruotsalainen, U., Sebestra, V., Subramaniam, S., Tang, Y., Toga, A.W., Usui, S., Van Pelt, J., Verschure, P., Willshaw, D., Wrobel, A.: Neuroinformatics: the integration of shared databases and tools towards integrative neuroscience. Journal of Integrative Neuroscience 1(2), 117–128 (2002)
2. Cannon, R.C., Gewaltig, M.-O., Gleeson, P., Bhalla, U.S., Cornelis, H., Hines, M.L., Howell, F.W., Muller, E., Stiles, J.R., Wils, S., De Schutter, E.: Interoperability of Neuroscience Modeling Software: Current Status and Future Directions. Neuroinformatics 5(2), 127–138 (2007)
3. Codd, E.F.: A relational model of data for large shared data banks. 1970. M.D. Computing: Computers in Medical Practice 15(3), 162–166 (1970)
4. Garcia, S., Fourcaud-Trocmé, N.: OpenElectrophy: An Electrophysiological Data- and Analysis-Sharing Framework. Frontiers in Neuroinformatics 3(May), 14 (2009)
5. Gardner, D., Knuth, K.H., Abato, M., Erde, S.M., White, T., DeBellis, R., Gardner, E.P.: Common data model for neuroscience data and data model exchange. J. Am. Med. Inform. Assoc. 8(1), 17–33 (2001)
6. Gelbart, W.M., Crosby, M., Matthews, B., Rindone, W.P., Chillemi, J., Russo Twombly, S., Emmert, D., Ashburner, M., Drysdale, R.A., Whitfield, E., Millburn, G.H., de Grey, A., Kaufman, T., Matthews, K., Gilbert, D., Strelets, V., Tolstoshev, C.: FlyBase: a Drosophila database. The FlyBase consortium. Nucleic Acids Res. 25(1), 63–66 (1997)
7. Gibson, F., Austin, J., Ingram, C., Fletcher, M., Jackson, T., Jessop, M., Knowles, A., Liang, B., Lord, G., Pitsilis, P., Periorellis, P., Simonotto, J., Watson, P., Smith, L.: The CARMEN virtual laboratory: Web-based paradigms for collaboration in neurophysiology. In: 6th International Meeting on Substrate-Integrated Microelectrodes 2008 (2008)
8. Grewe, J., Wachtler, T., Benda, J.: odML format and terminologies for automated handling of (meta)data. In: Front. Neurosci. Conference Abstract: Neuroinformatics 2010 (2010)
9. Gupta, A., Bug, W., Marenco, L., Qian, X., Condit, C., Rangarajan, A., Müller, H.M., Miller, P.L., Sanders, B., Grethe, J.S., Astakhov, V., Shepherd, G., Sternberg, P.W., Martone, M.E.: Federated access to heterogeneous information resources in the neuroscience information framework (NIF). Neuroinformatics 6(3), 205–217 (2008)
10. Herz, A.V.M., Meier, R., Nawrot, M.P., Schiegel, W., Zito, T.: G-Node: an integrated tool-sharing platform to support cellular and systems neurophysiology in the age of global neuroinformatics. Neural Netw. 21(8), 1070–1075 (2008)
11. Hines, M.L., Davison, A.P., Muller, E.: NEURON and Python. Frontiers in Neuroinformatics 3(January), 1 (2009)
12. Ljungquist, B., Petersson, P., Schouenborg, J., Johansson, A.J., Garwicz, M.: A novel framework for storage, analysis and integration through mediation of large-scale electrophysiological data. In: 5th International IEEE/EMBS Conference on Neural Engineering (2011)
13. Paradis, E., Claude, J., Strimmer, K.: APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20(2), 289–290 (2004)
14. Stoesser, G., Sterk, P., Tuli, M.A., Stoehr, P.J., Cameron, G.N.: The EMBL nucleotide sequence database. Nucleic Acids Res. 25(1), 7–14 (1997)
15. Teeters, J.L., Harris, K.D., Jarrod Millman, K., Olshausen, B.A., Sommer, F.T.: Data sharing for computational neuroscience. Neuroinformatics 6(1), 47–55 (2008)
Management of Genotyping-Related Documents by Integrated Use of Semantic Tagging

Alessio Bechini and Riccardo Giannini
1 Univ. of Pisa, Dept. of Information Engineering, largo Lazzarino, 56126 Pisa, Italy
[email protected]
2 Univ. of Pisa, Dept. of Surgery, via Paradisa 2, 56124 Pisa, Italy
[email protected]
Abstract. A widespread need is present in molecular biology laboratories for software systems to support the internal management of data and documents. A typical case is represented by genotyping procedures, which produce a large amount of documents whose content may represent a potentially important knowledge base. The exploitation of such information requires a proper classification of the elements in the knowledge base, and this can be effectively achieved using concepts and tools from research on the Semantic Web. In particular, genotyping-related documents can be handled through a DMS (Document Management System) that is also able to deal with semantic metadata, e.g. in the form of tags. The use of semantic tagging at this operating level is currently hampered by the lack of proper tools. In this paper, based on experience from a practical case, we present an integrated approach to manage relevant genotyping documents and to deal with their semantic tagging. A preliminary study on the test procedures workflow is crucial to understand the document production processes. The employed semantic annotation makes use of terms taken from domain ontologies in the biomedical field. The annotation tool must be seamlessly integrated in the supporting DMS; the tool flexibility and usability guarantee a low overhead for the annotation process, paving the way for a widespread adoption of semantic tagging for genotyping-related documents. Keywords: Laboratory Information Management Systems, document management, semantic tagging, biomedical ontologies, biomolecular test workflow.
1 Introduction

Genotyping procedures used in a biomolecular laboratory produce a large amount of data and documents, and it is difficult to manage such information in a disciplined way [28]. This task can be performed by resorting to Laboratory Information Management Systems (LIMS) [21,28,27,17], which mainly deal with keeping track of samples and with capturing data at different steps of the overall procedure. Conversely, document management has so far not received specific attention in this context. The scenario is further complicated by the fact that
heterogeneous file formats are employed (often depending on the available instrumentation) and by the possible lab-specific workflow followed to obtain the final results. In addition to the basic requirements that data and documents have to be properly collected and that tracing issues have to be guaranteed, it is important to design and implement a data/document repository whose elements can be effectively accessed and searched. It has been shown in recent literature that an organized structuring of genotype-like data is crucial to foster new findings [18,33], especially in association study analysis; this observation pushes us to investigate better solutions for information classification and retrieval. In the field of research literature, the central ideas of the Semantic Web [10] have found significant application to the management of biomedical publications (and research results in general). The biomedical research community has increasingly experienced the need for a shared, precise terminology, and not surprisingly the influence of Semantic Web principles has contributed to an amazing growth of standardized controlled vocabularies and ontologies [11]. The possibility to formally codify established knowledge into ontologies may also represent a means and an opportunity to relate findings in a publication to a precise network of grounded concepts. This kind of relation is obtained through semantic annotation. A semantically annotated corpus of publications and, in general, research results is a massive knowledge base whose full exploitation is likely still far to come [32]. An even more substantial advantage could come from the semantic classification of heterogeneous documents from research, or even from ordinary diagnostic tests. In fact, it is reasonable to maintain a semantically searchable knowledge base at the laboratory level, e.g. using a 'semantically enabled' LIMS, at least for the documents produced during test executions [9]. In a wider perspective, a common metadata framework for such knowledge bases can definitely be considered an enabling technology for their exploitation in a coordinated and integrated way. Standardization of access procedures to Document Management Systems (DMS) has recently received particular attention from the involved industrial players, e.g. leading to the definition of the CMIS directives [13]; such efforts are paving the way to the actual federation (at least under the search perspective) of locally developed document bases. Following this approach in the biomedical domain, researchers and clinicians might be given a new extraordinary source of information. Semantic search can be carried out only if the documents in the knowledge base have been properly annotated with unambiguous, standard terms [15]. Standardization efforts in this direction led to the development of controlled vocabularies such as the well-known MeSH [2] in the biomedical field. This kind of document annotation (or tagging) turns out to be useful only if it is performed in a very accurate way, so it requires both time and expertise from the tagging operators. Semantic tagging at this operating level has been deemed unpractical because of the difficulties in selecting the correct terms within complex ontologies [11], and also because of the lack of proper tools. Genotyping tests can be considered a paradigmatic case in this setting, because they require different instruments
through the progression of the test procedures, and different types of documents are produced along this path. A general way to deal with such documents may rely on a Document Management System (DMS), which provides specific functionalities to deal with issues around archiving, classification and search. As we are interested in wide-range exploitation of our documents, we suppose that our DMS is embedded in an Enterprise Content Management system (ECM), which gives us multiple additional ways to make use of (and possibly share) our documents. The archiving step in the ECM can be regarded as the proper point to introduce annotations. In this work, we propose to use semantic tagging on documents produced throughout ordinary genotyping test procedures; this activity must be integrated in the tools employed to manage such documents. Such an approach can be successful only if the tagging procedure is practical for the end user. Thus the supporting software tools have to be handy, and they are asked to be as unintrusive as possible with respect to the ordinary course of the archiving activities. In other words, the supporting tools must be able to smoothly introduce the semantic tagging features into the ECM in use. The system described hereafter has been developed to meet this kind of requirement. In this paper, an approach to manage and semantically annotate documents throughout genotyping procedures is presented. After a review of related works, Section 3 is devoted to a formal analysis of the process workflow. Section 4 discusses issues in document annotation, considering both the semantic viewpoint and the particular characteristics of the biomedical domain. The description of the proposed approach and the structure of the supporting software framework are given in Section 5, and details on the actual employment in a real-world, typical case can be found in Section 6. Conclusions are drawn in Section 7.
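To give a flavour of what low-overhead tagging can look like at the document-management level, the sketch below attaches ontology terms to an archived genotyping report as plain metadata; the tag structure, the placeholder term identifiers, and the archive helper stand in for whatever DMS/ECM interface is actually used and are not part of the system described in this paper.

```python
# Hypothetical sketch: attaching semantic tags (ontology terms) to a document
# at archiving time. The tag structure, the placeholder term identifiers, and
# the archive() helper stand in for a real DMS/ECM interface.
from datetime import date


def make_tag(ontology, term_id, label):
    """A semantic tag is a reference to a term of a controlled vocabulary."""
    return {"ontology": ontology, "term_id": term_id, "label": label}


document = {
    "file_name": "sequencing_report_sample_0421.pdf",
    "archived_on": date.today().isoformat(),
    "workflow_phase": "sequence analysis",            # phase of the test workflow
    "semantic_tags": [
        make_tag("MeSH", "<descriptor-id>", "Neoplasms"),              # placeholder id
        make_tag("MeSH", "<descriptor-id>", "Genotyping Techniques"),  # placeholder id
    ],
}


def archive(doc, repository):
    """Placeholder for the DMS call that stores the file plus its metadata."""
    repository.append(doc)


if __name__ == "__main__":
    repo = []
    archive(document, repo)
    print("archived", repo[0]["file_name"],
          "with", len(repo[0]["semantic_tags"]), "semantic tags")
```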
2 Related Works

The daily activity of a biomolecular laboratory strongly needs software systems to capture and handle the generated data and documents, as well as to track samples [21]. Moreover, sometimes also the specific steps in the overall genotyping activities should be formally described [23]. LIMS have been proposed as a solution for these problems, and they have been built up out of simple tools to make them usable by as many users as possible (even by those with minimal informatics skills). A natural, basic instrument to support data keeping is a DBMS; its adoption in a very simple way was proposed in early works [27], and later it has been further developed into more organic structures [21]. In other cases, a collection of supporting programs (or macros) has been created [28]. In any case, there is common agreement that a proper software suite is an invaluable help in supporting quality control. Recently, efforts have been spent in developing comprehensive frameworks to be adapted to each particular laboratory setting [20]. The main focus of LIMS has been the support of laboratory procedures, and minimal emphasis has been put on the management of the potential value of the collected documents. Some hints on how to tackle this last issue may come from research on the Semantic Web.
Fig. 1. A BPMN description of the overall procedure addressed in the work. Each test session is usually started upon a pool of requests, and it operates on the corresponding biological samples. The BPMN gateways (diamond-shaped symbols) refer to quality checks on outcomes.
The founding ideas of the Semantic Web have received widespread interest in the biomedical research community. Approaches to data and document management in this environment have been deeply influenced by the vision of semantic searches [10,15], and bio-ontologies have also been recognized as a key resource to organize and standardize knowledge [38,11]. Following this mainstream in biomedical data management, comprehensive semantic models have been proposed primarily to tackle interoperability issues [14]. So far, these efforts have produced conceptual models suited to the design of new information management systems, rather than fitting the structure of existing software tools. A typical example of the application of ontologies in a web-based tool for biological investigation is an early work on the “RiboWeb” system [3], dedicated to the study of the ribosome: it used an ontological approach to formally represent different features of the experiments that can be performed online, using computational resources over available datasets. The collaborative aspects of this kind of tool are immediately evident. It is worth underlining that researchers in molecular biology, and particularly those involved in genotyping technologies, recognized early on the need to rely on a supporting database, properly structured to ease the automation of data feeding [26,21]. However, methods for data
storage have so far been considered mere side activities with respect to the investigation goals [26], and a cross-field vision is often missing. To the best of our knowledge, no Laboratory Information Management System (LIMS) leverages ontology terms for tagging documents produced along the development of bio-molecular tests, despite the fact that a compelling need for a precise structuring of information from biomedical experiments has been clearly pinpointed and addressed [24]. Nevertheless, approaches similar to the one shown in this paper have been used in very different domains, for example in ebXML repositories for enterprise documentation [8]. In that case, the system proved to be both user-friendly and effective for document archival and retrieval, although the available domain ontologies were far less abundant. Regarding the development and standardization of both foundation and domain ontologies to be used in biomedical research, a great deal of work has been done, and organizations exist that take care of the definition and maintenance of bio-ontologies (in particular, OBO [35] is worth recalling).
3 Analyzing Genotyping Tests Workflow
The analysis of the actual workflow carried out to complete a genotyping test is an integral part of the methodology followed in this work. Such analysis is important because, although the overall procedures are clearly defined, their application can be slightly different in each laboratory. In particular, the coexistence of different types of test procedures may require adapting the workflow in order to reduce the average test completion latency (or to increase throughput). The workflow analysis is aimed at identifying the produced documents and data that are required, or worth being archived, and possibly tagged with standard ontology terms. We have chosen BPMN [30] as the conceptual tool for workflow modeling. It is a formal, graphical, and self-evident notation suitable for modeling general business and production processes [40], and several software applications are currently available to support BPMN-based graphical modeling and simulation. The present work takes a specific real-world case as a reference, and most of the following models are based on it: our target lab mainly deals with samples of neoplastic tissues, and it is specialized in detecting mutations that affect the choice of the pharmacological treatment. Although the discussion may benefit from the reference to a concrete setting, all the case-specific, excessively fine-grained details have been removed for the sake of generality. A top-level representation of the workflow is shown in Fig. 1. It contains macro-activities or phases, numbered from 0 to 4, which will be further described in each subprocess specification, as shown also in Fig. 2-6. Quality checks are spread along the complete workflows, and they must be based on information on the progression and on the outcome of the analysis sub-tasks: the only way to obtain such crucial data is via the inspection of the produced documents. Hence, the quality control system is deeply influenced by the effectiveness of the document management system.
Fig. 2. The check-in phase. Sections can be prepared outside of the laboratory. Compliant sections are labeled for identification in the downstream tasks.
Several variants to the described procedures are possible, and they are mostly due to the nature of the specific test actually performed and to the instrumentation used for it. Despite the possible multiple variants, the process outline can be considered general enough to keep its validity for any activity of this kind. In particular, it is able to show how to uncover crucial points where semantic tagging is recommendable. The whole workflow (Fig. 1 and Fig. 2) is cyclically repeated to accommodate the continuously incoming requests. It starts with the reception of a test request, which contains information on what particular kind of test and analysis should be performed, along with patient data and usually also with the disease description. The test request comes with the corresponding biological samples, or sometimes with material ready to be processed, as shown in Fig. 2. Typically, any single test cycle operates over a pool of multiple samples from several patients (but not necessarily the same pool throughout the entire workflow). For this reason, whenever possible, a preliminary buffering of sample sections is implemented. The start of an analysis session is decided by the laboratory supervisor, upon inspection of the pool of collected requests. This part of the procedure also faithfully describes what happens for investigations related to research issues, although in this case the requests are generated within the laboratory. Ultimately, the workflow of a test performed for research purposes does not differ much from that of the ordinary diagnostic ones. A “core” test session, executed downstream of the check-in phase (Fig. 2), can be thought of as composed of four main phases: 1) nucleic acid extraction, 2) amplification, 3) sequencing, and finally 4) reporting.
Fig. 3. Representation of the activities in the Nucleic Acid Extraction Phase; two main paths can be followed. The quality check is performed according to the results from a spectrophotometer.
Adjacent phases are separated by quality checks on the results and, in case of unsatisfactory outcomes, the control flow must be redirected backward, in order to possibly recover the procedure. Just to emphasize their role, quality checks are shown both in the overall model (Fig. 1) and in each sub-process (Fig. 2-6). It must be underlined that we take into account two possible technologies to detect DNA sequences/mutations: traditional Sanger sequencing, and pyrosequencing. For this reason, phase 3 has been split in two. Each phase is carried out making use of specific machinery, and each instrument produces documents, aimed at describing the experiment outcomes and at characterizing the work progression throughout the experiment. The first activity in phase 1 (i.e. nucleic acid extraction, see Fig. 3) is the preparation of samples out of the sections previously obtained. Such preliminary activities are completed taking into account both the patient data and the indications in the test request. Only suitable samples can reach the extraction task. Here, samples are labeled and the binding between patients and samples must be recorded. As part of an intermediate quality check, a spectrophotometer is used, and an image with the absorbance spectrum of the sample is collected from the instrument, as well as a textual report of this subtask. The spectrophotometer results are used to assess the concentration and quality of the extracted nucleic acid. Phase 2 (i.e. amplification, see Fig. 4) is carried out by means of PCR runs. Depending on the type of sequencing procedure the sample should undergo in the following phase, two alternative paths exist in this sub-process; the use of
Fig. 4. Diagram for the DNA Amplification Phase. Two distinct paths are followed, depending on the specific procedure to be carried out next (Sanger sequencing or Pyrosequencing).
Real-Time PCR makes it possible to skip an electrophoretic test. Of course, input data for this task describe the DNA segments to be selected (and thus amplified). This information is crucial, and it depends on the test rationale from the requester: research, diagnosis, decisions on pharmacological treatments, etc. As shown in the upper path of Fig. 4, snapshots of gel electrophoreses are generated and collected to assess the outcome of the PCR activity. Typically several electrophoresis runs are present in the same picture, so one single document aggregates information on multiple samples. Also in this case, it is important to keep track of the binding between runs and patients. Regarding the generality of the model for this phase, we must underline that the workflow is shaped essentially the same way when RNA has to be processed (in particular with one-step RT-PCR). Moreover, if cDNA has to be dealt with, an additional reverse-transcription task has to be inserted. In our case, the sequencing phase is made of two alternative sub-tasks: Sanger sequencing and pyrosequencing. Although the employed technologies and the respective outcomes can be considered quite different, the two activities share a common way to proceed, as is evident in Fig. 5. Sanger sequencing requires a preliminary task for the incorporation of fluorescent dideoxynucleotides and a corresponding purification, while in pyrosequencing a reaction setup activity is present as well; then, the plate setup follows in both cases. Once again, the binding between samples and patients must be recorded, producing an operating document indicated as the sample association matrix. The
Fig. 5. The Sequencing Phase groups two distinct tasks that require different instrumentation. In both cases, the internal activities are arranged in a similar way.
subsequent sequencing/pyrosequencing run produces documentation on the read sequence/mutation diagrams and on the run progression itself (in Fig. 5 such documents are specified for Sanger sequencing and pyrosequencing). The inspection of such results, which can be done in a semi-automatic way (human intervention is almost always necessary), leads to the determination of the target sequence/mutation, typically reported in FASTA format for classical Sanger sequencing. The final phase (see Fig. 6) is aimed at formulating
Table 1. List of documents obtained throughout the genotyping procedures. Depending on the specific document class, a file can correspond to one or multiple biological samples (and thus to one or multiple patients); the last column marks each class as Patient-Specific (PS) or Aggregated (A).

Document class              Phase   Content Type       File Format         PS/A
Sample ID & labeling        Ph. 0   text               .rtf; .doc          PS
Sample ID & labeling data   Ph. 0   data sheet         .xls; .db           A
Abs. spectrum image         Ph. 1   image              .jpg; .tiff; .bmp   PS
Sample features report      Ph. 1   text               .rtf; .doc          PS
Sample features data        Ph. 1   data sheet         .xls; .db           A
RTime PCR report            Ph. 2   PCR log/data       .rex                PS or A
RTime PCR image             Ph. 2   image              .jpg; .tiff; .bmp   PS or A
EF Run Snapshot             Ph. 2   image              .jpg                PS or A
Sequencing log              Ph. 3   run log            .phd.1              PS
Sequence data               Ph. 3   run data           .seq                PS
Sequence-electr             Ph. 3   electropherogram   .ab1                PS
Sequence-text               Ph. 3   text               .fasta              PS
Pyroseq log                 Ph. 4   run log            (propr.)            A
Pyroseq data                Ph. 4   run data           (propr.)            A
Pyroseq snapshot            Ph. 4   image              .bmp; .jpg          PS
Pyroseq report              Ph. 4   text               .rtf; .doc          PS
Final report                Ph. 5   text               .rtf; .doc          PS
a precise, concise response from the analysis outcomes. This is achieved via the comparison of the retrieved sequence with the associated known wild type. Such a comparison is aimed at identifying and characterizing possible mutations. A standard report on this finding, written adopting the proper standards for mutation description, is filed into a repository that is also accessible by auxiliary personnel, so that the paper-printed report for the requestor can be produced as well. Having completed the description of the workflow phases, we can identify which documents (and which of their characteristics) are produced along the execution of the tests. Table 1 lists the types of such documents (indicated as “Document class”), the origin phase, the type of the contents, the file format, whether the information is related to one or multiple patients/samples, and possible ontologies whose terms could be used to tag them.
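As a purely illustrative aside, the information summarized in Table 1 can be encoded in a machine-readable catalogue that a management tool could use to recognize document classes at upload time. In the following Python sketch the class names, phases and file formats follow the table, whereas the ontology suggestions are hypothetical examples anticipating the discussion in Sect. 4.3; none of the names are part of the system described in this paper.

# Illustrative, machine-readable excerpt of Table 1.
DOCUMENT_CLASSES = {
    "Abs. spectrum image": {"phase": 1, "content": "image",
                            "extensions": {".jpg", ".tiff", ".bmp"},
                            "suggested_ontologies": ["DOID"]},        # assumption
    "EF Run Snapshot":     {"phase": 2, "content": "image",
                            "extensions": {".jpg"},
                            "suggested_ontologies": ["GO", "SO"]},    # assumption
    "Sequence-text":       {"phase": 3, "content": "text",
                            "extensions": {".fasta"},
                            "suggested_ontologies": ["GO", "SO"]},    # assumption
    "Final report":        {"phase": 5, "content": "text",
                            "extensions": {".rtf", ".doc"},
                            "suggested_ontologies": ["DOID"]},        # assumption
}

def candidate_classes(filename):
    """Return the document classes whose file formats match the uploaded file."""
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    return [name for name, meta in DOCUMENT_CLASSES.items()
            if ext in meta["extensions"]]

print(candidate_classes("sample_042_absorbance.jpg"))

Such a catalogue could then be used, for example, to propose a destination folder or a set of candidate tags as soon as a file enters the repository.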
4 Management of Genotyping-Related Documents
One of the most significant problems with the described workflow is that all the produced documents are scattered throughout several different, heterogeneous systems alongside the analysis instruments. As a consequence of this kind of physical distribution, it is often difficult to perform cross-platform studies and statistics involving data from multiple phases. An integrated, distributed system for the homogeneous management of the documents involved in the genotyping
Fig. 6. In the Report Generation Phase, the final test results are presented according to a standard outline.
activity would be an ideal solution, and semantic functionalities could be built upon it. Although integrated software suites for collecting such documents currently exist, they are not widespread at present (especially in small and medium-sized laboratories) because of cost and management issues. From a functional perspective, they can be viewed as typical Document Management Systems (DMS): at the core, a repository containing structured and non-structured documents, and an engine working on top of it that provides features like searching and versioning. Typically, DMSs are integral parts of Enterprise Content Management (ECM) systems, which provide all the security and distribution functionalities usually required in our target environment. Moreover, integration approaches have also been studied for the development of collaborative, distributed environments in other application fields, namely e-Business [7]. Therefore, we propose to enhance the organization of this distributed repository by leveraging a technique recently introduced in the document management field: semantic tagging.
4.1 Tagging and Semantic Tagging
The assignment of metadata items to a given resource is usually known as annotation. A “resource” may be any kind of information carrier, e.g. a text document, a spreadsheet, a photo, or even a URL. When the involved metadata item is a word, or a short sentence, it is often called a tag, and the corresponding annotation activity is referred to as tagging. The intended goal of a tagging system is to describe all the resources in a knowledge base in a meaningful way, to ease categorization and future retrieval [39]. Modern ECM systems offer the possibility to perform tagging on the collected documents. The implementation of a tagging system can be based on an underlying relational database that keeps track of the correspondence between document references/IDs and tags. The actual structure of this supporting tool can affect the performance of the tagging system and
the searching procedure, but not the functional characteristics of the ECM. At present, we focus on functional aspects, leaving the possible treatment of performance issues to future work. In many popular social networks over the Web, tagging is carried out in a collaborative fashion [12], and any user can add tags to any resource. This eventually leads to the emergence of spontaneous bottom-up classifications, known as “folksonomies” [36]. Apart from this social facet, tags are useful even when assigned by a single individual; they are an alternative to traditional folder-based classification, and allow each object to be characterized from multiple viewpoints. However, the categories they introduce are flat, because ordinary tags, unlike folders, have no hierarchical structure. At least, this last statement holds for traditional tagging systems, where users can freely choose the terms to use as tags. Some variants of tagging systems restrict the users’ choice to a controlled vocabulary of some sort; moreover, this vocabulary may contain additional information, such as the relationships among terms (above all, subsumption, or “is-a”, which allows the definition of hierarchies). This enhanced vocabulary is usually formalized in an ontology. An ontology is an engineering artefact that is able to formally represent, within a given knowledge domain, all the interesting entities/concepts, their properties and their relationships. The most popular formalism for expressing ontologies is OWL (Web Ontology Language) [6]. As OWL documents are typically serialized as XML, they can readily be used in different software applications/contexts. Semantic tagging refers to precise annotations that can be leveraged in performing a semantic search [39]. Such precise tags can be chosen out of a given ontology, thus obtaining a twofold advantage: first, the meaning of the tag is unambiguously defined; second, relationships between terms in the ontology may be exploited in the search phase to improve the result quality, in terms of both recall and precision metrics. Trivially, for example, if the ontology embeds the information that a “trabecular follicular adenocarcinoma” “is-a” “thyroid neoplasm”, we can infer that all queries involving the latter term should also consider documents tagged with the more specific term “trabecular follicular adenocarcinoma”. Ontologies have been thoroughly investigated in the recent literature. In particular bio-ontologies, i.e. those related to biomedical domains, are expected to play increasingly important roles in IT support to biomedical investigations [32], and to diagnostic/therapeutic assessments as well. The state of the art in the field of bio-ontologies is relatively advanced, and semantic search engines are becoming more and more available. These two observations suggest that favourable conditions are now present for the introduction of Semantic Web techniques in the biomolecular test domain, and thus the integration of semantic tagging tools in LIMS becomes a fundamental, strategic step.
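As a minimal illustration of the relational support mentioned at the beginning of this subsection, the following Python/SQLite sketch keeps track of the correspondence between document references and tags, recording for each semantic tag the ontology the term was taken from. Table and function names are our own illustrative choices, not part of the Alfresco-based implementation described later.

import sqlite3

# Toy relational store for tags: each row links a document reference to a term
# and, for semantic tags, to the ontology the term was taken from.
conn = sqlite3.connect("tags.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS tag (
    doc_id   TEXT NOT NULL,              -- repository reference of the document
    term     TEXT NOT NULL,              -- e.g. 'thyroid neoplasm'
    ontology TEXT NOT NULL DEFAULT '',   -- e.g. 'DOID'; empty for free-text tags
    PRIMARY KEY (doc_id, term, ontology)
);
""")

def tag_document(doc_id, term, ontology=""):
    conn.execute("INSERT OR IGNORE INTO tag VALUES (?, ?, ?)",
                 (doc_id, term, ontology))
    conn.commit()

def documents_tagged_with(terms):
    """Plain tag-based search: documents annotated with any of the given terms."""
    placeholders = ",".join("?" * len(terms))
    cur = conn.execute(
        "SELECT DISTINCT doc_id FROM tag WHERE term IN (%s)" % placeholders,
        list(terms))
    return [row[0] for row in cur]

Keeping the ontology reference next to the term is what distinguishes a semantic tag from a free keyword, and it is what later allows is-a relationships to be exploited at search time.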
4.2 Relying on Standardized Ontologies in the Biomedical Field
So far, the bioinformatics community has been particularly active in the definition and usage of ontologies [11,19,38]: standardized and easily interchangeable domain models are a pressing need when handling extremely large knowledge
bases across multiple, closely interacting research groups. The incorporation of bio-ontologies in data annotation systems enables the integration of complex scientific data and supports the consistency of data management activities [11,22]. The OBO initiative (Open Biological and Biomedical Ontology), for example, is a significant classification effort covering many aspects of the biomedical field [35]. It includes various popular projects, most notably GO (the Gene Ontology) [38,5], perhaps the most famous instance of the kind. Started in 1998 with the genomic description of only three organisms, GO now encompasses information about genes and gene products of a plethora of animals, plants and microorganisms. The growing usage of GO terms in scientific publications has been monitored [22], and the results of such research point out the crucial role played by this conceptual tool in biology research. Other significant examples of ontologies in the biomedical field are Reactome, for biological pathways, the Disease Ontology, for human diseases, BioCyc, for both pathways and genomes, and the recently proposed Sequence Ontology (SO), addressing biological sequences. Biomedical research has pushed the development of various kinds of ontologies, and the presence of concurrent efforts in the field has led to multiple ontology formats, defined independently of the OWL standard or of others used in information science: e.g., OBO is a widespread textual representation characterized by low redundancy and high readability; BioPAX (Biological Pathway Exchange) is another example of an exchange format, restricted to the domain of biological pathways. The systems biology field is particularly affected by this proliferation of formats [37]. The standardization of ontology languages has subsequently become a compelling need, in order to ease information exchange. Although minor differences may exist across particular formalisms, the possibility to flexibly use biomedical ontologies in a wider context of software tools has driven a widespread convergence towards OWL [4]. Moreover, well-developed controlled vocabularies, although their structure is not as rich as that of typical ontologies, have also been recast into OWL: an important example is the NCI Thesaurus, often referred to as NCIT. Several issues in using terms from NCIT for annotation purposes have been reported [25], based on a study of the employment of NCIT terms to cover all the kinds of entities that are fundamental to an ontology of colon carcinoma. A wider discussion on the semantic model for cancer-related terms in NCIT can be found in [34].
4.3 Applying Tagging to Genotyping Test Procedures
Annotation procedures can be automated in several ways, and often the document contents can be inspected to select possible keywords to be used as metadata. These approaches are often based on sophisticated Natural Language Processing (NLP) techniques that are quite effective. Because of the nature of most of our target documents, however, such approaches cannot be systematically applied: their textual content is very limited. Fig. 7 shows an example of the variety of contents of documents from phases 1-3. On the other hand, whenever it is straightforward to extract metadata from a document that is rigidly organized, this operation can be done by using the standard functionalities provided by
Fig. 7. Some examples of typical contents of documents produced through phases 1-3. The first column corresponds to a sample that undergoes Sanger sequencing, the other to a pyrosequencing path.
the supporting ECM. A tagging system that supports the whole workflow for genotyping procedures can prove useful at several levels. Apart from these basic observations, ordinary tagging can be used to guarantee document/data lineage retrieval in an integrated way, with no resort to a side relational system. A basic usage of tags may consist of keeping track of all documents related to the same selected sample. Practically, a one-to-one relation between samples and tags has to be established, and all documents containing results for a given sample can be associated with the related tag. This approach is intended to introduce a uniform means to bind samples (and the instruments’ runs over them) to the initial request and the patient’s data, despite the fact that different binding rules may actually be used in laboratory practice throughout the workflow. This method can be used profitably at all the workflow points where a “binding” document is produced (see Fig. 2-6). In the initial request, document tagging with the patient ID can be complemented with a number of semantic tags that describe the actual state of the patient (e.g. diseases) and/or the related biological samples. Here, recourse to terms from the Disease Ontology (DOID), or similar, is recommended. This specific annotation is supposed to be properly propagated at the beginning of phase 1 and beyond, wherever information on the request and the patient is used. It is not a duty of the tagging tool to enforce the usage of specified ontologies, but
instead the laboratory manager is asked to provide internal operating recommendations/guidelines on how to proceed with document annotation. In the amplification phase, all the documents related to the PCR activities (both end-point and real-time ones) can obviously be annotated with a formal reference to the sequence of nucleotides under investigation. Natural target ontologies for this purpose are GO [5] and more specific ones. As a single laboratory is usually specialized in genotyping tests in restricted domains, narrow domain ontologies can also be used profitably. In principle, the effectiveness of semantic tagging (and especially of semantic search) is not jeopardized by the use of very specific domain ontologies, as they can be related to more general and well-known ones. However, this may lead to relevant problems in actually implementing an efficient search system. Another important issue is determined by the richness and complexity of some well-known ontologies like GO or DOID: their size hampers their management in memory and their graphical representation on the screen, and thus on-the-fly browsing may be both impractical for the user and challenging for the annotation tool. A feasible approach is to produce proper subsets of the whole, generic ontologies, containing the terms used in the subdomains covered by the laboratory test targets. Such portions have to contain exactly the original terms, but their modest size makes them suitable to be easily handled within the annotation tool. The restriction of an ontology to a sub-domain can be done either manually or programmatically, according to some predefined criteria: in both cases, the new OWL file with the generated subset will be handled as any other ontology, co-existing with the other loaded ones (the “source” ontology included), and the terms extracted from it will refer exactly to the original concepts. Theoretically, a whole collection of increasingly specific subsets can be kept together in separate OWL files. In this way, if a proper term is not found in a restricted ontology, the user can always resort to a superset of it, even if its handling might be more cumbersome. Moreover, sometimes it is also convenient to have recourse to specific ontologies developed on purpose within the laboratory; in this case, it is advisable to make use of a standard nomenclature. The general observations made so far also apply to the classification of the gel electrophoresis snapshots (see Fig. 4); here special care must be paid to the fact that one single document (image) groups up the outcomes from multiple samples. As such samples typically relate to the same sequence, investigated for the same reason, this grouping is not particularly relevant for most of the classification terms. The output documents from the DNA sequencing runs (Fig. 5) are the description of the base-specific intensity profiles (in proprietary and/or open formats), and the log of the sequencing activity. Tags here might also be aimed at pointing out possible problems in the interpretation of the intensity profiles. Again, the obtained FASTA sequence must be precisely related to the exact sequence under study (but this information can be obtained automatically from previous annotations on the processed samples).
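The programmatic extraction of such a subset can be sketched as follows. This is illustrative Python code based on the rdflib library, under the assumption that the source ontology is available in RDF/XML and expresses its is-a hierarchy with rdfs:subClassOf; the DOID root URI in the commented usage is only an example and not a prescription of the described tool.

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def extract_subset(source_owl, root_uri, subset_owl):
    """Copy the is-a descendants of root_uri (labels and subclass links only)
    into a much smaller OWL/RDF file that the tagging tool can load in place
    of the full ontology."""
    full = Graph()
    full.parse(source_owl, format="xml")          # e.g. a full DOID release

    root = URIRef(root_uri)
    # transitive closure of rdfs:subClassOf towards the root (its descendants)
    keep = set(full.transitive_subjects(RDFS.subClassOf, root))
    keep.add(root)

    subset = Graph()
    for cls in keep:
        for label in full.objects(cls, RDFS.label):
            subset.add((cls, RDFS.label, label))
        for parent in full.objects(cls, RDFS.subClassOf):
            if parent in keep:
                subset.add((cls, RDFS.subClassOf, parent))
    subset.serialize(destination=subset_owl, format="xml")

# Hypothetical usage: keep only a disease branch of DOID.
# extract_subset("doid.owl",
#                "http://purl.obolibrary.org/obo/DOID_1781",
#                "doid-subset.owl")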
The final diagnosis can be tagged with standard notations for the mutations possibly found in the sample [29]. Although no ontological information is apparently embedded here, the mutation code can be related (outside the context of the software framework used) to other ontological models associated with the mutation. Even this last kind of annotation can further enable an ontology-assisted search of the collected knowledge base.
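For instance, a final report might end up carrying a small, coordinated set of annotations of the following kind; the values are purely illustrative and anticipate the examples of Sect. 6.

# Purely illustrative tag set for one final-report document: a disease term
# taken from an ontology, plus free (non-ontological) gene/exon and mutation codes.
final_report_tags = [
    {"term": "thyroid carcinoma",                  "ontology": "DOID"},  # disease/histotype
    {"term": "BRAF Exon-15",                       "ontology": ""},      # gene/exon of interest
    {"term": "BRAF (Ex. 15) c.1799T>A p.V600E",    "ontology": ""},      # mutation code
]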
5 Management and Classification within an ECM
The proposed solution for the integration of management and annotation tasks for the target documents is based on a standard document management system (DMS), which typically comes as a core component of a larger Enterprise Content Management (ECM) suite. In other words, a properly customized ECM can be used as the backbone of a LIMS supporting the genotyping process. Currently, ECMs support metadata handling in various fashions, but not semantic tagging, at least in the form described in this paper. This means that the chosen ECM must be enhanced with the semantic tagging functionality, which should complement those already present. In fact, ordinary free-tag annotation misses a couple of crucial benefits: i) annotation precision/unambiguity, and ii) the possibility to exploit the knowledge embedded in the dense network of relations present among ontology terms. Of course, the annotation support must prevent a free keyword from being mistaken for a homonymous term from a formal ontology. Such semantic tagging support has been designed and developed for the Alfresco ECM, and further details on it can be found in Subsection 5.2. An ECM is also a good candidate to act as the LIMS core component because it can be made accessible to a heterogeneous set of coworkers, and it can easily interoperate with external applications, e.g. through APIs and/or Web Services. We expect that both the quality and the accessibility of information in repositories for genotyping documents would benefit from the employment of a single, solid, and easily accessible software system.
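As an example of such interoperability, an external application could retrieve documents from the ECM through a standard CMIS interface (cf. the CMIS directives mentioned in the Introduction). The sketch below uses the Apache Chemistry cmislib client; the endpoint URL, credentials and query are placeholders, and the exact repository URL depends on the Alfresco version and configuration.

from cmislib import CmisClient

# Hypothetical CMIS endpoint and credentials of the laboratory ECM.
client = CmisClient("http://ecm.example.org/alfresco/cmisatom", "user", "password")
repo = client.defaultRepository

# CMIS defines a SQL-like query language; cmis:name is a standard property.
results = repo.query("SELECT * FROM cmis:document WHERE cmis:name LIKE '%pyroseq%'")
for doc in results:
    print(doc.getName())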
5.1 Organizing an ECM as a Laboratory Information Management System
Our proposal for a Laboratory Information Management System relies on a properly shaped DMS, whose repository contains all the documents that get created during analyses over samples. To access the repository, we can plainly leverage the DMS user interface. Modern DMSs are web-based applications, accessible (upon authentication) from distributed web clients; we assume that, using web clients on all the computers alongside the laboratory instruments, it is possible to upload the output documents directly into the DMS repository. If this operation cannot be done in place, document files must somehow be moved to nearby computers to perform the required uploads. Our prototype is based on Alfresco ECM [1], a popular open source Enterprise Content Management system. The Alfresco framework integrates modules
devoted to Document Management and Web Content Management. It is a web-based application that relies on Java technologies and also makes widespread use of JavaScript components. Its modular architecture makes it suitable to be customized and specifically shaped to accommodate our requirements. In Alfresco, the principal means to organize documents is via spaces, a concept similar to that of folders, but in a web-accessible setting; spaces can be managed by multiple collaborating actors working through different web clients. Spaces can also be characterized at a behavioral level and, specifically, rules can be added to manage content being inserted, edited in, or removed from a space. One of the actions a rule may trigger is the activation of a functionality, or aspect, for documents in the space that satisfy a given condition; typical aspects in Alfresco are known as “versionable”, “taggable” or “classifiable”. Spaces can be used in the LIMS to provide a guiding structure for the document management and annotation tasks. The idea is to simply apply the classical folder-based classification in the archiving job, to avoid wasting time in attaching tags with “ordinary” values. Thus, in the first place the operator is guided to perform a folder-based classification of the uploaded file. In other words, the document has to be placed in a specific folder aimed at collecting all the material which must be annotated according to a precise term. So, a hierarchy of spaces must be defined to receive the uploaded files; we will call it the host hierarchy. It mimics a tree-shaped classification for certain criteria. E.g., each first-level space may correspond to the anatomical entity the related sample originates from, the second level to a specific test requested on the sample, and so on. Automatic procedures can be triggered by the insertion into such spaces so that precise tag values are attached to the document; this way, even if the file is subsequently moved, this specific classification is not lost. Practically, the initial folder-based classification induces the automatic annotation with the most straightforward terms. After this initial basic step, the user must be able to annotate the document with multiple other tags, and we advocate the opportunity to use semantic ones. Unfortunately Alfresco, in its current basic configuration, supports only ordinary tagging, not the semantic variant. For this reason, we must implement a modification to enable ontology management.
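The effect of such space rules can be pictured with a simple sketch (illustrative Python, not Alfresco's actual rule mechanism): a configured mapping from positions in the host hierarchy to tag values is applied to every incoming document, so that the initial folder-based classification is turned into explicit tags that survive later moves. All paths and terms below are examples.

# Illustrative mapping from host-hierarchy paths to automatically applied tags.
HOST_HIERARCHY_TAGS = {
    "/Thyroid":                 [("thyroid gland", "anatomical entity")],
    "/Thyroid/BRAF-Exon15":     [("thyroid gland", "anatomical entity"),
                                 ("BRAF Exon-15", "requested test")],
    "/Colon-Rectum/Kras-Exon2": [("colon-rectum", "anatomical entity"),
                                 ("Kras Exon-2", "requested test")],
}

def tags_for_upload(space_path):
    """Return the tags implied by the space a document is uploaded into."""
    return HOST_HIERARCHY_TAGS.get(space_path, [])

print(tags_for_upload("/Thyroid/BRAF-Exon15"))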
5.2 Seamless Integration of Semantic Tagging
The semantic tagging functionality can be introduced in Alfresco in a couple of different ways:
– modifying the behaviour related to handling a specific aspect, namely “taggable”;
– adding a new aspect (say “semantic taggable”) and providing proper handling procedures for it.
Regardless of the choice, which just marginally impacts the actual steps to be followed in performing the operation, we have to decide how to specify and to handle the ontologies we want to consider in our simple semantic-aware environment. The ontology contents must be accessible to the framework and, although
Fig. 8. A graphical layout of the target ontology can be used to help in finding the most correct term to add. In the reported screenshot, client-side JavaScript code is used to display ontology terms as a force-directed graph.
this information could be even obtained remotely, a local copy of the file makes it promptly available. A flexible way to deal with ontologies is to assign them a dedicated place (a space) in the DMS repository, where OWL files can be freely uploaded (or dropped). During this procedure, they are handled just like any other document in the DMS and no specific action has to be performed. The space where the target documents must be initially uploaded is required to hold the aspect used for the introduction of semantic tags (for the sake of simplicity, we suppose it would be the “taggable” one). We must make sure to add a rule to all the spaces in the host hierarchy such that each incoming document will become “taggable”, i.e. extended with the “taggable” aspect. The rationale for setting this requirement is the practical need to guarantee that tagging would always be enabled for the documents to be classified. The ECM web user interface heavily relies on JavaScript modules. This structural feature lets us implement a lightweight support to ontology-aware tagging, which is not intrusive for the whole DMS framework because it requires only some modifications to the behavior of the tagging module. Following this low-impact philosophy, we chose a JavaScript-only approach to both ontology processing and visualization, acting just at the presentation layer of the web application. The starting point for adding a semantic tag is the same as for ordinary tagging, i.e. in the portion of the web interface dedicated to the management of the document properties (known as the “property view” pop-up). This GUI component can be enriched so that, in case of a “taggable” document, a special link (labelled as “Add a tag”) would appear to provide access to a new navigation interface. In the first place, the activated graphical procedure lets the user choose a target ontology; this can be naturally accomplished by means of a combo-box that
reports the list of OWL files uploaded into the dedicated repository space for possible target ontologies. Upon such a selection, a graphical representation of the chosen ontology is shown in the web user interface, and it can be interactively explored for an easy identification of interesting concepts. To this aim, we made use of a force-directed graph layout for an animated visualization, where nodes are ontology concepts and links between nodes represent a parent-child relation, as shown in Fig. 8. Interactive navigation is performed by acting upon the nodes that, once clicked, reveal their direct children (if present). Moreover, upon clicking a node, the corresponding term is selected as a candidate value and the text on the button for attaching the new tag is changed accordingly. Of course, other representations could be proposed to make the graphical navigation more immediate; in any case, the possibility to show entire graph portions clustered into one single node is fundamental to handle very rich ontologies. Finally, when the user has found the entity/concept of interest, he can add the entity name to the document’s tag set. In order to avoid ambiguities, the resulting tag includes both the name of the entity and a reference to the specific target ontology. As we will see in the subsequent section, it is particularly useful to keep a list of the most recently used tag values, to ease the completion of subsequent annotations on other documents. From our experience we can state that the implementation of the proposed solution in Alfresco is painless from both the development and the deployment perspectives. In practice, modifications at the presentation level alone are sufficient, and mostly they just involve JavaScript code. If we decide to add a specific custom aspect, the new components to be inserted partially span the server-side tiers of the system, but do not interfere with the other modules.
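The data handed to such a client-side visualization can be sketched as a plain node-link structure derived from the OWL file. The following illustrative script (again rdflib-based, assuming an RDF/XML serialization and rdfs:subClassOf links, and not the code actually deployed in the prototype) produces the kind of JSON a force-directed layout typically consumes.

import json
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS, OWL

def ontology_to_nodelink(owl_file):
    """Turn named OWL classes and their is-a links into a {nodes, links}
    structure suitable for a force-directed graph layout."""
    g = Graph()
    g.parse(owl_file, format="xml")

    # keep only named classes; anonymous restriction classes are skipped
    classes = [c for c in g.subjects(RDF.type, OWL.Class) if isinstance(c, URIRef)]
    index = {c: i for i, c in enumerate(classes)}

    nodes = [{"id": str(c),
              "label": str(next(g.objects(c, RDFS.label), c))} for c in classes]
    links = [{"source": index[c], "target": index[p]}   # child -> parent edges
             for c in classes
             for p in g.objects(c, RDFS.subClassOf)
             if p in index]
    return {"nodes": nodes, "links": links}

# print(json.dumps(ontology_to_nodelink("doid-subset.owl"), indent=2))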
6 Case Study: Genotyping for Pharmacogenetics-Based Therapeutic Recommendations
The employment of an ECM as the core component of a LIMS and the described semantic tagging procedures have been evaluated taking the daily activity of a medium-sized genotyping laboratory as a reference. The core activity of our target laboratory is the execution of diagnostic tests on samples of neoplastic tissues, to uncover mutations that affect therapeutic treatments. The genotyping result is crucial for the oncologist to give pharmacogenetics-based therapeutic recommendations. Every procedure must be carried out with particular care, and stringent quality control practices are applied. According to the proposed approach, all the sample-related documents obtained throughout the genotyping procedures must be tagged with appropriate, standard description terms. Specifically, semantic annotations have to describe unambiguously the key characteristics of the samples, i.e., the anatomical entity, the histotype of the neoplasm, the analyzed gene or genes and the obtained results (mutation nomenclature or description). Additionally, other tags can be adopted to describe the methods used and the analytical data for every sample. We can briefly list the main types of annotations required in this context, pinpointing the related ontologies useful in our target case:
– tumor type by anatomical entity (e.g., possible values from DOID are “malignant neoplasm of lung”, “malignant neoplasm of thyroid”, etc.);
– annotation of the tumor histotypes (e.g., “mucinous adenocarcinoma of the rectum”, “non small cell lung cancer”, etc.);
– name of the gene of interest (e.g., “BRAF”, “Kras”, “Hras”, “EGFR”, etc.);
– specification of the analyzed exon/exons (e.g., “BRAF Exon-15”, “EGFR Exon-18”, “Kras Exon-2”, etc.);
– specification of the mutation/SNP possibly found (e.g., using the Mutation Nomenclature, “BRAF (Ex. 15) c.1799T>A p.V600E”, “Kras (Ex.12) c.35G>T p.G12V”, etc.).
We should point out that, as already discussed, several standard ontologies exist to meet the presented requirements, and it is up to the laboratory management to decide which specific ones to adopt in the internal practices. A simple test was designed to determine whether the overall information management organization would be suitable for the target setting, and to quantify the overhead introduced by the application of the semantic tagging procedures with respect to an ordinary workflow. The outcomes are summarized in Table 2. The dataset used was pertinent to the main work carried out in the laboratory: it consisted of the whole collection of documents stored on the computers alongside the instruments used in each phase of the genotyping procedure, generated during more than one year of activity. Different document categories have been considered, to understand whether the document type may somehow influence the results. The host hierarchy was organized according to the involved anatomical entity on the first level, and according to the test type classification on the underlying ones. To keep their size manageable, the ontologies used were subsets of larger ones, and one of them was developed on purpose.
Table 2. Timing characterization of the tagging procedure. The results in the upper part refer to operators who were still learning the tool; the values in the lower part refer to the same operators after a brief training period.
                                     D1      D2      D3      D4
Upload time (avg, secs)              27      20      25      24
Tagging time (avg per tag, secs)     32      35      28      29
% tags picked on graph              100%    100%    100%    100%
% tags from “most used” list          0%      0%      0%      0%
Upload time (avg, secs)              15      18      17      13
Tagging time (avg per tag, secs)     12       9      14      11
% tags picked on graph               25%     20%     35%     25%
% tags from “most used” list         75%     80%     65%     75%
Legend: D1: Abs. spectrum image; D2: Sample Features Report; D3: EF Run Report; D4: Pyroseq. Report
The ECM interface was presented to the laboratory personnel, who are actually domain experts. In a first phase, they were asked to upload and tag files belonging to four different document classes, randomly picked from the dataset. It must be underlined that the inspected uploading operation was supposed to start with the choice of the Alfresco space where the document should be added. On this occasion, the operators were constrained to always explore the graphical ontology representation, with no resort to the “most used” tag list. The conditions and constraints set in this first phase were aimed at obtaining a sensible estimate of the upper bound for the operating times, and the results are shown in the first part of Table 2, organized in separate columns according to the involved document type. Later, after a brief training period to get accustomed to the tool, the same people were asked to perform exactly the same operations. This second phase was aimed at obtaining the average timings that can reasonably be expected during the ordinary execution of the genotyping procedures. Inspecting the results in Table 2, no significant difference in timings across document classes becomes evident. By comparing the values obtained in the two phases, we can see that the operators were likely much more familiar initially with the folder-based classification, but the operation of adding a tag soon becomes much quicker and more efficient (this might be due to the availability of frequently used tags; in folder-based classification, the host hierarchy must always be navigated to select the proper space). Although it cannot be directly ascertained from the values in Table 2, we think that a thoughtful, balanced organization of the host hierarchy is decisive to improve the tagging timings, because a preliminary choice of the destination space implicitly cuts off most of the ontologies as not applicable, and thus the subsequent selection of the target ontology becomes easier and quicker. The high percentage of tag reuse (reported in the last row) is obviously determined by the fact that each laboratory operates on a restricted number of domains; in any case, such values can reasonably be obtained whenever operating over a batch of tests within the same domain. The described experience is necessarily limited, because we could rely only on domain experts within the lab (it makes no sense to ask someone to use a domain ontology whose terms are totally meaningless to them). Nevertheless, it clearly shows that semantic tagging is not excessively time-consuming, especially if compared with traditional forms of document classification and archiving. The combined effect of the host hierarchy and the semantic tagging seems to make the classification more effective and efficient. Furthermore, the time spent in tagging is negligible within the overall genotyping test duration (which may take up to 5-6 days), and the overall laboratory throughput should not be affected. On the other hand, the effectiveness of quality control procedures definitely benefits from a precise classification of documents. Tag-based search is a standard functionality supported by Alfresco, and it can be used with the attached terms regardless of whether they belong to a certain ontology. This simple approach misses most of the benefits of semantic search, because it does not take into account the knowledge embedded in the referred ontology. Recent works have tried to depict both qualitatively
and quantitatively the benefits of semantic search [31], especially in specific domains. In our case, a simple extension of the plain tag-based search towards semantic awareness can be obtained by modifying the initial set of tag values in a query, exploiting information from an external reasoner about concepts related to the initial terms. At this point, it is difficult to quantify precisely to what extent this approach could be beneficial to the end-user; more generally, regarding the effectiveness of semantic search in terms of document retrieval metrics, it is commonly believed that further experimentation is needed [16].
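A minimal sketch of such an expansion step is given below (illustrative Python/rdflib code under the same assumptions as the earlier sketches); the expanded term set would then be passed to an ordinary tag-based search.

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def expand_query_terms(owl_file, term_uri):
    """Return the labels of term_uri and of all its is-a descendants, so that
    a query for a general term also matches documents tagged with more
    specific ones (e.g. 'thyroid neoplasm' also retrieves documents tagged
    'trabecular follicular adenocarcinoma')."""
    g = Graph()
    g.parse(owl_file, format="xml")
    concepts = set(g.transitive_subjects(RDFS.subClassOf, URIRef(term_uri)))
    concepts.add(URIRef(term_uri))
    return {str(label) for c in concepts for label in g.objects(c, RDFS.label)}

# Hypothetical usage, reusing the tag store sketched in Sect. 4.1:
# expanded = expand_query_terms("doid-subset.owl",
#                               "http://purl.obolibrary.org/obo/DOID_1781")
# documents = documents_tagged_with(sorted(expanded))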
7 Conclusions
The application of semantic approaches to shape the way biomedical data are handled and exploited can be recognized as a growing trend of recent years. Such an approach can be pushed further, and also integrated in a seamless way with an overall framework for the management of documents produced during laboratory activities. In this paper, stemming from experience on the development of a prototype system targeted at the management and annotation of genotyping-related documentation, an ECM is proposed as the main component of a semantically-enabled LIMS. Such a solution employs a traditional DMS, and the required modifications can be easily implemented, because they do not interfere with the core system components. From a functional standpoint, it is crucial to leave the user free to choose any possible ontology (both upper- and domain ones) for selecting precise terms for document annotation, and this requirement has been well accommodated. Moreover, ordinary tagging can go along with semantic tagging, and the flexibility of the tagging tool makes it usable in the context of the different operating procedures adopted within a laboratory. Nonetheless, once a semantically annotated document base is available, it can also be exploited with the help of the knowledge embedded in the used ontologies, possibly making use of specific external reasoning engines. The availability of open source ECMs can represent another key factor for the success of the proposed approach, paired with the fact that experimentation has shown that it does not introduce significant overhead in the ordinary test progression workflow. We thus believe that the adoption of a well-structured, semantics-enabled document management system like the one described in this paper could effectively contribute to the day-to-day operations biologists are involved in, building up an ever-growing, valuable, semantically searchable knowledge base. Future developments will try to put in place mechanisms in the ECM to make the internal tagged document base available to external systems, which can resort to effective reasoners to fully exploit the information potential inside the document repository. Such mechanisms are also expected to lay the foundations for a possible federation of semantic document bases from separate laboratories.
Acknowledgments. The main acknowledgement is due to Jacopo Viotto, who collaborated on the first version of the described system and on a previous paper on the topic. Alessandro Spagnesi provided the first raw drafts of the BPMN diagrams. Special thanks for their kind collaboration are due to the personnel of the biomolecular analysis laboratory of the Division of Anatomical Pathology IV of Azienda Ospedaliera Pisana, Pisa, Italy.
References
1. Alfresco website, http://www.alfresco.com
2. MeSH - Medical Subject Headings, http://www.nlm.nih.gov/mesh/
3. Altman, R.B., Bada, M., Chai, X.J., Carrillo, M.W., Chen, R.O., Abernethy, N.F.: RiboWeb: An ontology-based system for collaborative molecular biology. IEEE Intelligent Systems 14(5), 68–76 (1999)
4. Aranguren, M.E., Bechhofer, S., Lord, P., Sattler, U., Stevens, R.: Understanding and using the meaning of statements in a bio-ontology: recasting the Gene Ontology in OWL. BMC Bioinformatics 8, 57 (2007)
5. Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, M., Davis, A., Dolinski, K., Dwight, S., Eppig, J.: Gene Ontology: Tool for the unification of biology. Nature Genetics 25, 25–29 (2000)
6. Bechhofer, S., van Harmelen, F., Hendler, J., et al.: OWL Web Ontology Language reference (2002)
7. Bechini, A., Tomasi, A., Viotto, J.: Collaborative e-business and document management: Integration of legacy DMSs with the ebXML environment. In: Interdisciplinary Aspects of Information Systems Studies, pp. 287–293. Physica-Verlag HD, Heidelberg (2008)
8. Bechini, A., Tomasi, A., Viotto, J.: Enabling ontology-based document classification and management in ebXML registries. In: Proceedings of ACM SAC, pp. 1145–1150. ACM, New York (2008)
9. Bechini, A., Viotto, J., Giannini, R.: Smooth introduction of semantic tagging in genotyping procedures. In: Khuri, S., Lhotská, L., Pisanti, N. (eds.) ITBAM 2010. LNCS, vol. 6266, pp. 201–214. Springer, Heidelberg (2010)
10. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (2001)
11. Blake, J.: Bio-ontologies - fast and furious. Nature Biotechnology 22(6), 773–774 (2004)
12. Bojars, U., Breslin, J.G., Peristeras, V., Tummarello, G., Decker, S.: Interlinking the social web with semantics. IEEE Intelligent Systems 23(3), 29–40 (2008)
13. Choy, D., Brown, A., McVeigh, R., Müller, F.: OASIS Content Management Interoperability Services (CMIS) Version 1.0 (2010)
14. Deus, H.F., Stanislaus, R., Veiga, D.F., Behrens, C., Wistuba, I.I., Minna, J.D., Garner, H.R., Swisher, S.G., Roth, J.A., Correa, A.M., Broom, B., Coombes, K., Chang, A., Vogel, L.H., Almeida, J.S.: A semantic web management model for integrative biomedical informatics. PLoS ONE 3(8), e2946 (2008)
15. Ding, L., Finin, T.W., Joshi, A., Peng, Y., Pan, R., Reddivari, P.: Search on the semantic web. IEEE Computer 38(10), 62–69 (2005)
16. Dong, H., Hussain, F.K., Chang, E.: A survey in semantic search technologies. In: Proc. of DEST 2008, 2nd IEEE Int'l Conf. on Digital Ecosystems and Technologies, pp. 403–408 (2008)
17. Donofrio, N., Rajagopalon, R., Brown, D.E., Diener, S.E., Windham, D., Nolin, S., Floyd, A., Mitchell, T.K., Galadima, N., Tucker, S., Orbach, M.J., Patel, G., Farman, M.L., Pampanwar, V., Soderlund, C., Lee, Y.-H., Dean, R.A.: 'paclims': A component LIM system for high-throughput functional genomic analysis. BMC Bioinformatics 6, 94 (2005)
18. Fong, C., Ko, D.C., Wasnick, M., Radey, M., Miller, S.I., Brittnacher, M.J.: Gwas analyzer: integrating genotype, phenotype and public annotation data for genome-wide association study analysis. Bioinformatics 26(4), 560–564 (2010)
19. Hadzic, M., Chang, E.: Medical ontologies to support human disease research and control. International Journal of Web and Grid Services 1(2), 139–150 (2005)
20. Huang, Y.W., Arkin, A.P., Chandonia, J.-M.: WIST: toolkit for rapid, customized LIMS development. Bioinformatics 27(3), 437–438 (2011)
21. Jayashree, B., Reddy, P.T., Leeladevi, Y., Crouch, J.H., Mahalakshmi, V., Buhariwalla, H.K., Eshwar, K.E., Mace, E., Folksterma, R., Senthilvel, S., Varshney, R.K., Seetha, K., Rajalakshmi, R., Prasanth, V.P., Chandra, S., Swarupa, L., SriKalyani, P., Hoisington, D.A.: Laboratory information management software for genotyping workflows: applications in high throughput crop genotyping. BMC Bioinformatics 7, 383 (2006)
22. Jensen, L.J., Bork, P.: Ontologies in quantitative biology: A basis for comparison, integration, and discovery. PLoS Biology 8(5), e1000374 (2010)
23. Kohl, K., Gremmels, J.: Documentation system for plant transformation service and research. Plant Methods 6(1), 4 (2010)
24. Kothari, C.R., Wilkinson, M.: Structured representation of biomedical experiments: A bottom-up approach. In: Proceedings of Int'l Conf. on Information and Knowledge Engineering (IKE), pp. 199–204. CSREA Press (2008)
25. Kumar, A., Smith, B.: Oncology ontology in the NCI thesaurus. In: Miksch, S., Hunter, J., Keravnou, E.T. (eds.) AIME 2005. LNCS (LNAI), vol. 3581, pp. 213–220. Springer, Heidelberg (2005)
26. Le Hellard, S., Ballereau, S.J., Visscher, P.M., Torrance, H.S., Pinson, J., Morris, S.W., Thomson, M.L., Semple, C.A.M., Muir, W.J., Blackwood, D.H.R., Porteous, D.J., Evans, K.L.: SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Research 30(15), e74 (2002)
27. Li, J.-L., Deng, H., Lai, D.-B., Xu, F., Chen, J., Gao, G., Recker, R.R., Deng, H.-W.: Toward high-throughput genotyping: Dynamic and automatic software for manipulating large-scale genotype data using fluorescently labeled dinucleotide markers. Genome Res. 11(7), 1304–1314 (2001)
28. Monnier, S., Cox, D.G., Albion, T., Canzian, F.: T.I.M.S: TaqMan Information Management System, tools to organize data flow in a genotyping laboratory. BMC Bioinformatics 6, 246 (2005)
29. Olivier, M., Petitjean, A., Teague, J., Forbes, S., Dunnick, J., den Dunnen, J., Langerod, A., Wilkinson, J., Vihinen, M., Cotton, R., Hainaut, P.: Somatic mutation databases as tools for molecular epidemiology and molecular pathology of cancer: Proposed guidelines for improving data collection, distribution, and integration. Human Mutation 30(3), 275–282 (2009)
30. OMG: BPMN 2.0 specifications (2009)
31. Price, S.L., Nielsen, M.L., Delcambre, L.M., Vedsted, P., Steinhauer, J.: Using semantic components to search for domain-specific documents: An evaluation from the system perspective and the user perspective. Information Systems 34(8), 724–752 (2009)
32. Rubin, D.L., Shah, N.H., Noy, N.F.: Biomedical ontologies: a functional perspective. Briefings in Bioinformatics 9(1), 75–90 (2008)
33. Shah, N., Jonquet, C., Chiang, A., Butte, A., Chen, R., Musen, M.: Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 10(suppl. 2), S1 (2009)
34. Sioutos, N., de Coronado, S., Haber, M.W., Hartel, F.W., Shaiu, W.L., Wright, L.W.: NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of Biomedical Informatics 40(1), 30–43 (2007)
35. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J., Consortium, T.O., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S.-A., Scheuermann, R.H., Shah, N., Whetzel, P.L., Lewis, S.: The OBO foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 1251–1255 (2007)
36. Specia, L., Motta, E.: Integrating Folksonomies with the Semantic Web. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 624–639. Springer, Heidelberg (2007)
37. Strömbäck, L., Hall, D., Lambrix, P.: A review of standards for data exchange within systems biology. Proteomics 7(6), 857–867 (2007)
38. Tanabe, L.K., Wilbur, W.J.: Tagging gene and protein names in biomedical text. Bioinformatics 18(8), 1124–1132 (2002)
39. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., Ciravegna, F.: Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Journal of Web Semantics 4(1), 14–28 (2006)
40. Wohed, P., van der Aalst, W.M.P., Dumas, M., ter Hofstede, A.H.M., Russell, N.: On the suitability of BPMN for business process modelling. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 161–176. Springer, Heidelberg (2006)
MEDCollector: Multisource Epidemic Data Collector

João Zamite¹, Fabrício A.B. Silva², Francisco Couto¹, and Mário J. Silva¹

¹ LaSIGE, Faculty of Science, University of Lisbon, Portugal
² Army Technology Center, Information Technology Division, Rio de Janeiro, Brazil
[email protected]
Abstract. We present a novel approach for epidemic data collection and integration based on the principles of interoperability and modularity. Accurate and timely epidemic models require large, fresh datasets. The World Wide Web, due to its explosion in data availability, represents a valuable source for epidemiological datasets. From an e-science perspective, collected data can be shared across multiple applications to enable the creation of dynamic platforms to extract knowledge from these datasets. Our approach, MEDCollector, addresses this problem by enabling data collection from multiple sources and its upload to the repository of an epidemic research information platform. Enabling the flexible use and configuration of services through workflow definition, MEDCollector is adaptable to multiple Web sources. Identified disease and location entities are mapped to ontologies, not only guaranteeing the consistency within gathered datasets but also allowing the exploration of relations between the mapped entities. MEDCollector retrieves data from the web and enables its packaging for later use in epidemic modeling tools.

Keywords: Epidemic Surveillance, Data Collection, Information Integration, Workflow Design.

1 Introduction
The study of epidemic disease propagation and its control is highly dependent on the availability of reliable epidemic data. Epidemic surveillance systems play an important role in this subject, extracting exhaustive information with the purpose of understanding disease propagation and evaluating its impact on public health through epidemic forecasting tools. International organizations, such as the World Health Organization (WHO), have epidemic surveillance systems that collect infectious disease cases. However, although official disease statistics and demographics provide the most reliable data, the use of new technologies for epidemic data collection is useful to complement data already obtained from national reporting systems.
In recent years, several projects have researched the use of the Web as a platform for epidemic data collection. The systems developed by these projects gather epidemic data from several types of sources [1], such as query data from search engines [2], Internet news services [3] and directly from users [4]. Alternative sources for epidemic data are social networks, e.g. Twitter [5], which are forums where people share information that can be accessed as Web services. These alternative sources of information can be used to identify possible disease cases, or at least provide a glimpse of the propagation of a disease in a community.

The aforementioned systems, based on Web technologies, contribute to the recent increase in available epidemic data. However, this data is neither centralized nor organized so that it can be easily found and shared between scientists and health professionals. Furthermore, these systems use different protocols for data presentation and transmission, and as such there is no unified way to extract and use all of this available data. Therefore, an integrative effort is required to consolidate this data so it can be used in e-science data analysis.

The Epidemic Marketplace (EM) [6], developed by the EPIWORK project, is an information platform for the integration, management and sharing of epidemiological data. EM stores data derived from Internet monitoring systems and is designed to interoperate with a computational modeling platform. It also provides a forum for the discussion of epidemic datasets and epidemic modeling issues. This framework, comprising information and computational platforms, will provide tools for data management, epidemiological modeling, forecasting and monitoring (Fig. 1). The EM is composed of several modules which provide the functionalities of the information platform. These include a digital repository of epidemic datasets and other resources, and mediator services which enable applications to interact with the digital repository. To address the need for data integration from multiple sources, the EM also includes MEDCollector, a workflow design system for the extraction and integration of epidemic data from multiple Web sources (Fig. 2).

MEDCollector follows a workflow design approach to the extraction and integration of data from multiple heterogeneous Web sources. Workflow definitions enable the declaration of collection mechanisms using web services (Fig. 3). Through workflows, extraction mechanisms remain flexible, allowing new functionality to be added as services that change the way these mechanisms behave. Since workflow design tools are typically complex and targeted at users with specific technical workflow knowledge, such as business analysts, there is a need for abstractions and simplifications of the process of creating workflows. This can be achieved through a simple workflow design interface that enables non-technical users to create the necessary data harvesting mechanisms.

Following this approach, this paper describes the development of MEDCollector, a system for information extraction and integration from multiple
Fig. 1. The main components of the EPIWORK project and how they interact to study epidemics. Figure extracted from http://www.epiwork.eu/
heterogeneous epidemiological data sources using workflows. MEDCollector enables the flexible configuration of epidemic data collection from multiple sources, using interoperable services orchestrated as workflows. Collected data can then be packed into datasets and uploaded to EM's repository for later use by epidemic modeling tools. Through the use of Web standards for data transmission, the system can be seamlessly integrated with external web services to extend its basic functionality. This system gathers and integrates data from multiple and heterogeneous sources, providing epidemiologists with a wide array of datasets obtained from the Web using its own data services, in addition to traditional data sources.

This paper is an extended version of the conference paper with the same title presented at ITBAM'10 [7]. It extends MEDCollector to fulfill the requirements of data provenance and data packaging and also includes an evaluation of the system by comparison with other scientific workflow systems.

The remainder of the paper is organized as follows: Section 2 provides insight into previous related work; Section 3 is an assessment of the system requirements for an epidemic data collector; Section 4 presents our system's architecture and a brief description of its implementation; Section 5 compares MEDCollector with two other workflow design software systems, Taverna and Kepler; Section 6 presents the conclusions and perspectives for future work on MEDCollector.
Fig. 2. Overview of the “Epidemic Marketplace” and the several modules that compose it
2 Related Work
Over the past few years, with the explosion of the World Wide Web, online data availability has increased significantly. With this increase, new systems for browsing, harvesting and retrieving this information have surfaced. These systems have taken advantage not only of the topology of the Web but also of the large user base that it represents. For instance, DROP [8], a system for harvesting and storing online publications, not only includes a crawler module that collects publications actively from the web, but also accepts direct publication submissions to the system.

The development of middleware and networking technologies for tasks such as data acquisition, integration, storage, management, mining and visualization enables the use of web, computational and information technologies to achieve scientific results, or e-science. E-science, through the provision of scientific environments, allows global collaboration by enabling universal access to knowledge and resources.
Fig. 3. Example of a workflow to extract messages from Twitter, text-mine them, and insert the extracted information into a database
Taverna, developed by the myGrid team [9][10], is an open source workflow management system commonly used in bioinformatics that allows users to design and execute scientific workflows. Taverna enacts workflows as a series of connected web services, enabling the creation of in silico experiments with data available on the Web. Kepler [11] is also an open source scientific workflow system. However, unlike Taverna, Kepler's functionalities are not dependent on web services but on actors, which do not depend on Web servers. Together with Taverna and Kepler, other initiatives such as EGEE and DEISA [12] are bridging the gap between the need for computation tools and their seamless integration through the use of standards and interoperable software. The use of workflow environments enables the creation of mechanisms to transform, structure and integrate data from multiple sources, originally in different formats and dealing with multiple concepts, through adequate services which can be orchestrated for that purpose.

The Web presents a valuable source for collecting epidemic data, but it requires coping with a variety of formats, ranging from free text to XML documents. Disease reporting services, like the ProMED-mail newsletter [13], EuroFlu and reports from the European Centre for Disease Prevention and Control (ECDC) [14], are useful sources of epidemiological data. The ProMED-mail newsletter, maintained by the International Society for Infectious Diseases, is a notification service that sends its registered users information about new disease outbreaks and cases via e-mail. EuroFlu.org, a WHO website, and the ECDC's European Influenza Surveillance Network (EISN) [15] publish weekly reports on the activity of influenza-like diseases.

Internet Monitoring Systems (IMS) can retrieve data using two distinct approaches: passive data collection and active data collection. Systems that
use passive data collection mechanisms, such as Gripenet [4] and Google Flu Trends [2], provide interfaces for their users to voluntarily submit their data. On the other hand, active data collection systems, such as Healthmap [1] and the Global Public Health Intelligence Network (GPHIN) [3], use crawlers that browse the Web through hyperlinks and available Application Programming Interfaces (APIs).

Gripenet is a Portuguese IMS which depends directly on the active participation of its voluntary users, who receive weekly newsletters about influenza activity and are requested to fill out a form about the presence, or not, of influenza symptoms during the past week. This system was based on Holland's Influenzanet [16] model and is currently implemented in seven other countries: Belgium, Italy, Brazil, Mexico, the United Kingdom, Australia and Canada.

Google Flu Trends is a system that uses search terms from user queries to the Google search engine to estimate influenza activity each week. Google Flu Trends has been shown to predict influenza activity up to two weeks ahead of the official sources for the North American population. This system is currently being extended to cover other countries around the world. Both Google Flu Trends and the previously mentioned IMS collect data directly from their users; the difference is that the data Google Flu Trends receives from its users is not directly related to disease activity. Google Flu Trends correlates a number of search terms with increases in disease activity and then uses them to estimate real activity values.

Healthmap [1] takes a different approach. It is a worldwide epidemic data presentation website that depicts disease cases, mostly of contagious diseases, gathered from different sources. These sources can be diverse in nature, ranging from newscasting services to official epidemic reports, and have different degrees of reliability. Disease and location information is extracted via a text processing system and presented on a map via the Google Maps API.

GPHIN [3] is a multilingual early warning system for global public health surveillance. It gathers reports of public health significance from news-feed aggregators. GPHIN then filters and categorizes the retrieved reports through an automated process which is complemented by human analysis. An algorithm attributes a relevance score to each report. Reports with a high relevance score are assigned a category, while reports with a low relevance score are considered "trash". Reports with a relevance score between the two thresholds are reviewed and classified through human analysis. To aid in this task, non-English articles are machine translated into English. The system then makes the categorized reports available to users in a variety of languages through machine translation.

Using interoperable services which are orchestrated as workflows, MEDCollector enables the flexible configuration of epidemic data collection from multiple sources. Collected data can then be packed into datasets for later use by epidemic modelling tools. Through the use of Web standards for data transmission, the system can be seamlessly integrated with external Web services to extend its basic functionality.
MEDCollector not only gathers but also integrates data from multiple heterogeneous sources, providing epidemiologists with a wide array of datasets obtained from the Web using its own data services, in addition to the traditional data sources.
3 Epidemic Data Collector Requirements
An epidemiological data collector should follow a set of principles and requirements that enable extensible data collection and the creation of consistent, integrated datasets, while coping with the heterogeneity associated with its sources. These requirements can be functional, and therefore related to how the system is used, or non-functional, relating to the quality of the system.

3.1 Functional Requirements
Active Data Collection. Several Web sources, such as Twitter, currently expose a variety of APIs and URIs. Through the creation of mechanisms to access these APIs it is possible to actively collect their data. This enables data collection from sources like Twitter, Google Flu Trends and EISN reports. Depending on the source, the harvesting mechanism collects an entire message containing the name of a disease for further processing, or harvests epidemiological estimates known to be published at the defined source.

Passive Data Collection. While using APIs specifically designed for data extraction is an efficient way to harvest data from several sources, some sources do not provide them, and active data collection also raises issues related to the number of requests made to external services to keep data updated, such as limitations on the query rate. An approach to these problems is the exposure of passive collection interfaces, which do not actively search for new data but wait for data to be submitted directly to them. One such example is an e-mail address, monitored by an application, through which the system can receive news and disease alert e-mail subscriptions (e.g. ProMED-mail). Data received by passive data collection mechanisms requires structuring before being integrated and loaded into the system.

Local Storage. Different data sources have variable data availability times, and data may only be available for some time period at certain sources, if at all. An approach to the problems of volatile data and the temporal disparity of data sources is to store all the retrieved data in a local dedicated relational database.

Ontology Referencing. Enables the use of controlled vocabularies when referencing entities in the spatial and health domains. The use of ontologies enables the disambiguation of named entities, the mapping of entities with multiple references across data sources, and the establishment of hierarchical
relationships between entities. This hierarchy becomes particularly relevant when using geographic referencing. For instance, with the support of a geographic ontology, we can relate cities with their respective countries. This enables the aggregation of data defined at specific levels to higher levels, e.g. disease cases identified in Lisbon can be used in Portugal's domain.

Use of Web Standards. Data transmission on the Web requires the use of Web standards, such as XML, enabling machines to parse its contents and process it accordingly. This enables the use of transmitted data independently of the devices or technologies used to access them.

Data Provenance. So that data can be validated for use in epidemic studies, the epidemiologist must not only be able to know the source of the collected data but also be able to see and evaluate the mechanisms, or workflows, which harvested the data. This way a user can restrict the data he uses in his research to that produced by the harvesting mechanisms he knows to provide the most suitable data for that purpose.

Scheduling of Collection Activities. One of the challenges in data collection is the limit, imposed by the Web source, on the number of queries that can be performed during a period of time. To cope with this problem, the data collection system should be able to define when data collection mechanisms are active on their sources and to prioritize which queries are performed at a given time. This improves the system's performance by increasing the frequency of queries that provide the greatest amount of data while reducing the frequency of queries that provide little to no data.

Data Packaging. An epidemic data collection system is only useful to the epidemiological community if it can be used to provide data for epidemic modelling and forecasting tools. For this purpose the data collection system should enable its users to query the collected data and create consistently packaged datasets.

3.2 Non-functional Requirements
Modularity and Configurability. An epidemic data collector that retrieves data from the Web requires a degree of flexibility in order to cope with changes or additions to its sources. By adopting a Service-Oriented Architecture (SOA) [17], the system has its functionality distributed through discrete units, or services. SOA is implemented on the Web as HTTP-based RESTful services, or based on SOAP and WSDL, which are considerably more complex. The use of standards permits unequivocal transmission of information by structuring the data clearly. Workflows, or orchestrations as they are called in SOA, enable the design of data flow sequences between the different services. Configurable workflows enable the reconfiguration and addition of new services whenever necessary by
defining how services are interconnected and how information is transmitted between them [18]. The Business Process Execution Language (BPEL) [19] is an XML-based language that describes the interactions between services, therefore allowing the definition of workflows. Each BPEL process is itself a web service. The BPEL process is described in XML and can be interpreted and executed by a BPEL engine. WSBPEL 2.0 is an OASIS standard that enables the definition of BPEL processes for web services under a SOA architecture.

Interoperability. Two types of interoperability are required. First, interoperability between services within the system: by defining services with a set of configurable inputs and outputs, based on Web standards, a high level of interoperability is possible, improving the flexibility of workflow creation. Second, interoperability between the system and other applications, such as epidemic forecasting tools: by enabling this type of interoperability, the system becomes more useful from an e-science perspective, as it enables knowledge sharing between different systems.

Performance. Epidemic surveillance systems need to retrieve data from a large number of sources. Furthermore, epidemic data sources and news systems provide data frequently, sometimes daily, and social networks provide it in near real time. An epidemic data collection system should be able to cope with this large amount of available data.

Fault Tolerance. Web services can often fail for a number of reasons, such as time-outs (due to large numbers of accesses) or request limits. To make a workflow reliable in light of these problems, it should be able to treat and respond to faults, e.g. treating a service fault resulting from rate-limiting measures and waiting the required amount of time before accessing the Web service again.
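To make the fault-tolerance requirement concrete, the following is a minimal PHP sketch (PHP being the language the system's services are written in) of an invocation that waits and retries when a source answers with HTTP 503; the function name, attempt limit and wait time are illustrative assumptions, not part of MEDCollector's actual code.

<?php
// Illustrative sketch: call a Web service and retry with a delay when it is
// rate limited (HTTP 503), as discussed above. Wait times are arbitrary.
function invokeWithRetry($url, $maxAttempts = 3, $waitSeconds = 60)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($code == 200) {
            return $body;                 // success
        }
        if ($code == 503 && $attempt < $maxAttempts) {
            sleep($waitSeconds);          // service unavailable: wait and retry
            continue;
        }
        break;                            // other faults are passed to the caller
    }
    return false;
}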
4 Architecture and Implementation
This section describes the design principles behind MEDCollector and its software implementation. MEDCollector implements the requirements detailed in Section 3. It is inspired by the "Data Collector" [20], an initial prototype for the collection of messages from Twitter containing disease and location names. The architecture of MEDCollector is represented in Figure 4. The main system components are:

– Dashboard. A web-based front end that provides user-interface capabilities to the system, enabling the user to define workflows for data harvesting and packaging.
Fig. 4. MEDCollector’s basic architecture
– Workflow Repository. Stores workflows designed through the Dashboard.
– Process Builder. Converts designed workflows to BPEL processes describing SOA orchestrations.
– BPEL Processes. Processes that run on a BPEL engine and orchestrate communications between multiple services, both internal and external, to perform data collection or data packaging according to the workflow definition.
– Internal Services. Provide basic system functionalities and interact with the MEDCollector Repository.
– External Services. Available on the Web; potentially extend the functionality of MEDCollector.
– MEDCollector Repository. Stores all the data collected by the system.

Users design data collection mechanisms, which are stored in the Workflow Repository. These mechanisms are processed into BPEL processes which orchestrate communication between services, both internal and external, performing epidemic data collection. All data collected by these processes is stored in the MEDCollector Repository through an appropriate service.
This section is further organized into the following subsections:

– 4.1 Data Storage - describes the logical structure of the MEDCollector Repository.
– 4.2 Ontologies - explains the ontologies used to reference entities in MEDCollector and how they are used.
– 4.3 Dashboard - explores the implementation of the graphical user interface used for the design of workflows.
– 4.4 BPEL Processes - describes the creation of a BPEL process from the initial design of the graphical workflow.
– 4.5 Services - describes the several services currently implemented in MEDCollector.

4.1 Data Storage
All the data collected by the Multisource Epidemic Data Collector is stored in the MEDCollector Repository to implement the requirement of local storage identified in Section 3. The choice of data sources is reflected in the structure and principles of data storage. MEDCollector's target sources differ greatly in the structure of their data as well as in the information contained in them. It focuses on extracting disease occurrences, incidences, estimates and messages in their geographical context. Since this data differs across sources, the storage structure must be able to capture this heterogeneity. An initial prospection of data sources revealed three main groups of possible epidemiological data sources:

– Social Network Services, such as Twitter, where people freely share information about themselves, including their illnesses.
– Epidemiological Surveillance Services, such as EISN, ProMED-mail, etc., which present very heterogeneous data. Each source has different data structures and formats.
– News Services, such as Google News, which provide RSS feeds and newsletters containing news relating to specific domains and locations.

The MEDCollector Repository stores both the data collected from the Web and data collection schedules. It is implemented as a MySQL relational database. For clarity in the description of this repository's implementation, we present it as storage for two types of data: Case Data and Scheduling Data.

Case Data

The collected data is stored in the repository in a relational database. The central entity is a Case, and it is related to an entry in each of the following entities:

– Disease, containing names of diseases and a concept unique identifier (CUI) that identifies that disease in the Unified Medical Language System (UMLS) [21].
– Location, containing data about the locations monitored by the system, including a geonameid which identifies that location in the GeoNames ontology [22].
– Source, containing the sources used in MEDCollector, referenced by a URL and, in some cases, the update period of that source.
– Process, containing data about the workflow which generated the case. This is relevant for data provenance analysis, so that epidemiologists can evaluate which workflows provide the best data for their specific uses.

Besides these entities, the compound key for a Case also includes the attributes Date and Author. The attributes value, unit and type specify the type of case that was extracted, the value of that case and the unit used to express it: e.g. from Twitter the system extracts 1 message, while from Google Flu Trends the system can extract disease case estimates such as 3 estimated cases per 100,000 population.

Scheduling Data

The schedule of data harvesting queries has an organization identical to that of the case data, with harvesting events as the central entity, related to the same entities with the exception of the Process entity (Fig. 5(b)). Scheduling Data is used by a priority-based query selection service (see Subsection 4.5) to define which queries should be executed first and at which sources. Each week a background application updates the period value of each disease-location-source triple according to the case entries of the previous month (a sketch of this rule is given at the end of this subsection):

– Schedule with a Daily period: every triple with more than 1 entry in the previous week.
– Schedule with a Weekly period: every triple with more than 1 entry in the previous two weeks and 1 or fewer entries in the previous week.
– Schedule with a Fortnightly period: every triple with more than 1 entry in the previous month and 1 or fewer entries in the previous two weeks.
– Schedule with a Monthly period: every triple that does not fit the criteria mentioned above.

The disease, location and source entities in the repository are accessible through a series of services, through which these entities can be selected or inserted. The database currently includes all countries in the world and their capitals, as well as a set of 89 infectious diseases.
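As an illustration of the weekly period update rule listed above, the following is a minimal PHP sketch; the function name and the way entry counts are obtained are hypothetical and not part of MEDCollector's actual code.

<?php
// Hypothetical sketch of the weekly period update rule described above.
// $lastWeek, $lastTwoWeeks and $lastMonth are the number of case entries
// recorded for one disease-location-source triple in each time window.
function updatePeriod($lastWeek, $lastTwoWeeks, $lastMonth)
{
    if ($lastWeek > 1) {
        return 'daily';
    } elseif ($lastTwoWeeks > 1) {
        return 'weekly';       // more than 1 entry in two weeks, 1 or fewer last week
    } elseif ($lastMonth > 1) {
        return 'fortnightly';  // more than 1 entry in a month, 1 or fewer in two weeks
    }
    return 'monthly';          // every other triple
}

// Example: a triple with 3 entries last month but none in the last two weeks.
echo updatePeriod(0, 0, 3);    // prints "fortnightly"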
4.2 Ontologies
The spatial and health-domain entities identified in the collected data must be referenced against a controlled and non-ambiguous vocabulary in order to create a common unequivocal language for all the system users. Ontologies are ideal for this type of problem since they provide a community reference,
Fig. 5. UML class diagram of the MEDCollector Repository: (a) Case Data; (b) Scheduling Data
improving knowledge reuse. This provides a common vocabulary through the use of non-ambiguous references, allowing all the users to use the same terms in relation to the same entities, and the establishment of hierarchical relations between entities. This hierarchy becomes particularly relevant when using geographic referencing. For instance, with the support of a geographic ontology, we can relate cities with their countries. This allows data collected at a particular hierarchical level to be used at superior levels. Therefore, cases identified by MEDCollector are referenced in the spatial and health domains through the use of ontologies.

Unified Medical Language System

The Unified Medical Language System (UMLS) is a collection of controlled vocabularies in the medical sciences which provides a structure for vocabulary
mapping, allowing cross-translation of terms between terminologies [21]. It can also be seen as a multipurpose ontology of biomedical terms. Its purpose is to be used in biomedical informatics applications. UMLS is composed of three knowledge sources: the Metathesaurus, the Semantic Network and the SPECIALIST Lexicon. The Metathesaurus is a vocabulary database that includes information about biomedical concepts, names and relationships. It is multilingual and is built from several terminologies, such as thesauri, classifications, code sets, lists of controlled terms, biomedical literature, etc. The Semantic Network provides categorization of all concepts represented in the Metathesaurus, and the SPECIALIST Lexicon is a tool to enhance natural language processing. Disease entities in the MEDCollector Repository are referenced to UMLS through a concept unique identifier (CUI) of a Metathesaurus concept categorized as a "Disease or Syndrome" in the Semantic Network.

GeoNames

GeoNames is a database containing over eight million geographical names corresponding to over 6.5 million unique features [22]. Each of these features is categorized into one of nine classes and one of 645 feature codes. GeoNames also includes latitude, longitude, elevation, population, administrative subdivision and postal codes for each of the features represented. Coordinates are represented according to the World Geodetic System 1984 [23]. Each GeoNames feature is represented by a GeonameID and a stable URI which gives access to an RDF description of the feature. The GeoNames Ontology describes feature properties using the Web Ontology Language (OWL). Feature classes and codes are described according to the Simple Knowledge Organization System (SKOS) [24]. GeoNames also provides a RESTful API to access feature properties and relations. Each location in MEDCollector is referenced to a GeonameID. MEDCollector currently includes all countries and capital cities in the world.
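For illustration, the following minimal PHP sketch resolves a location name to a GeonameID through the public GeoNames search web service; the endpoint, parameters and the 'demo' account are assumptions based on the public GeoNames documentation, not MEDCollector's actual code.

<?php
// Hypothetical sketch: look up the GeonameID of a location by name through the
// public GeoNames "search" web service (XML response); a registered username is required.
function lookupGeonameId($locationName, $username = 'demo')
{
    $url = 'http://api.geonames.org/search?' . http_build_query(array(
        'q'        => $locationName,
        'maxRows'  => 1,
        'username' => $username,
    ));
    $xml = simplexml_load_file($url);   // a production service would add error handling
    if ($xml !== false && isset($xml->geoname->geonameId)) {
        return (string) $xml->geoname->geonameId;
    }
    return null;
}

// Example usage: echo lookupGeonameId('Lisbon');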
4.3 Dashboard
The collection of epidemic data requires a system with flexibility and configurability. A recognized approach to this problem is the use of scientific workflows, which provide a declarative way of specifying complex mechanisms, or tasks, such as epidemic data collection mechanisms. Simple tasks can be implemented as local services and Web services, each responsible for a small fragment of functionality. Scientific workflows enable the use of many such simple components to create complex mechanisms by chaining several services in a specific order. This way, a data collection mechanism may be modified by the addition of new services or simply by adjusting parameters of the currently existing services, enabling the system to cope with the addition of new data sources or to add further ways to process the collected data.
The Business Process Execution Language (BPEL) [19] is a workflow design language that uses XML to describe the interaction between services. The BPEL process, corresponding to a designed workflow, is itself a service. One of the difficulties with the use of BPEL lies in the need for methods for creating process definitions by non-technical users [18]. Although graphical notations exist, such as the Business Process Modeling Notation (BPMN) [25], they are complex and require business analysts to design workflows. To overcome this issue MEDCollector has a drag-and-drop user interface, which enables users to design workflows by trial and error (Figure 6). Scientific workflow systems like Taverna were considered, but they currently require users to provide the WSDL definitions of services, and many of these systems also restrict communication protocols to SOAP, not allowing RESTful services. In addition, these systems currently do not offer on-browser Web interfaces, requiring users to go through lengthy installation and configuration processes prior to using the software.

WireIt [26] enables the definition of a "Visual Language" that specifies modules, their inputs and outputs, which represent services in MEDCollector. WireIt is an open source JavaScript library for the creation of web wirable interfaces similar to Yahoo! Pipes [27] and uDesign [28]. WireIt uses the Yahoo! User Interface library 2.7.0 [29] for DOM [30] and event manipulation and is compatible with most web browsers. It is also bundled with a single-page editor that enables the definition of workflows through a wirable interface.
Fig. 6. Global view of the Web Interface implemented using WiringEditor and description of its components
Using the WireIt library, and based on the bundled single-page editor, a user interface was developed for MEDCollector. This interface is composed of the following sections (Figure 6):
– a menu, which enables users to save, load and delete workflows;
– a properties section, which enables users to configure the scheduling parameters of a workflow, its name and description;
– a modules section, which enables the user to select which services to use in a workflow;
– a workspace area, where the user defines a workflow by drawing connections between service endpoints, dragging graphical "wires" between their inputs and outputs;
– an entities section, which can be used to browse the disease, location and source entities in MEDCollector and to insert new ones.

The Dashboard has three tabs, responsible for different functionalities in the system. The first is intended for data collection from the Web and storage into the local database, while the second is designed for data packaging, enabling the user to query stored data into downloadable datasets. A third tab, composed only of a loading menu and a work area, enables users to view workflows designed by other users without being able to alter them. This allows users to select which workflows to extract information from, to check the provenance of data by analysing the workflows that originated it, or simply to gather design ideas for their own workflows.
Fig. 7. A simple workflow designed in MEDCollector for the collection of data from Google Flu Trends
4.4 BPEL Processes
Workflows designed in the Dashboard are stored in the workflow repository as JavaScript Object Notation (JSON) documents containing the properties of the
workflow, a list of used services, or modules, and a list of wirings between these services. Figure 7 presents an example of a workflow designed in MEDCollector's Dashboard; it is stored as the following JSON document:

{
  "modules": [
    {
      "config": { "position": [113, 63] },
      "name": "ManualQS",
      "value": {
        "disease": "Influenza",
        "location": "Spain",
        "source": "Google Flu Trends"
      }
    },
    {
      "config": { "position": [85, 286], "xtype": "WireIt.InOutContainer" },
      "name": "GoogleFluTrendsHarvest",
      "value": {}
    },
    {
      "config": { "position": [269, 295], "xtype": "WireIt.InOutContainer" },
      "name": "LoadToDB",
      "value": {}
    }
  ],
  "properties": {
    "description": "",
    "name": "WorkflowName",
    "owner": "Username",
    "schedule-time": "",
    "scheduling-type": "Do Not Run"
  },
  "wires": [
    {
      "src": { "moduleId": 0, "terminal": "XMLOutput" },
      "tgt": { "moduleId": 1, "terminal": "input" }
    },
    {
      "src": { "moduleId": 1, "terminal": "output" },
      "tgt": { "moduleId": 2, "terminal": "input" }
    }
  ]
}

These JSON documents are parsed by the Process Builder, which interprets them to create the files necessary to deploy a BPEL process. This process consists of a process descriptor, an XML BPEL process definition and a WSDL document describing it as a service. In the BPEL process, each service is represented as a PartnerLink and the wirings are mapped as a sequence of variable assignments and PartnerLink invocations.

BPEL processes, executed by a BPEL engine, orchestrate communications between services to perform data collection according to the workflow definition. These BPEL processes can have two functions, data collection or data packaging, according to the Dashboard tab they are designed in. A series of fault handling mechanisms is also added to the BPEL processes to recover from failed invocations of services, which can happen for several reasons, including invocation rate-limiting measures implemented by several sources. For instance, when a RESTful Web service returns HTTP code 503, the status code for service unavailable, the process stops and delays its next scheduled execution.

Apache ODE (Orchestration Director Engine) [31] is used to execute the BPEL processes. Apache ODE provides several extensions to standard BPEL engines, including XPath 2.0 support, for easier variable assignments, and an HTTP binding extension that enables direct connection to RESTful Web services. This engine also provides an interface that enables monitoring of currently deployed processes.
4.5 Services
Services represent fragments of functionality, or simple tasks, which can be performed by the workflow system. These can be information collection services, text mining services, transformation services, scheduler services, and others. All services are currently implemented in PHP version 5.1.6. Implemented services are divided into three categories: Data Collection Services, Data Packaging Services and Generic Services.

Data Collection Services

These services are used for the design of data collection workflows. They include several tasks, from the selection of which entities to search for, to retrieving the data, as well as data transformation and loading data into the MEDCollector Repository.
Query Selection Services are services that choose the query (disease, location and source triple) to be made by other services. There are two types of query selection services:

– Priority Based Query Selection: this service uses the schedule data to prioritize the queries. It selects the period and last search date values from the Scheduling Data in MEDCollector's repository, for each disease-location-source triple, and outputs the triple with the highest positive priority value according to the formula

  priority = date - last search date - period

  (a sketch of this selection is given at the end of this block). If there are no positive values, the service sends a fault message that is caught by the BPEL engine, stopping the process and scheduling another run of the service later. These triples can be filtered by source, location or disease, in order to create processes with specific scopes, e.g. influenza in Portugal with Twitter as a source.
– Manual Query Selection: this service enables users to manually define the disease and location to query at a source.

Data Harvesting Services are services specifically designed to gather data through specific APIs or URLs. The Query Selection Services define which queries these harvesting services perform at their sources. Currently two harvesting services are implemented, one that monitors Twitter and one that harvests data from Google Flu Trends:

– TwitterHarvest: this service receives a message from a Query Selection service specifying which location and disease to search for in Twitter. The service uses the Twitter search API to retrieve messages containing the disease name and posted at the location specified by the Query Selection service.
– GoogleFluTrendsHarvest: this service receives a message from the Query Selection service specifying which locations to search for influenza data. Using the country code for that location, it retrieves a text file containing comma-separated values (through the URL http://www.google.org/flutrends/<country-code>/data.txt). It retrieves the values of estimated influenza activity, and the respective dates, from the column containing the location name.

These services structure and transmit the collected cases using an XML schema compatible with other MEDCollector services.

Passive Collection Services receive data posted by disease reporting services and other sources such as e-mail subscriptions. Currently there is a service that checks an e-mail account for messages containing names of diseases and locations. This service checks for messages that have not been read and searches them for all locations and diseases in the MEDCollector Repository. If a match is found, the message is retrieved and referenced to that location and disease. This service also structures and transmits the collected cases using an XML schema compatible with other MEDCollector services.
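As a concrete reading of the priority formula above, the following is a minimal PHP/PDO sketch that selects the pending triple with the highest positive priority; the table and column names (schedule, last_search_date, period_days), the database credentials and the fault and output conventions are assumptions for illustration, not MEDCollector's actual schema or code.

<?php
// Hypothetical sketch of priority-based query selection.
// Assumes a "schedule" table with disease, location, source, last_search_date
// and a period in days; priority = today - last_search_date - period.
$pdo = new PDO('mysql:host=localhost;dbname=medcollector', 'user', 'password');

$sql = "SELECT disease, location, source,
               DATEDIFF(CURDATE(), last_search_date) - period_days AS priority
        FROM schedule
        ORDER BY priority DESC
        LIMIT 1";
$row = $pdo->query($sql)->fetch(PDO::FETCH_ASSOC);

if ($row === false || $row['priority'] <= 0) {
    // No positive priority: signal a fault so the BPEL engine reschedules the process.
    header('HTTP/1.1 503 Service Unavailable');
    exit;
}

// Otherwise return the selected triple to the caller (MEDCollector services
// exchange XML; the element names here are invented for illustration).
header('Content-Type: text/xml');
echo '<query><disease>' . htmlspecialchars($row['disease']) . '</disease>'
   . '<location>' . htmlspecialchars($row['location']) . '</location>'
   . '<source>' . htmlspecialchars($row['source']) . '</source></query>';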
Text-related services include a regular expression matching service that searches strings for patterns, a text mining service and translation services:

– Text Mining services receive a message from harvesting and passive collection services and, through a rule-based approach, try to match sentences of these messages against a number of regular expressions. These regular expressions enable the extraction of case numbers, estimates and deaths. Although this rule-based approach and some of the regular expressions might not be the best approach for text mining in this field, this service serves as a proof of concept for the extraction of further information from text messages in a workflow-oriented data collection system. The development of an appropriate text mining algorithm is beyond the scope of this paper.
– Translation services use the REST services of the Google Language API [32]. There are two services, one that translates a string and another that, given an XML element path, translates the text node inside that element. Both services require the user to specify a desired output language as the two-letter code specified in BCP 47 [33]. The user may also specify the language of the original message; if this parameter is not provided, the service uses the Google Language API to try to identify the message's original language.

Database Loading is done through a service that receives an XML message and performs the corresponding insertion in the MEDCollector Repository. Upon insertion the service returns the XML document back to the caller.

XML Transformation Services enable the transformation of XML documents into a schema compatible with other MEDCollector services. This enables the use of external web services in the sequence flow by transforming their data into compatible data types. There are several services for this task: one transforms RSS 2.0 feeds and another transforms Atom 1.0 feeds. These receive the source name and, optionally, a disease and location. If a disease and location are given, then all the entries in the feed are considered messages pertaining to that disease-location-source triple; otherwise each entry is searched for all disease-location pairs to make that correspondence. Disease and location entities are searched using a binary search algorithm. Another XML Transformation Service, more complex and less user-friendly, requires the user to provide the paths to the relevant parts of the document, enabling the transformation of any other XML document. This is done using XPath notation.

Data Packaging Services

Data Packaging services offer the functionality to be used in workflows for dataset creation as well as dataset upload to EPIWORK's information platform, the Epidemic Marketplace.
Data Query Service. This service has a number of optional parameters enabling the user to query the MEDCollector Repository for the data required for dataset creation.

Data Formatting Services. By default, the queried data is transmitted in the MEDCollector XML schema. Through these services it is possible to convert that XML into JSON and CSV documents. The JSON document keeps the overall hierarchy of the XML document, while the CSV transformation service transforms it into a readable two-dimensional matrix. The CSV transformation service also enables the user to specify which delimiters to use for rows and columns.

Download Service. This simple service enables the user to download the resulting document instead of uploading it to the Epidemic Marketplace.

EM Upload Service. A service which enables users to upload datasets to the Epidemic Marketplace repository instead of simply downloading them. This service has several parameters which can be filled out as metadata for the dataset.

Generic Services

Generic Services are services that can be used in both data collection and data packaging workflow design and are therefore available on both tabs of the Dashboard.

Date Related Services are services that relate to time formats and current dates. The "Date Format" service provides the user with the capability of transforming any date to a specific format. MEDCollector uses "Y-m-d" by default, so this service enables users to create data collection mechanisms that transform dates so they can be used by MEDCollector, or to transform the dates in the extracted datasets into the formats used by their applications. The other date service returns the current date in the provided format ("Y-m-d" by default). This enables the user to provide services with the time at which the process is run. For instance, if an XML document does not have a time reference, this service can be used to provide that reference; likewise, this service can be used to provide the Data Query service with the current time, so that it retrieves data in a period of time related to the date of each run. Formats are specified in PHP notation (see http://php.net/manual/en/function.strftime.php).

XML Manipulation Services provide functionalities such as data structure transformation and access to specific data elements to improve the flexibility of the workflows. These include:
– A GetElement service, which returns the node specified by an XPath query from a given input XML document.
– Merge and Split gates. The first is given two XML documents as inputs and merges them at the root level. The second splits an XML document into two by evaluating a condition on a node from each entry. This condition can be of the types "equals", "contains", ">=", "<=", ">" or "<". The Split gate returns one XML document for the entries evaluated as true and another for the entries evaluated as false.

External Services can be used to provide additional functionalities to the system. SOAP and REST wrapper services are available for this purpose. The SOAP service receives a WSDL URL, an operation name and a set of parameter names and values, which correspond to WSDL message parts. This service sends a SOAP call and returns the response to that call. The REST service receives a URL, a query string and a set of parameter names and values. These are sent through cURL [34], a tool for data transfer with URL syntax, and the response is then returned. If the cURL request returns an HTTP code other than 200, the service generates a fault corresponding to that HTTP code, to be interpreted by the BPEL process.
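As an illustration of the REST wrapper just described, here is a minimal PHP sketch that builds a GET request from a base URL and a set of parameters, sends it through the cURL extension and signals a fault on any status other than 200; the function name, parameter handling and the fault convention are assumptions, not the service's actual code.

<?php
// Hypothetical sketch of the REST wrapper service: forward a GET request built
// from a URL and a set of parameters, return the body, or raise a fault on non-200.
function restWrapper($baseUrl, array $params)
{
    $url = $baseUrl . '?' . http_build_query($params);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code != 200) {
        // Propagate the original status code so the BPEL process can handle the fault.
        header('HTTP/1.1 ' . $code . ' Service Fault');
        exit;
    }
    return $body;
}

// Example: echo restWrapper('http://example.org/service', array('q' => 'influenza'));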
5 Evaluation and Discussion
This section evaluates the system according to the functional and non-functional requirements set out in Section 3. For this purpose MEDCollector is compared with two well-known scientific workflow design programs, Taverna [9] and Kepler [11].

MEDCollector is presented as a workflow design environment with a set of predefined services that provide its basic functionalities according to the requirements previously mentioned. Taverna does not provide any services itself; instead it provides the software to access WSDL-based web services. For this reason, Taverna used the services developed for MEDCollector and therefore presents the same basic functionalities in this evaluation; in the functional analysis, Taverna is thus only compared with MEDCollector with regard to data provenance. Kepler is a scientific workflow design tool for data-intensive processes and provides a large variety of actors that implement data flow functionalities. Kepler is compared with MEDCollector in both the functional and the non-functional analysis.

For this comparison, data harvesting workflows were designed to harvest cases from:

1. CDC Flu Updates RSS Feed - a workflow retrieves the feed and inserts the items into the database as messages about influenza in the United States.
2. CDC Travel Notices RSS Feed - a workflow retrieves the feed, searches the items for disease and location entities and, if these are found, inserts the item into the database.
3. Google Flu Trends - a workflow retrieves influenza estimates and inserts them into the database. Spain was used as the location for the data collection in this workflow.
4. E-mail Messages - a workflow retrieves a message from the e-mail server, searches it for disease and location entities and, if they are found, inserts the message into the database.
5.1 Functional Analysis
In this section we compare the functional aspects of MEDCollector with the scientific workflow design software Kepler. Table 1 presents an overview of the functional analysis. Kepler provides a vast array of functions and services; however, since the scope of this paper is epidemic data collection, this analysis only focuses on workflows designed for that purpose.

Table 1. Functional analysis overview. "Yes" means the functional requirement is fulfilled, and "No" means it is not fulfilled. "Yes*" means the functional requirement is fulfilled but requires more technical knowledge than in MEDCollector.

System        Active D.C.  Passive D.C.  Local Storage  Web Standards  Scheduling  Data Provenance  Data Packaging
MEDCollector  Yes          Yes           Yes            Yes            Yes         Yes              Yes
Kepler        Yes          No            Yes*           Yes*           Yes         No               Yes*
Taverna       -            -             -              -              -           Yes              -
Active Data Collection - MEDCollector provides active data collection functionality through a number of services for specific Web sources or by enabling users to use REST or WSDL Web services as data collection sources. Likewise, Kepler enables the user to use service outputs as data inputs for its workflows through a series of actors that enable the user to access data from a variety of sources, such as RESTful and WSDL web services.

Passive Data Collection - MEDCollector enables passive data collection through the extraction of e-mail messages and by enabling users to add collector records through services specifically designed for this function. Kepler, however, does not provide passive data collection actors that enable the creation of workflows for e-mail message collection or other types of passive data collection.

Local Storage - Data is stored locally, on the server, in a relational database. In MEDCollector, this is done through a database loading service designed specifically for this purpose, which connects to the database and inserts the data. Kepler also enables the user to connect to and query a relational database. However, Kepler requires the user to define the connection to the database, and therefore the user needs to have knowledge of specific details of the
database server, such as its address and user credentials for the database. In Kepler the user also needs to compose the query to be performed on the database, requiring the user to have technical knowledge of SQL (Figure 8). Furthermore, for Kepler to be able to access a database it needs specific endpoints for the Java Database Connectivity (JDBC) driver; e.g. MySQL is required to have the Connector/J plugin. In MEDCollector the user is not required to have technical knowledge of the database server, as all operations are performed by the back-end web services.

Use of Web Standards - The use of Web standards enables system interoperability and data gathering from multiple sources. Both MEDCollector and Kepler are able to use Web standards, however their implementations are significantly different. While MEDCollector uses dedicated services for the transformation of data received from its sources, Kepler accesses XML document elements through workflows that the user designs for that purpose, e.g. by assembling and disassembling XML documents and through arrays returned by XPath query actors (see Fig. 8).
Fig. 8. Workflows designed for the extraction of messages from the CDC Flu Updates RSS feed: (a) in MEDCollector; (b) in Taverna; (c) in Kepler. This feed contains only messages relating to influenza in the United States.
The issue with this second approach is that it leads to more complex workflows, with more actors, shifting the user's focus from collecting the data to transforming it. After data transformation, MEDCollector services use the same XML schema and character strings for inputs and outputs. Kepler is more complex, using several different data types, e.g. arrays, booleans, strings, XML elements, etc. For this reason the user needs to convert data types, depending on the inputs and outputs of each actor, to create the workflows he wants to design.

Data Provenance - Each workflow in MEDCollector is identified by a process id. This id is then used to reference each case collected by that workflow in the MEDCollector Repository. Using the View Workflows tab of the Dashboard, users can see the workflows that were used to collect data. Also, logging files are kept by Apache ODE at all times, so that it is possible to pinpoint any issues in a specific workflow or a faulty service. Taverna provides a detailed view of the workflow's execution in its "Results" tab. This tab enables the user to see the inputs and outputs of each service as well as its execution times. Kepler's provenance framework is still under development and its functionalities are not currently accessible to the general user.

Scheduling of Collection Activities - MEDCollector provides this function based on two approaches: scheduling of processes and prioritization of queries. The first enables users to define when a service is actively collecting data from its source, with its period of activity specified in minutes. The latter ensures that queries that yield the most messages are executed more often than queries that yield none. Kepler uses two types of entities, actors and directors. Actors perform specific functionalities while directors coordinate when the actors' functions are performed. The different types of directors define whether workflows are run synchronously or asynchronously, the number of iterations, and when they are executed.

Data Packaging - MEDCollector provides a Data Packaging tab on the Dashboard which offers a number of services that enable the extraction of detected cases from the MEDCollector Repository. Accessing the local repository is done through a Query service, to which the user provides a number of optional parameters for data filtering; data transformation is also provided through a set of services. In Kepler the user is required to write the query in SQL and to design the workflow to deal with all the data transformation, as described earlier. This is a complex task which requires the user to have technical knowledge of the database implementation as well as to deal with multiple data types and their transformation.
5.2 Non-functional Analysis
Here we evaluate MEDCollector from a non-functional perspective, comparing it to Kepler and Taverna. We focus on the non-functional aspects that can best be compared between these systems: performance, modularity and configurability.

Performance
All the workflows designed for this evaluation were executed separately, on the same machine, with a 2 GHz CPU and 1 GB of RAM, running the CentOS 5.5 distribution of the Linux operating system. MEDCollector services were hosted on the Apache 2.2.3 Web server. Apache ODE 1.3.3 was hosted on Apache Tomcat 6.0. Databases were hosted on MySQL Server 5. The server also included PHP 5.1.6 and Java 1.6.0_21. The results in Figures 9 and 10 are average run times over 5 executions of each workflow.

An initial analysis of the execution times shows an apparent disparity between Kepler's execution times and those of MEDCollector and Taverna throughout the different workflows. This disparity is explored below with some insights into the designed workflows and their executions.

CDC Flu Updates RSS Feed Workflow and Google Flu Trends Workflow
In these two workflows Kepler performed best, with only a fraction of the execution times of MEDCollector and Taverna. MEDCollector was slower, performing close to Taverna. While MEDCollector's and Taverna's functionalities are provided by services that require communication through a Web server, Kepler's actors are Java classes that are invoked directly in its engine. Evidence from these two workflows shows that this direct invocation results in lower latency between actor invocations and therefore improves the overall performance of workflows.

CDC Travel Notices RSS Feed Workflow
In this workflow Kepler performed worst, with MEDCollector and Taverna performing similarly. CDC's Travel Notices can have several distinct subjects, so messages have to be searched for evidence of locations and diseases. For this purpose messages are searched for the disease and location names in the MEDCollector repository. Both MEDCollector and Taverna use an RSS transformation service which, when not provided with a disease and location name, uses a binary search algorithm to search for those entities. Kepler's actors do not allow the replication of this algorithm, so this workflow has to iterate over all disease-location pairs for each message. In this workflow we have 89 diseases and 379 locations, which results
in 33,731 disease-location pairs that need to be searched for each message. The workflow run time therefore grows with the product of the number of messages, diseases and locations, which explains the dramatically larger execution time of the Kepler workflow.
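As a rough sketch of the difference (the entity lists, message handling, and method names below are illustrative, not the actual service code):

    import java.util.Arrays;
    import java.util.List;

    public class EntityLookup {
        // Kepler-style: test every disease-location pair against the message
        // (89 x 379 = 33,731 pair checks per message).
        static boolean pairScan(String msg, List<String> diseases, List<String> locations) {
            for (String d : diseases)
                for (String l : locations)
                    if (msg.contains(d) && msg.contains(l)) return true;
            return false;
        }

        // Service-style: binary-search each message token against sorted, lower-cased
        // name lists, so the work grows with the message length times the logarithm
        // of the list sizes instead of with the number of pairs.
        static boolean tokenLookup(String msg, String[] sortedDiseases, String[] sortedLocations) {
            boolean disease = false, location = false;
            for (String token : msg.toLowerCase().split("\\W+")) {
                if (Arrays.binarySearch(sortedDiseases, token) >= 0) disease = true;
                if (Arrays.binarySearch(sortedLocations, token) >= 0) location = true;
            }
            return disease && location;
        }
    }

Multi-word entity names would need additional handling; the sketch only illustrates why the pairwise iteration forced by Kepler's actors is so much more expensive.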
(a) Bar chart representing average workflow run times for CDC Travel Notices RSS Feed.
(b) Bar chart representing average workflow run times for CDC Flu Updates RSS Feed. Fig. 9. Performance charts of the CDC Travel Notices RSS Feed and CDC Flu Updates RSS Feed workflows. Average run times are presented in seconds rounded to three decimal places.
(a) Bar chart representing average workflow run times for Google Flu Trends.
(b) Bar chart representing average workflow run times for e-mail Message retrieval. Fig. 10. Performance charts of the Google Flu Trends and e-mail message workflows. Average run times are presented in seconds rounded to three decimal places.
E-mail Messages Workflow
While MEDCollector and Taverna use services developed to extract messages from an e-mail server, Kepler does not provide such functionality and is therefore excluded from this analysis. Taverna and MEDCollector did not perform significantly differently.
Overall, Taverna and MEDCollector had similar performance throughout the workflows. The reason is that they both use the same services and both invoke them through a Web server, so the latency between service invocations is very similar. MEDCollector's workflow execution engine is Apache ODE, which executes BPEL workflows and runs on Apache Tomcat. Taverna uses the Scufl workflow language [9] and its engine runs directly in the Java environment. These different back-end implementations explain the small differences in performance shown above. Kepler's actors are invoked directly as Java classes and functions, which results in shorter latency times and improves the general performance of its workflows. However, MEDCollector's services were specifically designed for epidemic data collection and in some cases they perform better than Kepler. Furthermore, Kepler does not provide the functionalities necessary for passive data collection, as seen in the E-mail Messages Workflow.

Modularity and Configurability
MEDCollector's modularity and configurability are provided by its services and the workflows' scheduling properties. The services were designed for epidemic data collection, performing the functionalities users need to achieve this task. This enables users to create workflows by focusing on the actions they need to perform to retrieve data from its sources. A service in MEDCollector can perform several actions on the data; e.g. the RSS transformation service accesses different nodes in the XML tree, can search the description of each RSS item for a location and a disease, and transforms the relevant items into MEDCollector records in another XML schema.

Taverna's functionalities, as well as its modularity and configurability, depend on the services it uses. Since it does not come packaged with a set of services that can be used for epidemic data collection, the services developed for MEDCollector were used in Taverna, giving it the same characteristics as MEDCollector in terms of modularity and configurability (see Figure 8a and 8b).

Kepler is modular and configurable at a lower level of abstraction. Kepler's workflows are focused on the data; each actor is a single action on that data. To perform the same actions as a single MEDCollector service, the user is required to use several actors in Kepler. For example, to perform the RSS transformation the user would need XPath query actors to access the XML nodes, database connection and querying actors to retrieve a list of diseases and locations, string matching actors to search for those entities, and boolean control actors to separate the actors that matched the entities from those that did not. The specificity of Kepler's actors enables it to perform a broader number of functions in scientific workflows; however, this excess of modularity for the purpose of epidemic data collection results in much more complex workflows, which shift the user's focus from data collection to individual data transformation actions (see Figure 8).
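To make the contrast concrete, the following sketch performs in plain Java roughly what the RSS transformation service encapsulates in a single step and what a Kepler user would otherwise spread over several XPath, database, and string-matching actors; the feed file name and the printed output are placeholders, and the entity matching and record generation are only indicated in comments.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class RssItemExtractor {
        public static void main(String[] args) throws Exception {
            // Parse an already-downloaded RSS document (the download step is omitted).
            Document rss = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("cdc-flu-updates.xml");
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xp.evaluate("/rss/channel/item", rss, XPathConstants.NODESET);
            for (int i = 0; i < items.getLength(); i++) {
                String title = xp.evaluate("title", items.item(i));
                String desc  = xp.evaluate("description", items.item(i));
                // A real service would now match disease and location names against the
                // repository and emit a record in the common XML schema; here we just print.
                System.out.println(title + " :: " + desc);
            }
        }
    }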
5.3 Discussion
Overall, all three systems have their strengths and weaknesses; however, MEDCollector fulfilled all the requirements identified in Section 3. The Kepler scientific workflow system is highly focused on data operations and performs best at simple tasks due to its data-driven architecture. However, collecting epidemic data from multiple Web sources requires a high level of data treatment and transformation, which in Kepler demands a large number of actors and results in very complex workflows. Furthermore, several of the actions require that users have knowledge of the technical implementations of servers, as is the case for database servers, and Kepler does not provide actors that allow the definition of passive data collection workflows.

Taverna is a workflow design and execution environment. It is similar to MEDCollector in that its functionalities are provided by Web services. However, since Taverna is general-purpose workflow design software, it does not provide a basic set of services; users must specify directly which Web services are used, through WSDL files. MEDCollector takes a similar approach, but it provides users with the basic services for epidemic data collection. Some of these services are designed specifically to simplify the system's use and to improve performance when searching for unknown disease or location entities. With MEDCollector's services the user can focus on collecting the data instead of on the specificities of complex data transformation tasks, as in Kepler. Furthermore, MEDCollector's interface is Web based and therefore does not require users to go through lengthy installations or to worry about machine-specific configurations, such as the definition of proxies.

In summary, MEDCollector is a workflow definition tool with a graphical interface that was specifically designed for epidemic data collection. It enables the flexible configuration of data collection mechanisms and is able to perform both active and passive data collection.
6 Conclusions and Future Directions
MEDCollector is implemented as a component of the EPIWORK project information platform, the Epidemic Marketplace. By enabling the collection and integration of data from multiple Web sources, MEDCollector provides epidemiologists with a novel means to gather data for use in epidemic modelling tools. While manual curation, such as that used by GPHIN, usually provides more accurate data resulting from direct human analysis, it is expensive and limited in the volume of data it can process. MEDCollector takes a different approach from GPHIN. Instead of manually classifying the data, it requires users to design automated data collection workflows for each specific source, adapting the extraction mechanism to the source's nature and thereby improving automatic data extraction from each individual source. These collection workflows retrieve all identified cases at their source and insert them into the MEDCollector Repository.
The foremost innovation brought by this system is that, through a browser-based Dashboard, epidemiologists can design simple Web service workflows dynamically using drag-and-drop components. This enables users to directly create and modify workflows to customize data collection mechanisms according to their specific needs. This dynamic design grants the system flexibility by enabling the addition of further functionality and the alteration of service interactions. While MEDCollector can perform less efficiently than Kepler, it enables users with less technical knowledge of the system's implementation to design workflows for epidemic data collection using fewer workflow components. Furthermore, MEDCollector's Dashboard enables users to interactively browse the workflows developed by other users and to extract the data collected by those workflows.

An interface layer that accommodates the configuration of dataset creation services enables users to select information from MEDCollector's Repository and structure it according to their needs, through XML transformation and querying. This transformation enables the creation of aggregated and consistent datasets which can be used by other applications. Through the use of Web Standards for data transmission, MEDCollector enables seamless integration of externally supplied Web services, granting extensibility to its basic features. This extensibility enables epidemiologists to design extraction mechanisms that suit their needs when extracting data from multiple Web sources.

Furthermore, EPIWORK's Epidemic Marketplace includes a data repository where datasets can be stored for later use by epidemic modelling tools. Mediator services were developed to enable applications to perform actions on the repository, such as uploading datasets. This mediator enables MEDCollector to submit packaged datasets for storage in the Epidemic Marketplace at regular time intervals through the design of data packaging workflows, as opposed to simply downloading them. The multisource epidemic data collector will therefore be a useful complement to national reporting systems and, hopefully, a contribution to disease propagation studies as well as disease control research.

A set of challenges remains to be explored in future work. While MEDCollector enables the collection and packaging of data, a set of visualization tools should be developed to improve the graphical information given to users upon the creation of datasets. This should provide users with the ability to quickly analyse the data so they can improve their datasets. MEDCollector extracts data about putative infections and references them geographically; the author of each extracted message is also referenced. This leads to a number of privacy-related issues, particularly considering that MEDCollector was developed for data sharing across the Web. Anonymization techniques should be developed and implemented for epidemic datasets in general, and MEDCollector in particular, to protect the privacy of the individuals the data represents.
Acknowledgements. The authors want to thank the European Commission for the financial support of the EPIWORK project under the Seventh Framework Programme (Grant # 231807), the EPIWORK project partners, CMU-Portugal partnership and FCT (Portuguese research funding agency) for its LaSIGE Multi-annual support.
References
1. Brownstein, J., Freifeld, C.: HealthMap: The development of automated real-time internet surveillance for epidemic intelligence. Euro. Surveill. 12(10), 71129 (2007)
2. Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M., Brilliant, L.: Detecting influenza epidemics using search engine query data. Nature 457(7232), 1012–1014 (2008)
3. Mawudeku, A., Blench, M.: Global Public Health Intelligence Network (GPHIN). In: 7th Conference of the Association for Machine Translation in the Americas, pp. 8–12 (2006)
4. Van Noort, S., Muehlen, M., Rebelo, A., Koppeschaar, C., Lima, L., Gomes, M.: Gripenet: an internet-based system to monitor influenza-like illness uniformly across Europe. Euro. Surveill. 12(7), 5 (2007)
5. Twitter, http://www.twitter.com/ (accessed June, 2011)
6. Silva, M.J., Silva, F.A., Lopes, L.F., Couto, F.M.: Building a digital library for epidemic modelling. In: Proceedings of ICDL 2010, The International Conference on Digital Libraries, February 23-27, vol. 1. TERI Press, New Delhi (2010)
7. Zamite, J., Silva, F.A.B., Couto, F., Silva, M.J.: MEDCollector: Multisource epidemic data collector. In: Khuri, S., Lhotská, L., Pisanti, N. (eds.) ITBAM 2010. LNCS, vol. 6266, pp. 16–30. Springer, Heidelberg (2010), http://dx.doi.org/10.1007/978-3-642-15020-3_2
8. Noronha, N., Campos, J.P., Gomes, D., Silva, M.J., Borbinha, J.L.: A deposit for digital collections. In: Constantopoulos, P., Sølvberg, I.T. (eds.) ECDL 2001. LNCS, vol. 2163, pp. 200–212. Springer, Heidelberg (2001), http://dx.doi.org/10.1007/3-540-44796-2_18
9. Li, P., Castrillo, J., Velarde, G., Wassink, I., Soiland-Reyes, S., Owen, S., Withers, D., Oinn, T., Pocock, M., Goble, C., Oliver, S., Kell, D.: Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially-expressed genes from microarray data. BMC Bioinformatics 9(334) (August 2008)
10. Gibson, A., Gamble, M., Wolstencroft, K., Oinn, T., Goble, C.: The data playground: An intuitive workflow specification environment. In: IEEE International Conference on e-Science and Grid Computing, pp. 59–68 (2007)
11. Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E., Tao, J., Zhao, Y.: Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience 18(10), 1039–1065 (2006)
12. Riedel, M., Memon, A., Memon, M., Mallmann, D., Streit, A., Wolf, F., Lippert, T., Venturi, V., Andreetto, P., Marzolla, M., Ferraro, A., Ghiselli, A., Hedman, F., Shah, Z.A., Salzemann, J., Da Costa, A., Breton, V., Kasam, V., Hofmann-Apitius, M., Snelling, D., van de Berghe, S., Li, V., Brewer, S., Dunlop, A., De Silva, N.: Improving e-Science with Interoperability of the e-Infrastructures EGEE and DEISA. In: International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, pp. 225–231 (2008)
13. Madoff, L., Yu, V.: ProMED-mail: an early warning system for emerging diseases. Clinical Infectious Diseases 39(2), 227–232 (2004)
14. European Center for Disease Prevention and Control (ECDC), http://www.ecdc.europa.eu/ (accessed June, 2011)
15. European Influenza Surveillance Network (EISN), http://www.ecdc.europa.eu/en/activities/surveillance/EISN/ (accessed June, 2011)
16. Marquet, R., Bartelds, A., van Noort, S., Koppeschaar, C., Paget, J., Schellevis, F., van der Zee, J.: Internet-based monitoring of influenza-like illness (ILI) in the general population of the Netherlands during the 2003-2004 influenza season. BMC Public Health 6(1), 242 (2006)
17. Durvasula, S., Guttmann, M., Kumar, A., Lamb, J., Mitchell, T., Oral, B., Pai, Y., Sedlack, T., Sharma, H., Sundaresan, S.: SOA Practitioners Guide, Part 2, SOA Reference Architecture (2006)
18. Garlan, D.: Using service-oriented architectures for socio-cultural analysis, http://acme.able.cs.cmu.edu/pubs/show.php?id=290
19. Alves, A., Arkin, A., Askary, S., Bloch, B., Curbera, F., Goland, Y., Kartha, N., Sterling, König, D., Mehta, V., Thatte, S., van der Rijn, D., Yendluri, P., Yiu, A.: Web Services Business Process Execution Language version 2.0. OASIS Committee Draft (May 2006)
20. Lopes, L.F., Zamite, J., Tavares, B., Couto, F., Silva, F., Silva, M.J.: Automated social network epidemic data collector. In: INForum - Simpósio de Informática (September 2009)
21. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucl. Acids Res. 32(suppl. 1), D267–D270 (2004), http://dx.doi.org/10.1093/nar/gkh061
22. GeoNames, http://www.geonames.org/ (accessed June, 2011)
23. Decker, B.: World Geodetic System 1984 (1986)
24. Miles, A., Matthews, B., Wilson, M., Brickley, D.: SKOS Core: Simple knowledge organisation for the web. DCMI 5, 1–9
25. Business Process Modeling Notation (BPMN) version 1.2. Tech. Rep. (January 2009), http://www.omg.org/spec/BPMN/1.2/PDF
26. Aboauf, E.: WireIt - a Javascript Wiring Library, http://javascript.neyric.com/wireit/ (accessed June, 2011)
27. Yahoo Pipes, http://pipes.yahoo.com/pipes (accessed June, 2011)
28. Sousa, J., Schmerl, B., Poladian, V., Brodsky, A.: uDesign: End-User Design Applied to Monitoring and Control Applications for Smart Spaces. In: Proceedings of the 2008 Working IFIP/IEEE Conference on Software Architecture (2008)
29. Yahoo User Interface Library, http://developer.yahoo.com/yui/ (accessed June, 2011)
30. Le Hors, A., Le Hégaret, P., Wood, L., Nicol, G., Robie, J., Champion, M., Byrne, S.: Document Object Model (DOM) Level 3 Core Specification. W3C Recommendation (2004)
31. The Apache Software Foundation: Apache Orchestration Director Engine, http://ode.apache.org/ (accessed June, 2011)
32. Google AJAX Language API, http://code.google.com/apis/ajaxlanguage/ (accessed June, 2011)
33. Witt, A., Sasaki, F., Teich, E., Calzolari, N., Wittenburg, P.: Uses and usage of language resource-related standards. In: LREC 2008 Workshop (2008)
34. cURL, http://curl.haxx.se/ (accessed June, 2011)
Supporting BioMedical Information Retrieval: The BioTracer Approach

Heri Ramampiaro (1) and Chen Li (2)

(1) Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), N-7491 Trondheim, Norway, [email protected]
(2) Department of Computer Science, University of California, Irvine (UCI), Irvine, CA 92697-3425, USA, [email protected]

(This article is a revised and extended version of the ITBAM 2010 paper [1].)
Abstract. The large amount and diversity of available biomedical information has put a high demand on existing search systems. Such a tool should be able not only to retrieve the sought information, but also to filter out irrelevant documents while giving the relevant ones the highest ranking. Focusing on biomedical information, this work investigates how to improve the ability of a system to find and rank relevant documents. To achieve this goal, we apply a series of information retrieval techniques to search in biomedical information and combine them in an optimal manner. These techniques include extending and using well-established information retrieval (IR) similarity models such as the Vector Space Model (VSM) and BM25 and their underlying scoring schemes. The techniques also allow users to affect the ranking according to their view of relevance. The techniques have been implemented and tested in a proof-of-concept prototype called BioTracer, which extends a Java-based open source search engine library. The results from our experiments using the TREC 2004 Genomic Track collection are promising. Our investigation has also revealed that involving the user in the search process will indeed have positive effects on the ranking of search results, and that the approaches used in BioTracer can be used to meet the user's information needs.

Keywords: Biomedical Information Retrieval, Evaluation, BioTracer.
1 Background and Motivation
The continuous increase in the amount of available biomedical information has resulted in a higher demand on biomedical information retrieval (IR) systems. While their use has helped researchers in the field to stay updated on recent literature, many of the existing search systems tend to be either too restrictive (returning results with a low recall) or too broad (finding results with a low precision). For this reason, there is a need to improve existing search systems, especially with respect to retrieval performance, in order to improve their precision and recall.

1.1 Challenges for Biomedical Information Retrieval
From the information retrieval point of view, there are many challenges in retrieving biomedical information. First, as in most scientific domains, there is a wide use of domain-specific terminology [2]. For example, most of the documents extensively use specific gene names, names of diseases and/or other biomedical-specific terms as part of the text. Most methods should take this into account to be able to successfully retrieve relevant information. As a result, it is challenging to offer a unified method for preprocessing, indexing, and retrieving biomedical information. Second, the mixture of natural English terms and biomedical-specific terms can pose problems due to high term ambiguity. A single word can have different meanings [3]. This ambiguity may in turn result in challenges for traditional IR methods, such as thesaurus-based extension of queries, as well as for identifying relevant documents. For this reason, there is a strong need for word sense disambiguation, which is not easy and is in itself an area of active research [3]. To illustrate, the term "SOS" normally refers to "urgent appeal for help", but it could also mean the gene symbol "SOS", short for "Son of sevenless" (see, e.g., http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=retrieve&dopt=default&rn=1&list_uids=6654). Another term illustrating this challenge is "NOT". It can refer to a protein (see, e.g., http://www.uniprot.org/uniprot/P06102), but in traditional IR it would be categorized as a stop word and thus might be ignored in the indexing process. Third, one of the problems with biomedical information is the lack of widely recognized terminology standards. New names and terms are created every time a biologist discovers a new gene or other important biological entities, and there often exist several inconsistent typographical/lexical variants [2]. Authors also normally have their own writing style [4], which can further worsen the situation. Fourth, partly as a result of the existence of several term variants, important words and symbols suitable for indexing often have a low occurrence frequency, and many terms appear only once in the entire document collection. While this may, in some cases, be useful since the discrimination effectiveness is often inversely proportional to the word's document frequency, it may also mean high data sparseness, which would, in turn, have negative effects on the retrieval performance [5]. In summary, current biomedical IR systems have to deal with heterogeneous and inconsistent information. For static document collections, the above characteristics would be less problematic. However, document collections can be very dynamic. As shown in Figure 1 (generated from the data at http://www.nlm.nih.gov/bsd/licensee/baselinestats.html), every day approximately 1000 new citation
Fig. 1. Number of MEDLINE citations from 1950 to 2009
records are added to MEDLINE (MEDLINE is a trademark of the U.S. National Library of Medicine; see http://www.nlm.nih.gov/databases/databases_medline.html). In addition, there is a challenge to measure document relevance to queries. While the notion of relevance based on, for example, the vector space model [6, 5, 7] is well-known and works very well for traditional IR, it needs improvements in biomedical IR, mainly because of the characteristics described above. As a result, the notion of relevance must be examined carefully, knowing that for a specific query, what is relevant for one biologist may not necessarily be relevant for another.

1.2 Objectives and Contributions
Our goal is to provide a generic biomedical IR system that can be adapted to meet different information needs. A basic requirement for such a system is that it must address the fact that users often have their own view of relevance. Moreover, the heterogeneity of biomedical information has made it important to have a customizable system, i.e., a system that (1) allows a user to specify relevant documents from the returned search results through user relevant feedback (URF), and (2) allows us to specify the similarity/language model and the importance of specific parts of a document. From this point of view, the main objective of this work is to investigate information retrieval methods that provide the best ranking of relevant results. More specifically, the main contribution of this work is improving existing, typically keyword- and parameter-based, similarity models, such as the BM25 model,
by enabling boolean queries and wildcard queries, and by applying a query-document correlation factor to affect the ranking with respect to the amount of overlap between a query and a document (see also Section 3). We provide an extensive evaluation of our method based on known evaluation principles and a proof-of-concept prototype implementing the method. Although we recognize their usefulness, we show that we can achieve the improvement without using a controlled vocabulary such as the Unified Medical Language System (UMLS; see http://www.nlm.nih.gov/research/umls/index.html) or the Medical Subject Headings (MeSH) [8].

1.3 Paper Outline
The remainder of this paper is organized as follows. In Section 2 we discuss what text operations can be used in preparation of index terms. In Section 3, we elaborate on our approach to find optimal scoring schemes and discuss the effects of applying boolean operations with BM25. In Section 4 we discuss how users can be involved in ranking of search results, through user relevant feedback and a fast, interactive user interface. Then, in Section 5 we present and discuss the results from testing the method and the prototype. Section 6 discusses related work. Finally, in Section 7 we summarize the paper and present future work.
2 Choosing Optimal Text Operations
Applying text operations in information retrieval is a process of choosing candidate words or tokens to index [5, 7]. Tokenization and stemming are examples of such operations. While tokenization is straightforward for general English text, this is not necessarily the case with biomedical information, due to the characteristics described in Section 1.1. Existing empirical studies on tokenizing biomedical texts further emphasize this [9, 10]. Both studies suggest applying conditional stemming as part of the normalization process after tokenization. Stemming means that we only index the root or the stem of a specific term, while at the same time being able to search all grammatical variants of that term, e.g., activate versus variants such as activates, activated, and activating. Based on the results from [9, 10], we believe that stemming would work if it were customized to biomedical texts. However, because the benefit versus cost of stemming is often debatable [5, 7], we chose to compare the effects of using stemming against omitting it. Our conclusion is that stemming did not give significant benefits in terms of retrieval performance. Further, for some types of queries it could change the semantic meaning of a query. For example, a query to find all proteins that are activated by a specific protein could also include a query finding all proteins that activate that protein. This would clearly be a different biological process. With conditional stemming we could take such semantic differences into account. However, this is beyond the scope of this work.
So instead of stemming, we considered the use of wildcard queries. With a wildcard query, we can choose a specific query term as a prefix. Then, we extend the original query with all terms in the index that have the same prefix. For example, a query with ferroportin* will find articles with ferroportin1 or ferroportin-1. This approach gives the users the freedom to choose when to extend the queries with the term variants. Our experiments also show its usefulness in terms of better retrieval performance. We also investigated the use of stop-word removal as part of the text operations. This process removes words or terms that do not contribute to discriminating documents. Our experiments show that by carefully choosing candidate words in this process, we can achieve good results and at the same time reduce the size of the index.
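As a sketch of how such a wildcard/prefix expansion can be expressed with the Lucene library the prototype builds on (the field name is a placeholder, the import paths follow a recent Lucene release, and this is not BioTracer's own code):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    public class WildcardExample {
        public static Query ferroportinVariants() {
            // Expands at search time to every indexed term that starts with
            // "ferroportin", e.g. "ferroportin1" or "ferroportin-1",
            // without having stemmed the index.
            return new PrefixQuery(new Term("abstract", "ferroportin"));
        }
    }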
3 Finding Optimal Scoring Scheme towards Optimal Document Rankings
To further address some of the challenges in Section 1, we need to develop a good ranking method. Existing scoring schemes are good for general purposes, but they need improvements to meet our needs in biomedical information. When developing our method and prototype, we investigated several different schemes and used them as a starting point towards an improved scheme. In addition to selecting suitable ranking models, we also try to find out how we can influence the ranking by boosting important parts of the document and by constraining documents based on their query matching degree.
3.1 Choosing Ranking Models
Several ranking or similarity models have been proposed over the years. One of the earliest, proposed by Salton and Buckley [6], was the Vector Space Model (VSM). It is a similarity model based on the cosine similarity between a specific query and the documents in the collection. Its main idea is that for all documents in the collection that contain the terms in the query, a ranking is produced by measuring the cosine of the angle between the query and the document vectors. For general text, it has been shown that this model gives good retrieval performance [5, 11]. Because VSM is one of the most used models, we applied it as a baseline in our experiments. In addition, we implemented an extension based on Lucene [12] (see http://lucene.apache.org/java/docs). Another similarity model that has become increasingly popular in information retrieval is Okapi BM25 [13]. In contrast to the VSM, BM25 is a probability-based approach. As with VSM, it has been shown to be effective on general text retrieval [7]. This is also partly why we decided to implement it in this work. Moreover, we want to know how it performs when used in a highly scientific domain such as the biomedical domain. We first implemented the baseline model
as an extension to Lucene. Then, as shown below, we extended it to take into account that parts of the documents have different boosts, which also affects the way the search results are ranked. Furthermore, in order to compare their performance in the biomedical domain, we implemented Lemur TF-IDF [14] and Divergence From Randomness BM25 (DFR-BM25) [15]. These are models developed as alternatives to the Okapi BM25 model [15, 14, 13]. Lemur TF-IDF is a variant of TF-IDF using a TF component that is similar to the one used in BM25. DFR BM25, on the other hand, is a probabilistic model in the same category as BM25. However, while BM25 uses some parameters to experimentally tune the model, DFR BM25 is non-parametric. As can be derived from the name, it assumes that the distribution of words in documents is a random process and that documents can be ranked by the probability that the actual term distribution found in a document would appear by chance.

Table 1. Summary of considered ranking models

Model           Description
VSM             A vector space model based on the TF-IDF weighting scheme.
Okapi BM25      A parametric probabilistic model using TF, IDF, and document length.
Lemur TF-IDF    A variant of TF-IDF that uses the same TF parameters as in BM25.
DFR BM25        A non-parametric model using term probability distributions as a base for the result ranking.
Extended VSM    Extends the standard VSM with query overlap factors, boolean operators, and wildcard queries.
Extended BM25   Extends the standard BM25 model with query overlap factors, boolean operators, and wildcard queries.
In addition to the above models, we have considered using language models (LM) for information retrieval [16, 11]. The basic idea of a language model is similar to the probabilistic model. However, instead of modelling the probability for a document to be relevant, as in a probabilistic model, in a language model we build a probabilistic model for each document in our collection and rank our documents with respect to the probability of the model generating the query. For this reason, this model would fit well with our goals on ranking documents based on the user's interactions. Studies have also shown that the LM competes well with the other models with respect to retrieval performance [16, 11]. Nevertheless, due to time constraints its use and integration with BioTracer have been left for future studies.

3.2 Boosting Specific Document Parts
Boosting specific parts in a document is one way to specify how important one part in a document is compared to other parts. By defining this importance,
we implicitly specify how the documents in the search results should be ranked. A document part can be (1) a structural part such as the title or the abstract, or (2) specific sentences or terms. Previous studies have shown that boosting structural parts of documents can improve the result quality [17, 18].

Structure-Based Boosting: By default, we perform search on both the title and the abstract fields of a document. These fields can be given boosts during index time or search time. A title can describe the content of a document well. Thus terms in a title can readily be assumed important and given a higher weight than those in other places in the document. For instance, our experiments show that weighting the title field twice as much as the abstract yields the best ranking. Another advantage of this approach is that by having different weights for these fields, we also account for the fact that many MEDLINE documents may only contain a title and no abstract. With the same weight, documents without an abstract would automatically be ranked lower.

Sentence and Term-Based Boosting: We also investigated the effect of boosting sentences and terms during index time. This has several advantages. First, it allows us to systematically specify which sentences or terms are to be weighted higher or lower than others. Second, specifying higher boosts or weights based on the importance of specific terms and sentences can help optimize the ranking of retrieved documents. The idea is to use techniques from the text mining and natural language processing (NLP) field, including named-entity recognition (NER) [19, 20], to identify special sentences and words. Such sentences could be those containing biology-specific names such as drug, gene and/or protein names, or biological processes such as protein-to-protein interaction and DNA evolution. The idea of identifying specific words or entities in texts is still an active research area [19, 20, 21] because of the diversity in biomedical text as discussed in Section 1.1. To the best of our knowledge, there has not been much work that addresses boosting sentences and words as part of a ranking strategy. By successfully identifying biomedical-specific words and sentences we can weight these higher than other general words and sentences. These weights will, in turn, contribute to ranking specific documents higher than other documents. Our preliminary experiments in this area have shown promising results [22]. This work is still at an early stage, and more work and experiments are needed to make it an integrated part of the current work.

Query-Time Term Boosting: In addition to index-time boosting, specific terms can be given higher boosts than other terms during search time. The main advantage of this method is that terms that a user considers to be more important can be given an additional weight, in addition to their statistical information, such as term frequency (tf) and document frequency (df). As an example, assume we want to find all articles describing the function of the protein ferroportin1. We could compose a query as ferroportin1^2.0 protein
function. This query tells the search engine that we would like to weight ferroportin1 twice as much as the words protein and function.
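For illustration, such a boost can be expressed directly in Lucene's query syntax; the following sketch assumes a recent Lucene release and a hypothetical abstract field, and is not BioTracer's own code.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class QueryTimeBoost {
        public static Query build() throws Exception {
            QueryParser parser = new QueryParser("abstract", new StandardAnalyzer());
            // "^2.0" doubles the contribution of ferroportin1 relative to the other terms.
            return parser.parse("ferroportin1^2.0 protein function");
        }
    }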
3.3 Constraining Documents Based on Their Query Matching Degree
Computing term weights: In general, a document can be given a score based on its similarity to the query:

    Score(q, d_j) = sim(q, d_j) = \sum_{i \in q} w_{ij} \cdot w_{iq}.    (1)

Here $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{sj})$, $w_{ij} \ge 0$, is the vector of term weights for a document $d_j$, and $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{sq})$, $w_{iq} \ge 0$, is a query $q$, where $s$ is the number of terms in the index. Now, assume $d_j$ consists of $n$ fields $f_{kj}$, $k = 1, 2, \ldots, n$. Taking the boosting values into account, we have

    \vec{d}_j^{\,*} = \sum_{k=1}^{n} \beta_k \cdot \vec{f}_{kj},    (2)

where $\beta_k$ denotes the boosting value for field $f_{kj}$. A special case can be defined as $\vec{d}_j^{\,*} = 2.0 \cdot \vec{f}_{1j} + 1.0 \cdot \vec{f}_{2j}$, where $f_{1j}$ and $f_{2j}$ are the title and abstract fields, respectively. Using the Okapi BM25 model [13], we assign each term in $d_j$ and $q$ the weights $w_{ij}$ and $w_{iq}$, respectively:

    w_{ij} = \log\frac{N - df_i + 0.5}{df_i + 0.5} \cdot \frac{(k_1 + 1)\, tf_{ij}}{K + tf_{ij}}
    \quad\text{and}\quad
    w_{iq} = \frac{(k_3 + 1)\, tf_{iq}}{k_3 + tf_{iq}},    (3)

where $K = k_1((1 - b) + b\,(L_{d_j}/L_{avg}))$, $N$ is the number of documents in the collection, $df_i$ is the document frequency, i.e., the number of documents in the collection in which term $t_i$ occurs, $tf_{ij}$ is the frequency of term $t_i$ within $d_j$, $L_{d_j}$ is the length of $d_j$, i.e., the number of terms in $d_j$, $L_{avg}$ is the average document length, and $k_1$, $k_3$ and $b$ are tuning constants with default values 1.2, 2.0 and 0.75, respectively (the values were set in accordance with the recommendation by Robertson and Sparck Jones [13]). Now, let $w_{ikj}$ be the weight of a term $t_i$ in field $f_{kj}$ of $d_j^*$. Then, applying the field boost to Eq. 3, we have [18]:

    w_{ikj} = \log\frac{N - df_i + 0.5}{df_i + 0.5} \cdot \frac{(k_1 + 1)\, \beta_k\, tf_{ij}}{K + \beta_k\, tf_{ij}},    (4)

where $K = k_1((1 - b) + b\,(L_{d_j}/L_{avg}))$.
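To make the quantities concrete, the following is a direct transcription of Eqs. (3) and (4) into code; the parameter defaults follow the text, but this sketch is not the prototype's own implementation.

    public class Bm25Weights {
        static final double K1 = 1.2, K3 = 2.0, B = 0.75;

        // Document-side weight of term i in field k of document j (Eq. 4).
        // beta is the field boost; with beta = 1 this reduces to Eq. 3.
        static double wikj(double N, double df, double tf, double beta, double Ld, double Lavg) {
            double K = K1 * ((1 - B) + B * (Ld / Lavg));
            double idf = Math.log((N - df + 0.5) / (df + 0.5));
            return idf * ((K1 + 1) * beta * tf) / (K + beta * tf);
        }

        // Query-side weight of term i (Eq. 3).
        static double wiq(double tfq) {
            return ((K3 + 1) * tfq) / (K3 + tfq);
        }
    }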
Query-Document Correlation Factor: In an attempt to improve the ranking of relevant documents, we have added a new factor based on the degree of query matching. The idea is to add a scoring factor based on how well a document matches a query. We call this constraining documents based on their query matching degree, or simply the query-document correlation factor. Such a constraining process is used in the literature. For instance, Lucene allows a form of document constraining too [12]. In our approach, however, we apply a slightly modified factor. In addition, we combine it with the similarity model BM25 [13]. To the best of our knowledge, using the idea of a document-query correlation factor with BM25 is new. Note that this factor does not directly help to increase the overall retrieval precision or recall; it is meant to influence the ranking of already retrieved results. Its main goal is to maximize the number of relevant documents in the top-k hits of the search results. Inspired by the scoring scheme in Lucene [12], we extended the original BM25 scoring equation by adding a document-query correlation factor, called $\Gamma(q, d_j)$, for a query $q$ and a document $d_j$. This factor is calculated based on how many terms in $q$ are found in $d_j$, and is proportional to the overlap between $q$ and $d_j$. In our approach, $\Gamma(q, d_j)$ is computed as follows:

    \Gamma(q, d_j) = \left( \frac{n_{over}(q, d_j)}{\max_l n_{over}(q, d_l)} \right)^{\theta}
                   = \left( \frac{\sum_{i=1}^{n} n_{over}(q, f_{ij})}{\max_l n_{over}(q, d_l)} \right)^{\theta}.    (5)

Here $\theta$ is a natural number and $n_{over}(q, d_j)$ is the number of terms in $q$ that overlap with document $d_j$. Over all documents in the result set, $\max_l n_{over}(q, d_l)$ is the maximum $n_{over}$ among the retrieved documents. Formally, $n_{over}(q, d_j)$ is defined as follows. Assume that $t_i$ is an index term, $i = 0, 1, \ldots, s$, where $s$ is the total number of terms in the index, i.e., $S = \{t_0, t_1, \ldots, t_s\}$ is the set of all distinct terms in the index. The overlap between a document $d_j$ and $q$ is then $n_{over}(q, d_j) = |\{t \in S \mid t \in d_j \cap q\}| = |d_j \cap q|$. We studied the effect of the factor $\Gamma$ on the retrieval results and the overall evaluation result. We observed that with a small value of $\theta$, the factor is too dominating and thus has unwanted effects on the ranking, i.e., relevant documents come too late in the result lists. Similarly, with a big value, $\theta$ will be too restrictive and the factor will have a bad influence on the ranking. Therefore, the value of $\theta$ has to be chosen carefully. After several experiments, we found that $\theta = 4$ gave optimal results. By combining Eq. 2, Eq. 4, and the extension of the similarity score from Eq. 1, we have:

    Score(q, d_j) = \Gamma(q, d_j) \cdot sim(q, d_j^*) = \Gamma(q, d_j) \cdot \sum_{\forall i \,\mid\, t_i \in q} w_{ikj} \cdot w_{iq}.    (6)
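To illustrate how $\theta$ shapes the factor, a small worked example with made-up numbers: suppose a query has five terms, a document matches four of them, and the best-matching retrieved document matches all five. Then

    \Gamma(q, d_j) = (4/5)^{4} \approx 0.41 \quad (\theta = 4), \qquad \text{versus} \qquad (4/5)^{1} = 0.80 \quad (\theta = 1),

so a larger exponent scales down partial matches much more strongly relative to full matches.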
3.4 Applying Boolean Queries with BM25
As with the Vector Space Model [6], BM25 was originally designed for keyword-based queries [13]. However, to be able to restrict the search results and filter out unwanted hits, we wanted to investigate the effect of using boolean operations in
combination with BM25. Therefore, we implemented our prototype to allow users to search with boolean operations, including AND, OR and NOT, in combination with BM25. The boolean query model has been used in many search engines [5, 7]. However, the original model is too limited for several reasons. First, it does not rank retrieved results based on relevance to the query, and it only retrieves results that exactly match the query. To address this problem, we investigated the effects of combining boolean queries with BM25. The benefit is that we can filter the search results based on the boolean operations in the query and at the same time rank them with the BM25 model. Second, a general criticism of boolean queries is that users might be unwilling, or even lack the ability, to compose this type of query. While we recognize that this concern is valid, we still believe that a system has to provide the possibility of processing boolean queries. In fact, a PubMed query log [23] with approximately 3 million queries shows that 36.5% of the queries were boolean queries. Although many of these queries might be computer generated (e.g. by automatic query extension), this is strong enough motivation for investigating the effects of boolean queries. Moreover, although boolean queries do pose some challenges when it comes to finding all relevant documents in the collection, our experiments (see also Section 5.2) show that we can improve the average top-100 precision by 17% compared to the baseline BM25. This result means that boolean queries combined with BM25 can have a good effect on the search precision.
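As an illustration of the combination, the sketch below filters with boolean clauses while ranking with BM25, using a recent Lucene release for convenience (the prototype used its own BM25 extension, which is not shown; the terms and field names are placeholders).

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similarities.BM25Similarity;

    public class BooleanBm25 {
        public static TopDocs search(IndexSearcher searcher) throws Exception {
            searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
            BooleanQuery.Builder q = new BooleanQuery.Builder();
            q.add(new TermQuery(new Term("abstract", "ferroportin")), Occur.MUST);     // AND
            q.add(new TermQuery(new Term("abstract", "transport")),   Occur.SHOULD);   // OR
            q.add(new TermQuery(new Term("abstract", "mouse")),       Occur.MUST_NOT); // NOT
            // The boolean clauses restrict the candidate set; BM25 still ranks the survivors.
            return searcher.search(q.build(), 100);
        }
    }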
4 Other Features

4.1 User Relevant Feedback (URF)
As suggested in our previous discussion, there are many ways to involve a user in influencing the ranking of documents. Involving the user through relevance feedback has been studied within traditional and multimedia information retrieval for decades. However, within the biomedical domain it still seems to be missing. A challenge here is to choose the right strategy. In our work, we have chosen to investigate the use of URF based on the scoring scheme suggested by Robertson and Sparck Jones [13]. In addition, we apply the extended BM25 as described earlier. That is, we also extended the original model with the document-query correlation factor $\Gamma(q, d)$. The scoring scheme equation is similar to the one in Eq. 6, but now the term weight includes the relevance information added by the users. Based on the ideas of Robertson and Sparck Jones [13] and Eq. 4, this weight can be expressed by

    w_{ikj} = \log\frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \cdot \frac{(k_1 + 1)\, \beta_k\, tf_{ij}}{K + \beta_k\, tf_{ij}},    (7)

where $K = k_1((1 - b) + b\,(L_{d_j}/L_{avg}))$, $r$ is the number of retrieved relevant documents, and $R$ is the number of relevant documents in the collection for a specific query.
Here, the value of R can be derived from users' click-through information in the query log. Click-through information may work as relevance judgement information in that we can assume that all documents that a user has clicked are relevant to a specific query [7]. This means that we can estimate the number of relevant documents based on the number of documents that are known to be relevant to the user. Although R may in this case be much smaller than the real total number of relevant documents in the collection, this estimation can still be reasonable [13], presuming that these R relevant documents are a subset of the set of all relevant documents. To be able to evaluate this approach, we simulated the user relevant feedback process using the relevance information from the test collection.
4.2 Interactive Autocompletion
The primary goal of a search system is to provide the best possible retrieval performance in terms of good search results, such as high recall and precision values. However, we argue that providing a tool that is able to return results within a reasonable time frame is crucial. Users often require an interactive system when performing their searches. For this reason, keeping the search time as low as possible should be an important goal for every retrieval system. As a result, we studied how to make our system interactive. Our system has the following features. First, BioTracer provides a responsive and interactive autocompletion, which is activated by each keystroke. This feature can help users compose their queries. Furthermore, as opposed to autocompletion in existing systems, autocompletion in BioTracer not only provides possible prefix-based suggestions, but is also fault-tolerant. This means that it can find answers even if the prefix in a query has spelling mistakes. Figure 2 illustrates how this feature is implemented in BioTracer. These features are useful since quite often a user only vaguely knows what she/he is looking for. Thus, by allowing interactive browsing, we let the user decide which suggested keywords are the correct ones. To enable this feature, we implemented the system as follows (see also Figure 3).

Index term extraction. We use the index term dictionary to build the autocompletion index. To do this, we first extract all unique terms and their corresponding weights from the document collection index.

Building the index. To enable keystroke autocompletion, we use an n-gram index strategy. This strategy allows us to do matching against term prefixes. The maximal gram length is 20. As an example, let protein be a term that we want to index and assume that we use 2-grams. Then, our gram index would be: p, pr, ro, ot, te, ei, in, in addition to protein.

Prefix and fuzzy queries. The use of n-gram indexes enables prefix queries and allows us to retrieve terms that share their prefixes with input sequences of characters. In addition, to allow higher fault tolerance, we also allow fuzzy queries. This feature allows us to retrieve terms that are similar to the typed one. The closeness is measured using the Levenshtein distance (also called edit distance).
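A plain-Java sketch of the two ingredients, gram generation and edit distance, purely to make the idea concrete (this is not the indexing code used in BioTracer):

    import java.util.ArrayList;
    import java.util.List;

    public class AutocompleteSketch {
        // Character n-grams of a term; with n = 2, "protein" yields pr, ro, ot, te, ei, in.
        // The paper's index additionally keeps shorter edge grams such as "p" and the full term.
        static List<String> grams(String term, int n) {
            List<String> out = new ArrayList<>();
            for (int i = 0; i + n <= term.length(); i++) out.add(term.substring(i, i + n));
            out.add(term);
            return out;
        }

        // Levenshtein (edit) distance used for the fault-tolerant fallback.
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return d[a.length()][b.length()];
        }
    }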
Fig. 2. Illustration of the autocompletion feature
Ranking. Once retrieved, the terms are ranked based on several factors. We mainly rank them by their term weights stored in the main index. As a user types, words with the most similar prefixes are retrieved first. Then, for terms that do not share prefixes, we use edit distance to order
Fig. 3. Summary of the approach to achieve interactive autocompletion
terms that are related as specified by the fuzzy query. Term weights are the TF-IDF weight when we use the vector space model (VSM), while we use the weight in Eq. 4 for our BM25-based ranking model.
5 Evaluation

5.1 Prototype Implementation
In order to test our ideas, we implemented a proof-of-concept prototype called BioTracer. Instead of re-inventing the wheel, we decided to use existing open source libraries to implement the prototype. Because of the requirements for performance, extensibility, simplicity, and scalability, we chose Java Lucene. Figure 4 shows the BioTracer architecture. We use Lucene as a basis for indexing documents and handling queries, extended with a graphical user interface (GUI) based on JSP (Java Server Pages) and AJAX (Asynchronous JavaScript and XML). The index is constructed mainly based on the MEDLINE database, which is parsed and handled by the Document Handler. To facilitate the parsing and generation of Lucene-friendly documents, we use the LingPipe MEDLINE parser (see http://alias-i.com/lingpipe/) as a base for implementing the Document Handler. This handler also interacts with both the Language Model Handler and the IndexManager to index documents based on a chosen language or similarity model. All searches are logged in the Search Log repository. This is used to "learn from experience", allowing BioTracer to use data from previous searches and user choices/interactions (e.g., URF). The Log is implemented using MySQL. Figure 5 shows the implemented Web-based GUI, which also illustrates the URF possibility described above. Each returned relevant hit can be marked by the user. The system will then use this information to re-rank the results. Note that due to US National Library of Medicine (NLM) license regulations, BioTracer is not allowed to provide abstracts directly from MEDLINE. However, we have implemented a workaround that still allows users to browse the abstracts
Fig. 4. The BioTracer architecture

Fig. 5. Screen dump showing the BioTracer search environment
inside BioTracer. The solution uses the Entrez Web Services interfaces in combination with AJAX. This allows us to retrieve a specific abstract from PubMed based on the PubMed ID (if available) and process it. In this way, we are able to let users study the abstracts inside BioTracer, rather than going to another Web page. Moreover, if the user wants to examine more than one abstract, our AJAX implementation allows several PubMed abstract windows to be popped up at a time.
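As a sketch of the kind of request involved, the following fetches a plain-text abstract for a given PubMed ID via the public NCBI E-utilities efetch endpoint; the exact calls and processing performed inside BioTracer are not shown.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class PubMedAbstractFetcher {
        public static String fetchAbstract(String pubmedId) throws Exception {
            // Standard efetch parameters: db=pubmed, rettype=abstract, retmode=text.
            URL url = new URL("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                    + "?db=pubmed&rettype=abstract&retmode=text&id=" + pubmedId);
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }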
5.2 Experimental Results
Experimental Setup: To test the retrieval performance of BioTracer, we performed the evaluation based on the TREC 2004 Genomics Track [24] test collection. The main reason for choosing this collection is the range of topics covered, which span from gene name searches to disease searches. It allows us to do a comprehensive evaluation of BioTracer's retrieval performance. The TREC 2004 collection originally consists of 4,591,008 MEDLINE documents. Our evaluation was based on a subset of 42,255 documents that were judged against 50 topics. Each topic consists of title, need, and context fields. The queries were generated manually from the topics using all the fields. The runs were carried out as follows. First, we tested the generated queries against an off-the-shelf Lucene similarity model. Then, we used the same queries on the extended models that we discussed earlier. To evaluate the retrieval performance, we used the same approach implemented for the Text REtrieval Conference (TREC), that is, focusing on TREC mean average precision (MAP). For each query, we obtain the average precision by computing the precision value for the set of top-k documents after each relevant document is retrieved. Thus, MAP is the mean of the average precision values over all the queries. MAP has been widely used as the measure to evaluate ad hoc retrieval results [25, 26]. It has also been used in the TREC Genomics Track evaluation [24]. Since MAP is sensitive to the rank of every relevant document, it reflects well the overall ranking accuracy. For this reason, the best result is the one with the highest MAP values. Further, since users are generally most interested in the results within the top 100 retrieved documents, we have stressed measuring BioTracer's ability to find relevant data among the top 100 hits. Therefore, we specifically measured the precision at 10 retrieved documents (P@10) and the precision at 100 (P@100). To make our evaluation as independent as possible and to be able to compare it with previous results, we used Buckley's trec_eval program (see http://trec.nist.gov/trec_eval/). Using trec_eval, the maximum number of allowed retrieved documents for each query was, by default, set to 1000. For each run, documents without any specific judgement were treated as non-relevant.

Results: Table 2 summarizes the results from our evaluation. The first column is the result from running the baseline method using the Lucene TF-IDF-based
similarity model (VSM). The next column shows the results after we modified this model with the same extensions as those for BM25. Furthermore, the third column contains the result from running the baseline BM25 model. The results from our extensions in Sections 2, 3, and 3.4 are presented in the “Boolean+BM25” column, and the results from using user relevance feedback are shown in the “URF BM25” column. In addition to Table 2, Figure 6 shows the average precision values with respect to the number of retrieved documents and the method used. The different numbers of retrieved documents are represented by precision points. As an example, a precision point of 20 means that we calculated the average precision at 20 retrieved documents. Note that in addition to these results, we have tested the BioTracer prototype using the Divergence From Randomness (DFR) BM25 [15] and Lemur TF-IDF [14] models with boolean and document boosts. However, the results from these runs were not significantly different from those of Boolean BM25. Therefore, they are not included in this paper.

Table 2. Results from runs based on the pooled TREC 2004 corpus

               Baseline Lucene TFIDF   Custom TFIDF   Baseline BM25   Boolean+BM25   URF BM25
MAP            0.346                   0.4975         0.398           0.5122         0.5129
R-precision    0.3837                  0.5443         0.4547          0.558          0.5527
P@10           0.534                   0.652          0.594           0.666          0.712
P@100          0.3464                  0.4552         0.3974          0.469          0.4734
Total recall   0.67                    0.737          0.637           0.735          0.738
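For readers unfamiliar with the reported measures, the following is a minimal Python sketch of how average precision, MAP and P@k can be computed from a ranked result list and a set of relevance judgements. It only illustrates the definitions used above on hypothetical document identifiers; it is not the trec_eval implementation.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values obtained at each rank where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precision values.
    `runs` maps a query id to (ranked_ids, relevant_ids)."""
    aps = [average_precision(r, rel) for r, rel in runs.values()]
    return sum(aps) / len(aps) if aps else 0.0

# Toy example with two queries and invented document ids.
runs = {
    "q1": (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    "q2": (["d5", "d4"], {"d5"}),
}
print(mean_average_precision(runs))                               # AP(q1)=0.5, AP(q2)=1.0 -> MAP = 0.75
print(precision_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2"}, 2))  # P@2 = 0.5
```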
The results in Table 2 and Figure 6 show that there are only slight differences among the extended runs, although URF BM25 had the best overall retrieval performance. These differences were most obvious for the top-15 hits. We can further see that the extended models performed much better than the baseline ones, which shows the effect of applying our extensions. For the TF-IDF model (i.e., the vector space model), the improvement of our custom TFIDF over the baseline model was 43.8%. The improvements for the BM25-based models were 28% and 29%, respectively, from the baseline model to the extended BM25 model and from the baseline model to URF BM25. Judging from these results, adding the extensions is useful in terms of retrieval performance. Further, comparing with the results from the TREC 2004 Genomics Track [24] and assuming similar experiments, all of our extended models performed better than the best run from that track, where the MAP, P@10 and P@100 were 0.4075, 0.604 and 0.4196, respectively [24]. Despite the above promising results, there are issues that we have to address. First, our evaluation was based on a somewhat “closed” test collection. Although we have argued for its comprehensiveness, we recognize that such a collection can hardly cover all aspects of users’ needs in real search environments.
Fig. 6. Graphs for top-k precision values: average precision values plotted against precision points for the TFIDF runs (Baseline TFIDF, Ext. TFIDF, TREC) and the BM25 runs (Baseline BM25, URF BM25, Ext. BM25, TREC)
Combined with the fact that the BioTracer prototype can be seen as a highly interactive information search system, this makes the generality, completeness and validity of the collection debatable. A way to make our evaluation more general is to involve real users, and to study how they interact with BioTracer as well as their assessment of the retrieval performance. However, it is still not possible to guarantee a perfect evaluation [27], since we would have to rely on several factors [28] such as the users’ willingness to answer the interviews, the completeness of the questionnaire, and the breadth of the user group. Execution Speed: With respect to execution speed, although we argued the necessity of having an information retrieval system that can handle high-speed search (see Section 4.2), it has not been the main focus of this work. Rigorous performance tests, including comparisons with other approaches, are needed to assess our system performance. Still, our informal tests have shown promising results. We ran our tests on 4.5 million documents, a subset of MEDLINE entries, using an Apple iMac with a 2.93 GHz Intel i7 CPU (quad core) and 8 GB RAM. Using the aforementioned queries from the PubMed one-day query log [23] (see Section 3.4), the search speeds were measured as follows. First, focusing on the speed as a function of the document frequency, which varied from 1 document to 1.3 million documents with one term per query, the execution time was constantly below 100 ms; more specifically, it was 68 ms at maximum and 3.6 ms on average. Second, focusing on the speed as a function of the number of query terms, our system managed to run our queries within 150 ms on average when we varied the query lengths from 1 to 25 terms, whereas the maximum was 657 ms. We observed the slowest responses when the number of query terms was higher than 20. We believe the main reason for this is the way Lucene parses and processes the queries.
In summary, the above results show that our system is able to execute queries very fast, independent of the number of returned results, as long as we keep the number of terms in the query low (i.e., below 20 terms).
6 Related Work
There are many methods and systems suggested and developed in the biomedical information retrieval research domain. Most previous work has been presented in several TREC Genomic Track conferences, including the aforementioned TREC 2004 [24]. Other approaches applying tokenization heuristics were studied by Jiang and Zhai [10]. Their evaluation has shown that these methods can successfully improve the retrieval performance. Like ours, many existing approaches have focused on expanding queries [24, 29]. However, we have attempted to go further by combining the use of different ranking models that can be adapted to different types of information needs, boosting different parts in the documents, applying a query-document correlation factor, using wildcard queries as query extension rather than stemming, using boolean queries with the BM25, and involving the users by user relevance feedback. Concerning retrieval systems, there are several related systems. Because these systems mainly use proprietary ranking and indexing strategies, our discussion will focus on the system features and functionality. First, perhaps the most used retrieval tool is PubMed10 . PubMed provides several useful features for searching biomedical publications, such as the use of MeSH terms to expand queries and the possibility to access related articles. However, to our best knowledge, it does not provide ranking based on document relevance, per se. Instead, PubMed seems to use a proprietary algorithm to order the search results based on publication date, author names, and journals. Nevertheless, a more useful Web-based GUI extension to PubMed is available, called HubMed [30]. Like BioTracer, HubMed implements web service facilities for PubMed. Another promising system is iPubmed [31]. The iPubmed system focuses on providing high performance and interactive search interface to MEDLINE data, through a ”search-as-you-type” feature. It means that a search is initiated as soon as a keystroke is hit. Another useful feature of this system is their implementation of fuzzy search, allowing correct retrieval of documents even though users have misspelled their queries. Somehow, our autocompletion feature is inspired by this system with respect to instant and fuzzy search. The main difference is how the search results are retrieved and the ranking strategy applied. Third, ScienceDirect is a search system11 . It makes use of the powerful Scirus search facility. Scirus uses a more sophisticated ranking algorithm, which seems to produce more relevant results than PubMed. To our best knowledge, the main difference with BioTracer is that ScienceDirect/Scirus does not allow any 10 11
10 See http://www.pubmed.org
11 See http://www.sciencedirect.com
kind of user relevance feedback to refine the search. It is also worth noting that ScienceDirect is a commercial system that covers a much larger scientific area than BioTracer. Therefore, the search results from this system often include documents other than biomedical ones. A fourth system worth discussing is Textpresso [32], which is also a search engine allowing specialized search of biomedical information. In contrast to BioTracer, however, Textpresso extensively uses ontologies in the retrieval process. For a biomedical text, each word or phrase is marked with a term from the ontology when it is indexed. Its ranking algorithm is based on the frequencies of the queried index terms, which means that the document containing the largest number of query terms is ranked at the top. To the best of our knowledge, Textpresso does not offer any user relevance feedback feature. Fifth, BioIE [33] is a rule-based information retrieval system that extracts informative sentences from the biomedical literature, such as MEDLINE abstracts. It uses MEDLINE as its main underlying searchable information and, in addition, allows users to upload their own text to be made searchable. Both statistical analysis (word distribution, filtered word distribution, N-gram distribution and MeSH term distribution) and sentence extraction can be performed. BioIE extracts informative sentences based on predefined templates for a user-specified category. The ranking algorithms of BioIE are based on the number of occurrences of query terms, much like Textpresso. BioIE does not support user relevance feedback.
7 Conclusion and Future Work
In this paper we presented our ongoing work towards the development of an information retrieval system called BioTracer. It is a search prototype that allows customized ranking of search results by boosting specific parts of documents, customizing scoring schemes, and allowing users to affect the ranking through user relevance feedback (URF). In summary, we have investigated the effect on retrieval performance and result ranking of using and extending existing models, such as TF-IDF-based models and BM25, with support for boolean queries, boosting of documents and document parts, and URF. With URF and the ability to customize the retrieval process, users gain more flexibility to specify their own views of relevance. Thus we can address the challenges that we face in searching biomedical information. In this respect, the major contribution of this work is the development of the BioTracer search system, providing adaptable searching and ranking of biomedical information. Although we built the system on existing techniques, to the best of our knowledge the way we integrated and extended them in one system is unique. Moreover, we have carried out experiments with the TREC collection to investigate the system’s retrieval performance. We believe that these experiments on biomedical information, leading to our comparison of the different ranking models and their retrieval characteristics, are in themselves an interesting result. Nevertheless, focusing on BioTracer as a whole, the main conclusion from these experiments is that BioTracer is a tool
that is able to retrieve relevant information, and that our extensions have helped improve the retrieval performance. There are still challenges left for further studies. First, the system has only been tested against a TREC corpus. As discussed earlier, the strength and the validity of such a test can be improved. Therefore, we recognize the necessity of doing more empirical experiments with real users in a realistic environment. This additional work will reveal areas where our work can be improved. Second, we will investigate the effect of including natural language processing (NLP) methods in the document handling process. This includes applying named-entity recognition (NER) to identify important keywords and sentences, thus further improving the way parts of a document are boosted. We are also investigating the use of controlled vocabularies such as MeSH [8], both for query term boosting and for query expansion. Finally, we will include a text classification facility. Our preliminary experiments based on Support Vector Machines (SVM) have so far shown good potential in terms of increased search precision. However, more work is needed to make such approaches efficient enough that they do not degrade the overall system performance. Acknowledgements. The authors would like to thank Jon Olav Hauglid, Roger Midtstraum, Reidar Conradi, and Min-Yen Kan for their suggestions to improve this paper.
References 1. Ramampiaro, H.: BioMedical information retrieval: The bioTracer approach. In: Khuri, S., Lhotsk´ a, L., Pisanti, N. (eds.) ITBAM 2010. LNCS, vol. 6266, pp. 143– 157. Springer, Heidelberg (2010) 2. Krauthammer, M., Nenadic, G.: Term identification in the biomedical literature. Journal of Biomedical Informatics 37(6), 512–526 (2004) 3. Chen, L., Liu, H., Friedman, C.: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 21(2), 248–256 (2005) 4. Netzel, R., Perez-Iratxeta, C., Bork, P., Andrade, M.A.: The way we write. EMBO Reports 4(5), 446–451 (2003) 5. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval. AddisonWesley Longman Publishing Co., Inc., Boston (1999) 6. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5), 513–523 (1988) 7. Croft, B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice, 1st edn. Addison-Wesley, Reading (2009) 8. Lowe, H.J., Barnett, G.O.: Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. JAMA 271(14), 1103–1108 (1994) 9. Trieschnigg, D., Kraaij, W., de Jong, F.: The influence of basic tokenization on biomedical document retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, p. 803 (2007) 10. Jiang, J., Zhai, C.: An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval 10(4-5), 341–363 (2007)
11. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley Longman Publishing Co., Inc., Boston (2011) 12. Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning Publications Co., 209 Bruce Park Ave., Greenwich, CT 06830 (2005) 13. Robertson, S.E., Jones, K.S.: Simple proven approaches to text retrieval. Technical Report 356, University of Cambridge (1994) 14. Zhai, C.: Notes on the lemur TFIDF model. note with lemur 1.9 documentation. Technical report, School of CS, CMU (2001) 15. Amati, G., Rijsbergen, C.J.V.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems 20(4), 357–389 (2002) 16. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 275–281. ACM, New York (1998) 17. Wilkinson, R.: Effective retrieval of structured documents. In: Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), pp. 311–317. Springer-Verlag New York, Inc., New York (1994) 18. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 extension to multiple weighted fields. In: CIKM 2004: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 42–49. ACM, Washington, D.C., USA (2004) 19. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1), 57–71 (2005) 20. Leser, U., Hakenberg, J.: What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics 6(4), 357 (2005) 21. Kabiljo, R., Clegg, A., Shepherd, A.: A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinformatics 10(1), 233 (2009) 22. Johannsson, D.V.: Biomedical information retrieval based on document-level term boosting. Master’s thesis, Norwegian University of Science and Technology (NTNU) (2009) 23. Herskovic, J., Tanaka, L., Hersh, W., Bernstam, E.: A day in the life of PubMed: Analysis of a typical days query log. Journal of the American Medical Informatics Association 14(2), 212–220 (2007) 24. Hersh, W.R., Bhupatiraju, R.T., Ross, L., Roberts, P., Cohen, A.M., Kraemer, D.F.: Enhancing access to the bibliome: the trec 2004 genomics track. Journal of Biomedical Discovery and Collaboration 1(3), 10 (2006) 25. Yilmaz, E., Aslam, J.A.: Estimating average precision when judgments are incomplete. Knowledge and Information Systems 16(2), 173–211 (2008) 26. Voorhees, E.M.: On test collections for adaptive information retrieval. Inf. Process. Manage. 44(6), 1879–1885 (2008) 27. K¨ aki, M., Aula, A.: Controlling the complexity in comparing search user interfaces via user studies. Information Processing and Management 44(1), 82–91 (2008); Evaluation of Interactive Information Retrieval Systems 28. Kelly, D., Harper, D.J., Landau, B.: Questionnaire mode effects in interactive information retrieval experiments. Information Processing and Management 44(1), 122–141 (2008); Evaluation of Interactive Information Retrieval Systems 29. Abdou, S., Savoy, J.: Searching in Medline: Query expansion and manual indexing evaluation. Information Processing & Management 44(2), 781–789 (2008)
30. Eaton, A.D.: Hubmed: a web-based biomedical literature search interface. Nucleic Acids Research 34(Web Server issue), W745–W747 (2006) 31. Wang, J., Cetindil, I., Ji, S., Li, C., Xie, X., Li, G., Feng, J.: Interactive and fuzzy search: a dynamic way to explore medline. Bioinformatics 26(18), 2321–2327 (2010) 32. Muller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2(11), e309 (2004) 33. Divoli, A., Attwood, T.K.: BioIE: extracting informative sentences from the biomedical literature. Bioinformatics 21, 2138–2139 (2005)
Electronic Health Record Data-as-a-Services Composition Based on Query Rewriting

Idir Amine Amarouche1, Djamal Benslimane2, Mahmoud Barhamgi2, Michael Mrissa2, and Zaia Alimazighi1
1 Université des Sciences et de la Technologie Houari Boumediene, BP 32 El Alia, 16111 Bab Ezzouar, Alger, Algeria
[email protected], [email protected]
2 Université de Lyon, CNRS, Université Lyon 1, LIRIS UMR5205, 43 bd du 11 novembre 1918, Villeurbanne, F-69622, France
{firstname.lastname}@liris.cnrs.fr
Abstract. Due to the large development of medical information systems over the last few years, there is today a strong need for an infrastructure that uniformly integrates the distributed and heterogeneous collections of patient data to deliver value-added information to healthcare professionals at the points of care. The adoption of Electronic Health Records (EHRs) and Web services as a software infrastructure has become an extremely important prerequisite for patient data integration. In this paper we propose a semantic-enabled architecture for the automatic composition of EHR (Electronic Health Record) DaaSs (Data-as-a-Service). In our architecture, DaaSs are selected and composed automatically to resolve the user queries (i.e. queries posed by physicians, nurses, etc) using a query rewriting approach. Our proposed approach can also handle the semantic conflicts of data exchanged among component services in an EHR DaaS composition by deriving and applying automatically the necessary data conversions. Keywords: Electronic Health Record (EHR), Data as a Service (DaaS), query rewriting, semantic annotation, composition, mediation.
1 Introduction
Our current health environment is characterized by a shared and distributed localization of patient information. Patients’ data are spread across several autonomous, proprietary and heterogeneous information systems. The adoption of electronically formatted patient data with Electronic Health Records (EHR) has become the primary concern for a broad range of health information technology applications and practitioners. According to Healthcare Information and Management Systems Society [22] the Electronic Health Record (EHR) of an individual consists of a collection of lifetime health data in electronic format, generated during relevant interactions with the healthcare system.
In this context, one of the big challenges of the health actors is the communication, the sharing (via exchange and integration) of EHR data through several independent and heterogeneous health systems. Thus, the EHR data need to be available, discoverable, searchable and comparable by a connected group of care providers and health organizations. For this reason, many efforts have been undertaken to identify the requirements and information architectures needed to support shared Electronic Health Records. These research projects have focused on supporting the care given to patients by promoting good designs for EHR systems and standards for the secure communication of part or all of a patient’s EHR between authorized systems [4]. Also, many e-health systems already provide the possibility to export their data in standardized formats such as CEN TC251, openEHR, and HL7-CDA [3] which provide different ways to structure and markup the EHR data for exchange purpose. For this, the adoption of standardized Electronic Health Records has become an extremely important prerequisite for bringing interoperability and effective data integration to the healthcare industry [22]. However, these EHR-related standards have centered on the communication of parts of the EHR of an individual subject of care and deal with the patient data at the document level. Furthermore, many of the users’ query requirements target the contents of the clinical documents [26] and little work has been done to date on defining a generic means of querying EHR systems, as distributed repositories, in a consistent way [4]. Today there is increasing interest in moving towards a Service-Oriented Architecture for EHR data sharing among independent health information systems [28] [19] [23] [22]. Web service technology can be used as a standardized way for accessing and sharing the EHR data over healthcare information systems. This type of services is known as Data-as-a-Service (DaaS). DaaSs allow for a query-like access to organizations’ data sources and do not change the state of the world [34] [37]. In this paper we use the term EHR DaaS to denote the DaaS that provides EHR data (or parts of). While individual DaaSs may provide interesting medical information alone, in most cases, users’ queries require the composition of multiple DaaSs. Furthermore, as there are several EHR DaaSs provided by several health actors, the user (e.g. physicians) needs an assistance to discover, select and compose the required EHR DaaSs. Therefore a solution is needed to select and compose EHR DaaS automatically for the purpose of retrieving and integrating the EHR data, which is the problem we tackle in this paper. This problem is very complex since EHR DaaSs are developed by independent organizations that may use different standards to describe their data. To address these challenges, we propose a declarative query-rewriting based approach for the automatic composition of EHR DaaSs. The key idea behind the approach is to describe DaaSs as views over medical ontologies to capture their semantics in a declarative way. Defined views are then used to annotate the EHR DaaSs’ description files (e.g. WSDLs) and exploited to compose DaaSs automatically. They are also used in resolving (on the fly) the semantic conflicts
of data exchanged inside DaaS compositions. Our composition framework is based on an RDF query rewriting algorithm [6] inspired by the mature research work done in the data integration area [17]. The rest of paper is organized as follows: Section 2 provides a motivation example, highlights the challenges addressed in this paper and describes our contribution. Section 3 provides some background knowledge about EHR data and EHR DaaS. Section 4 outlines our service-oriented approach for EHR DaaSs composition. We also present in this section our results in terms of models, which include an ontology model, a model for services (EHR DaaS and mediation services) and a conjunctive query model. In Section 5, we introduce the generic algorithmic solution for query processing, which includes a query rewriting approach for EHR DaaS composition and the automatic invocation of mediation services for the resolution of semantic conflicts. Section 6 shows the system implementation and Section 7 is devoted to related works. Section 8 summarizes the results obtained in this work and discusses some possible extensions.
2 Motivation, Challenges and Contributions
In this section, we provide an illustrative example in which the information needs of health actors are satisfied with a service-oriented approach. This approach raises many problems, which motivate our proposal to apply semantic Web technologies to support EHR DaaS composition.

2.1 Motivation Example
Let us consider an e-health system exporting the set of EHR DaaSs in Table 1 to query patient data. The description of each EHR DaaS can be seen in Table 1, where the symbols “$” and “?” denote inputs and outputs of EHR DaaSs, respectively. We assume that a physician wants to consult the laboratory test results for his patient, “Joe, a 35-year-old man”. Laboratory test results are helpful tools for evaluating the health status of an individual. In each laboratory test order, we find several tests (cholesterol rate, bilirubin rate, etc.). The physician submits the following query, as shown in Figure 1: Q1: “What are the pathologies indicated by the results of the laboratory tests of Joe?” For the sake of simplicity, we assume that the e-health system does not provide a functionality (service location record) to find the EHR DaaSs providing Joe’s health data. The physician therefore has to invoke the EHR DaaSs that provide the recent laboratory tests ordered for Joe, namely S11 or S12. He will not invoke S13 or S14, which return the tests ordered by gynecology and paediatric specialists, respectively. He will also not invoke S15, because this service returns the laboratory tests that were ordered but aborted for a patient. After invoking S11 and S12 he will obtain the list of recent and successfully completed laboratory tests ordered for Joe, together with the names of the laboratories in charge of performing the tests. Then he will invoke S21 and S22 to retrieve the results of the laboratory tests ordered for Joe and performed by laboratory 1 and laboratory 2, respectively.
Table 1. Example of Electronic Health Record Data-as-a-Services

Service           Functionality                                     Constraints and DaaS provider                      Employed health standard
S11($x, ?y)       Returns laboratory tests y ordered for a          DaaS provider is hospital 1
                  given patient x
S12($x, ?y)       (same as S11)                                     DaaS provider is hospital 2
S13($x, ?y)       (same as S11)                                     Patient gender (woman); DaaS provider is
                                                                    maternity hospital
S14($x, ?y)       (same as S11)                                     Patient age (< 14); DaaS provider is
                                                                    paediatric private hospital
S15($x, ?y)       Returns laboratory tests y aborted for a          DaaS provider is hospital 1
                  given patient x
S21($x, ?y, ?z)   Returns the name y and the value z of a           z.unit (unit of measure) is mg/l;                   y.code ∈ {LOINC}
                  given test belonging to lab test order x          DaaS provider is laboratory 1
S22($x, ?y, ?z)   (same as S21)                                     z.unit (unit of measure) is mmol/l;                 y.code ∈ {SNOMED}
                                                                    DaaS provider is laboratory 2
S23($x, ?y, ?z)   (same as S21)                                     DaaS provider is laboratory 3
S3($x, ?y, ?z)    Returns low reference value y and high            z.unit and y.unit (unit of measure) is mg/l         x.code ∈ {SNOMED}
                  reference value z for a given lab test x
S4($x, ?z)        Returns indicated disease z for an abnormal                                                           z.code ∈ {ICD} and
                  value of lab test x                                                                                   x.code ∈ {SNOMED}
Furthermore, according to his own experience, the physician will not invoke S23 because of the inferior quality test results returned by the laboratory 3. After that, he will invoke S3 which returns the reference or normalized values for each laboratory test parameter, in order to compare with the values returned by S21 and S22 . Then, if he found any suspicious values he invokes S4 to retrieve the pathology indicated by each abnormal value. The list of pathologies returned by S4 will indicate to him the pathologies of Joe may suffer from, and for which a treatment must be applied or another investigation is needed. It is necessary to mention that during the comparison between the values of the test results returned by S21 or S22 and the references values returned by S3 , the physician must operate a conversion between value units ( mmol/l and
mg/l)1, because each piece of data is interpreted differently. Also, the physician may need to convert data exchanged between the selected services. For example, he has to change the laboratory test code returned by S21 (codified using the LOINC standard) into a code acceptable by S3 (codified using the SNOMED standard). In short, the physician needs to discover and select services, to invoke them in a certain order, to make sure that the parameters of the services are compatible, to consolidate the results returned by each EHR DaaS, and to manually perform an ordered set of operations like joins, selections and projections.
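The conversions that the physician performs by hand can be made concrete with a small sketch. The code below is illustrative only: the molar mass, the code identifiers and the mapping table are invented placeholders, not values taken from LOINC, SNOMED or any terminology service.

```python
# Converting a laboratory value between mmol/l and mg/l requires the analyte's
# molar mass (mg per mmol); the conversion factor therefore differs per test.
def mmol_per_l_to_mg_per_l(value_mmol_l, molar_mass_mg_per_mmol):
    return value_mmol_l * molar_mass_mg_per_mmol

def mg_per_l_to_mmol_per_l(value_mg_l, molar_mass_mg_per_mmol):
    return value_mg_l / molar_mass_mg_per_mmol

# Hypothetical cross-terminology mapping (real mappings would come from a
# terminology service such as the UMLS, not from a hard-coded table).
LOINC_TO_SNOMED = {"loinc:1234-5": "snomed:987654"}  # placeholder identifiers

def to_snomed(loinc_code):
    return LOINC_TO_SNOMED[loinc_code]

# A result as returned by S22 (value in mmol/l) prepared for comparison with
# the mg/l reference values returned by S3, which also expects SNOMED codes.
result = {"code": "loinc:1234-5", "value": 4.8, "unit": "mmol/l"}
s3_input = {
    "code": to_snomed(result["code"]),
    "value": mmol_per_l_to_mg_per_l(result["value"], molar_mass_mg_per_mmol=386.6),
    "unit": "mg/l",
}
print(s3_input)
```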
Fig. 1. Physician query scenario
2.2 Challenges
As shown previously, the manual process of composing EHR DaaSs for answering a query is painful and tedious; it may not possible for non expert users ( eg. physicians, users...) to compose EHR DaaSs. Thus, automating the composition of EHR DaaS raises the following challenges: 1. Understanding the Electronic Health Record Data-as-a-Service Semantics: For the physician, confusion occurs in correctly understanding the functionalities provided by several EHR DaaSs. For instance, EHR DaaS like S11 and S15 have the same signature (input, output) but provide different functionalities, the former provides the laboratory test ordered for patient, whereas the latter provides the laboratory tests aborted for patient. Also, with several similar EHR DaaSs that offer the same functionality (like S11 or S12 or S13 ) but that are subject to different constraints on data (patient age, patient gender,. . . ), the physician must manually choose the ones that contribute to answering the query. Thus, the semantic annotation of EHR DaaS description files with the relationship between input and output on the one hand, and its content on another hand, will contribute mainly to automatic processing of EHR DaaSs discovery and composition and will spare the physician from selecting between several EHR DaaSs manually. 1
1 In laboratory test results, mmol/l (a molar concentration) and mg/l (a mass concentration) are units used to express analyte concentrations.
2. Electronic Health Record Data-as-a-Service discovery and composition: The physician must select the services that are relevant to his query and compose them in the right order. 3. Data level conflicts detection and resolution in Electronic Health Record Data-as-a-Services composition: EHR DaaS parameters use concepts from different health ontologies such as SNOMED2 , ICD3 , LOINC4 , UMLS5 , ICPC6 ...etc. For instance, S21 provides laboratory test coded in LOINC, but S3 requires the laboratory test result specified using SNOMED terminology. Also, S22 returns test result measured in mmol/l but S3 measured its test result in mg/l. Consequently, the data will be passed during composition from S21 and S22 to S3 will provoke an incompatibility problem. Then, how to leverage semantic conflict (health ontologies concept, unit,....etc) to enable the unrestricted composition of EHR DaaSs in a generic EHR environment becomes another challenge. 4. Data trustworthiness: Beside the semantic description of EHR DaaS, the physician need to have a mean to ensure the consistency and trustworthiness of data that are returned by different EHR DaaSs. Trustworthiness depends a lot on where data came from and which parties were involved in the establishment, change and forwarding of the data. 2.3
Contributions
Among the above mentioned challenges, we will focus only on the first three ones and we let the last one for future work. In a nutshell, we propose a new approach to integrate EHR data provided by several EHR DaaSs using a two-layer mediated ontology. The first layer, named Generic Domain Ontology, forms the core ontology (core concepts and relations) and it is the same for all EHR DaaSs providers. This layer is used to automate the composition of EHR DaaSs. The second layer, named Specific Domain Ontology, represents the (contextual) semantics (measuring units, scales, etc) of the data attached to EHR DaaS (called input/output parameters) and is used to detect and resolve the semantic conflicts of data exchanged among composed services. The use of this two-level ontology allows deriving automatically the EHR DaaSs compositions that incorporate necessary mediation services (to carry out data conversion between interconnected services) during EHR DaaS composition. Our main contributions in the paper are summarized as follows: – Firstly, we handle the first challenge by proposing an RDF-based modeling for EHR DaaSs. Specifically, we represent EHR DaaSs as RDF views over a mediated ontology. RDF views allow capturing the semantics of a DaaS in 2 3 4 5 6
2 SNOMED: The Systematized Nomenclature of Medicine.
3 ICD: International Classification of Diseases.
4 LOINC: Logical Observation Identifiers Names and Codes.
5 UMLS: Unified Medical Language System.
6 ICPC: The International Classification of Primary Care.
a declarative manner using concepts and relations whose semantics are formally defined in ontologies. We adopt SPARQL, the de facto query language for the Semantic Web, for posing queries over EHR DaaS services. – Secondly, we propose a query rewriting approach to automatically select and compose the EHR DaaSs. In our approach composition queries, specified as SPARQL queries over a mediated ontology, are reformulated in terms of available EHR DaaSs based on the defined RDF views. Query reformulations are then translated into composition plans (i.e. orchestrations) that can be executed to answer the posed queries. – Thirdly, since the data provided and required by individual EHR DaaSs may be bound to different (contextual) semantics (e.g. measuring units, scales, etc), we propose a mechanism that inserts automatically mediation services in compositions in order to resolve the semantic incompatibilities detected in the generated EHR DaaS compositions.
3 Background
In this section, we describe the features of EHR data published by EHR DaaSs. We also look at the relevant aspects of EHR data integration using Web service technology and the standards that EHR systems utilize nowadays. Furthermore, we look at the query rewriting approach for view based data integration, which is utilized in this work for EHR DaaS composition. 3.1
Electronic Health Record Standards
There have been various definitions of EHRs. According to the Healthcare Information and Management Systems Society7 “EHR is a longitudinal electronic record of patient health information generated by one or more encounters in any care delivery setting”. A comprehensive survey and analysis of the electronic healthcare record is available in [14]. EHR data is stored in many kinds of systems and proprietary formats, inducing different internal structures. This situation leads to severe interoperability problems in the healthcare informatics domain. For this purpose, several EHR standards have been developed in order to structure the EHR data for the purpose of exchange. The standardization effort focused around two considered areas. – The structuralization of EHR documents for the exchange of clinical data by supporting meaningful information representation between clinical information system within or between health care organizations. These standards include Health Level 7 (HL7) Clinical Document Architecture (CDA) [13], openEHR [35] and Cross-Enterprise Document Sharing (XDS) integration profile IHE8 , .. etc. 7 8
7 HIMSS, http://www.himss.org
8 IHE, “Integrating the Healthcare Enterprise”, http://www.ihe.net
– The health ontology used to represent the EHR data. To name a few examples of health ontology that model parts of the medical domain : ICD, LOINC, SNOMED, UMLS,..etc. In sum, these approaches focus on the way of accessing the data rather than standardizing the data itself. A common feature of all emerging EHR standards is that the clinical concepts are modeled and expressed independently from how the data is actually stored in underlying data source. This challenge necessitated to: – Select an appropriate technological infrastructure for making EHR data available at the point of care when authorized users need. In this context, Web services technology has been largely applied in healthcare domain by encapsulating existing EHR data within the Web service model and providing access to clinical data in a standard way. – Adopt a novel modeling approach namely two-level modeling[14][7]. Twolevel modeling approach in EHR system development divides the EHR data models into two separate ones. A generic information model and domain knowledge model. The domain knowledge model contains a set of constraints model ((simple and complex type), internal consistency (type, interval values, scale, unit, range), Reference Data (XML Format, health ontology)) on instance of the generic model entities. 3.2
Electronic Health Record Data Integration and Web Services
As explained previously, the introduction of Web service technology is motivated by the need to encapsulate the patient data in case of a EHR data exchange with another organization to perform a specialized medical procedure or for continuation of care. During this exchange, the execution of many operations are performed on EHR data before their exchange that concern (see figure 2). – Retrieving relevant authorized patient data from the health organization information system, for instance : “Problem list”, “past illness”, “medication use”, “present illness”, “Family history”, “Past surgical”, “allergies”,..etc. – Coding this data using numerous standards that support ontological control at instance and type level by interlinking such health ontology (ICD10, SNOMED,...) with the data definitions in standardized EHR documents (HL7,..etc) [24]. Using the coded data for creating a EHR document complaint XML clinical model ( HL7/CDA, ...etc) ; – Sending this document as SOAP encoded message to an appropriate destination. However, the problem of EHR data integration is central in these systems. These latters deal with patient data at the document level, but health care data usage often is data centric, meaning that data should be extracted from various documents and then integrated according to specific criteria. As depicted by [24], even with a service approach, many interoperability problems still arise during EHR data integration.
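The publication steps listed above (retrieve, code, wrap as a document, send) can be sketched as follows. The element names, the local code and the constraint model are invented for illustration; the snippet does not follow HL7 CDA or any other concrete EHR document standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraint model (the "domain knowledge" level): expected unit
# and admissible value range for one kind of entry.
CONSTRAINTS = {"lab-result": {"unit": "mg/l", "min": 0.0, "max": 10000.0}}

def retrieve_entries(patient_id):
    # Stand-in for a query against the local health information system.
    return [{"kind": "lab-result", "code": "local:chol", "value": 1850.0, "unit": "mg/l"}]

def validate(entry):
    c = CONSTRAINTS[entry["kind"]]
    return entry["unit"] == c["unit"] and c["min"] <= entry["value"] <= c["max"]

def to_document(patient_id, entries):
    root = ET.Element("ehr-extract", attrib={"patient": patient_id})
    for e in entries:
        if not validate(e):
            continue  # drop entries that violate the constraint model
        item = ET.SubElement(root, "entry", attrib={"code": e["code"], "unit": e["unit"]})
        item.text = str(e["value"])
    return ET.tostring(root, encoding="unicode")

# The resulting XML fragment would then be sent, e.g. inside a SOAP message.
print(to_document("patient-42", retrieve_entries("patient-42")))
```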
Fig. 2. Electronic Health Record data, publication through Electronic Health Record Data-as-a-Service
First, matching clinical data to codes in concurrent and semantically overlapping health ontology causes severe interoperability problems. Second, the semantic integration of heterogeneous systems in healthcare will have to deal with volatile medical concepts. For this reason, there is not, until now, a unique and comprehensive ontology of the medical domain [1] [30] [24]. In order to solve this problem, EHR data integration systems have to consider two levels; (1) generic information model and (2) domain knowledge (vocabularies, terminologies,..). These two levels must inter operate to integrate EHR data from disparate healthcare systems [33]. As every standard has its information reference model upon which domain knowledge is built, we will focus only on features of domain knowledge where EHR complaint XML documents at entry or section level are characterized by the frequent association with: – Ontological concepts defined in some health ontologies (ICD, LOINC, SNOMED) for coding EHR; – Semantic and structural constraint for maintaining internal consistency of EHR data; Thus, the additional knowledge provided by the reference models upon which domain knowledge is built will not be addressed in this work. 3.3
Query Rewriting
The query rewriting problem has been extensively studied in the areas of query optimization and data integration. In the hereafter we report some definitions (based on the works [17,18]) to terms we use extensively throughout the paper.
104
I.A. Amarouche et al.
Query Rewriting: Given a query Q and a set of view definitions V = {V 1, ..., V m}, the query Q is an rewriting of Q using V if: Q is contained in Q and Q refers only to the views in V . Query Containment and Equivalence: Query containment and equivalence enable comparison between different rewriting of a query. It will be used when we test the correctness of a rewriting of a query in terms of a set of views. Thus, a query Q1 is said to be contained in a query Q2, denoted by Q1 ⊆ Q2, if for all databases D, the set of tuple computed for Q1 is a subset of those computed for Q2, i.e., Q1(D) and Q2(D). The two queries are said to be equivalent if Q1 ⊆ Q2 and Q2 ⊆ Q1 or Q1 ≡ Q2. Maximally Contained Rewriting: Equivalent rewritings may not always exist under the open world assumption. Finding the maximally-contained rewriting will be the only alternative for resolving a query. Thus, Let Q be a query, V = {V 1, ..., V m} be a set of view definitions, and L be a query language. The query Q is a maximally-contained rewriting of Q using V with respect to L if: Q is a query in L that refers only to the views in V , Q is contained in Q, and there is no rewriting Q1 ∈ L, such that Q ⊆ Q1 ⊆ Q and Q1 is not equivalent to Q.
4
Overview of the Approach
In this section we define our reference architecture for EHR DaaSs composition that is independent from specific data standards. 4.1
General Architecture
Our reference architecture defines four logical tiers, as shown in Figure 3. – Data Level: The lowest level of the architecture contains information stored in different components. These components can be databases that store all the medical information concerning patients or documents that preserve all official documents generated during healthcare process. Also there are several clinical terminological and documentary resources that provide means to search and share clinical knowledge. – Service Level: The service level publishes the different services provided by several systems to e-health actors. Services are either simple (one provider) or complex (multiple providers). This level provides two services categories. 1. Electronic Health Record Data-as-a-Services provide information about patients. We can find two kinds of EHR DaaS in this category according to the nature of the data provided: EHR DaaSs that provide specific patient information (diseases, symptoms, medications or family history and so on) or EHR DaaSs that retrieve a clinical document complaint model (discharge summaries,...).
EHR Data-as-a-Services Composition Based on Query Rewriting
105
interface Level
User ( ontology browser , query interface)
Query interface
Result set
Mediated level Mediated ontology
EHR DaaS composition system Generic Domaine Ontology
Specific domain Ontology
Query rewriting module
Service selector module
Conflict detection and resolution
conversion fucntions repository
Service registry
Plan query execution
DaaS discovery and invocation
Service level DaaS semantic annotation
Mediation services
EHR DaaS DaaS12 DaaS11
Mediation service1
DaaS13
Mediation Service2
DaaS14
Data level
Data source
database
Documents
Xml document
Terminolgical ressource
Fig. 3. Overview of the Electronic Health Record Data-as-a-Service composition architecture
2. Mediation Services is used mainly for mapping and converting the output parameter of a specific EHR DaaS to the input parameter of a subsequent EHR DaaS during service composition. Mediation services provide the definition of cross-mappings between terminologies (e.g; UMLS Terminology Services) and extract from a EHR DaaS output parameters what the inputs of subsequent EHR DaaS need. These services advertise their WSDL definitions into a service registry. WSDL provides an XML-based grammar for describing a service interface. For automatic discovery, selection and composition of service, WSDL files are annotated with semantic entities from a mediated ontology. The service registry includes a set of services descriptions (WSDL files) semantically annotated with RDF views expressed in term of mediated ontology as in [6]. – Mediated Level: The mediated level is composed of two modules: 1. Mediated ontology: The mediated ontology contains all the concepts and relations defined in EHR domain. It will be used to annotate and query services (EHR DaaSs, mediation services) in an environment of heterogeneous inter-working health information systems. We divide the ontology into two ontological levels which cuts the concept space into a
106
I.A. Amarouche et al.
generic domain ontology and a set of extensions named domain specific ontologies. 1-1. The Generic Domain Ontology (GDO) defines the generic concepts and relations covered the EHR domain and it is the same for all EHR DaaSs providers. For example, patients, disease, laboratory test are entities of the generic domain ontology. The generic ontology is different from a specific ontology in that it only contains basic shared concepts and their properties. For example, the generic ontology states that a laboratory-test-result has a unit and a code, without specifying any specific unit or health ontology code (this information will be specified using the SDO detailed later). Doing so, the GDO is used mainly for EHR DaaS discovering and composition. 1-2. The Specific Domain Ontology (SDO) is created mainly to allow the semantic extension of generic ontology concepts for detecting and resolving semantic data conflicts. For example, the SDO states that a Laboratory-test-code or Disease-code are expressed in SNOMED or LOINC; or laboratory-test-value has mg/l as a unit of measure. 2. Electronic Health Record Data-as-a-Service composition system: Contains four sub-modules: a service selector module, a query rewriting module, a conflict detection and resolution module and a query plan execution module. The first module receives the query from the user interface and analyzes it based on the mediated ontology for discovering the appropriate EHR DaaSs. The second module receives the set of semantic descriptions of discovered EHR DaaSs and applies a query rewriting algorithm that generates a set of valid and executable of EHR DaaS compositions. The third one iteratively processes each rewriting previously generated in order to detect incompatibilities (semantic conflicts at the data level) and invokes the appropriate mediation services for their resolution. Mediation services offer the conversion functions defined in the conversion repository and referenced by SDO concepts. The last module arranges the selected EHR DaaSs along with the added mediation services in a composition plan which will be executed to return the results to users. – Interface Level: The aim of this layer is to provide the interface for user whereby he can perform a query and receive results sets. 4.2
Models for Electronic Health Record Data-as-a-Services Composition
In this section, we propose models to address the issues related to query processing (query rewriting and conflict resolution) for EHR DaaSs composition. First, we formalize the notion of mediated ontology, with the introduction of generic
EHR Data-as-a-Services Composition Based on Query Rewriting
107
and specific domain ontologies, which are useful as a support for semantic-aware querying and annotation of EHR DaaSs. Then, we propose a model for representing conjunctive queries over a mediated ontology. Finally, we develop models for EHR DaaS and mediation services. Mediated Ontology Mediated ontology includes two ontologies, namely, the generic and specific domain ontologies which have ‘GDO’ and ‘SDO’ as namespaces for their respective concepts. Such ontology should be defined by domain experts and specified using RDF/RDFS. The generic and specific ontologies models are inspired from [6] [31] [25]. In order to provide a precise semantic annotation for our service model we use these two models. Definition 1. (Generic Domain Ontology) : A RDFS generic ontology is 6tuple < C,D,OP, DP, SC, SP> where – C is a set of classes. – D is a set of data types. – OP is a set of object properties. Each object property has its own domain and range in C. – DP is a set of data type properties. Each data type property has a domain in C and range in D. – SC is a relation over C × C, representing the sub-class relationship between classes. For example C2 SC C1 expresses that C2 is subclass of C1 . – SP is a relation over (OP × OP ) ∪ (DP × DP ), representing the subproperty relationship between homogeneous properties. For example DP2 SP DP1 means that DP2 is a sub-property of DP1 . Figure 4 depicts the Generic Domain Ontology, in which class nodes are represented by ovals and data type nodes are represented by rectangles. In the GDO ontology, the GDO:Patient class is a core concept that characterizes patient information, such as name, SSN, etc. The GDO:Laboratory-test-order class captures information on laboratory tests ordered for a patient. It is related to the GDO:Patient class through the object property GDO:Has-order. The GDO:Laboratory-test-Result class captures results of laboratory tests ordered for patient and is related to the GDO:Laboratory-test-order class via the GDO:Has-Result object property. Individuals of GDO:Laboratory-test-result such as LDL, AST, ALT, TotalBilirubin, etc. are subclasses of the GDO:Panel class. The individuals that belong to the GDO:Laboratory-test-Result class may be related to multiple panels and each panel has several laboratory tests. The GDO:Disease class characterizes the disease which can be indicated by patient GDO:Laboratory-test-result and is related to the laboratory test class via the GDO:indicate object property.
I.A. Amarouche et al.
Fig. 4. Generic Domain ontology
Definition 2. (Specific Domain Ontology) : A RDFS specific ontology is 3 tuple < Cg , Ci , τ >, where: – Cg is a set of concepts that represent the different conflictual aspects of a generic concept in Generic Domain Ontology (GDO). Each Cg has a name and a set of specialized concepts (i.e. sub concepts); the name represents a conflictual aspect of the associated generic concept. In the example depicted in Figure 5, SDO:Laboratory-test-code and SDO:disease-code are Cg concepts. – Ci is a distinct set of concepts having the same super-concept Cg . By definition, Ci are not allowed to have sub-concepts. The properties of Ci are defined as follows : • name of concept. • Seqno is the property that represents the sequence number of a Ci concept among its siblings. • A couple of properties reference the conversion functions between orderly organized object of Ci . The function name denotes the conversion from Ci to subsequent or precedent sibling, for instance snomed-to-loinc, loinc-to-snomed or mg/l-to-mmol/l, as it follows the mapping direction. Supported conversions between sibling subclasses are n −→ 1 and 1 −→ 1. – τ refers to the sibling relationships on Ci and Cg . The relationships among elements of Cg is disjoint. However elements of Ci of a given Cg have peer relationship. They have similar data semantics, so that conversion or mapping can be performed among them. Let us illustrate this definition with an example in Figure 5. The concept GDO:Laboratory-test-Result in that figure has a conflictual aspect called “unit” that is described as a member of Cg in SDO (i.e. SDO : unit). The
defined concept SDO:unit is linked to GDO:Laboratory-test-Result via the object property SDO:has-Unit which is also defined in SDO. SDO:unit has different measurement units represented as sub classes Ci = {mg/l, mmol/l, ..., n}. The code is also a conflictual aspect to both GDO:Laboratory-test-result and GDO:Disease concepts; i.e. codes can be represented differently in different health ontologies using Ci = {SN OM ED, ICD, ..., n}. Note that, we can use an rdfs:collection to denote the sequence relationships between elements of Ci and typical processing will be to select one of the members of the container. GDO GDO: Disease
Fig. 5. Specific Domain ontology
Conjunctive Queries: In this paper we address conjunctive queries expressed using SPARQL, the do facto query language for the Semantic Web9 . Definition 3. A conjunctive queries Q has the form: Q(X):-< G(X, Y ), Cq > where : – Q(X) is the head of Q, it has the form of relational predicate and represents the result of query. – G(X,Y) is the body of Q, it contains a set of RDF triples where each triple is of the form (subject. property.object). X and Y are called the distinguished and existential variables respectively. X and Y are subjects and objects in the RDF triples. – Cq = {C1q , C2q , ...., Cnq } is a set of constraints expressed on X and Y variables in terms of traditional intervals or arithmetic expression like xθconstant , yθconstant and where θ ∈ {<, >≤, ≥} . 9
SPARQL : http://www.w3.org/TR/rdf-sparql-query/
In our work, queries are formulated in SPARQL and use concepts from the mediated ontology (GDO ontology) and properties from the specific ontologies (SDO ontologies). Thus, a query can be seen as a graph with two types of nodes; class and literal nodes. Class nodes refer to classes in the ontology. They are linked via object properties. Literal nodes represent data types and are linked with class nodes via data type properties. Figure 6 depicts the RDF graph of the query Q1 described in our scenario. The graph shows that Q1 has four class nodes P , LO, LR, D linked by object property GDO:has-order(P,LO), GDO:has-result(LO,LR) and GDO:indicate(LR,D) respectively. Q ($w1,?z1,?y1,?z2) :GDO:Lab-test-Order (?P Rdf:type GDO:Patient) GDO:Patient (?P GDO:hasSSn $w1) rdf:type (?P GDO:Order ?LO) rdf:type (?LO Rdf:type GDO:lab-test-Order) GDO: hasSSN LO (?LO GDO:hascode $y) P GDO: hascode $w1 GDO: Has-order (?LO GDO:has-result ?LR) (?LR Rdf :type ?Lab-test-result) ?y Has-result (?LR GDO: hasvalue ?y1) rdf:type (?LR GDO:hasminvalue ?vi) GDO:Lab-test(?LR GDO:hasmaxvalue ?vm) GDO:indicate LR rdf:type result (?LR GDO:indicate ?D) D GDO:disease (?LR GDO:hasCode ?C) GDO:hasminvalue GDO:hascode (?C Rdf:type SDO:Lab-code) GDO : hasname GDO : hascode GDO:hasmaxvalue GDO:hasvalue (?C GDO:codeValue ?z1) (?D rdf:type ?disease) A ?vm C ?y1 ?vi ?z2 (?D GDO:hascode ?A) rdf:type GDO:codeValue rdf:type (?A rdf:type ?SDO:diseaseCode) GDO:codeValue SDO: Disease (?A GDO:codeValue ?w2) SDO:Lab-code ?z1 Code (?D GDO:hasname ?z2) ?w2 ?y1 > ?vi ?y1 < ?vm
Fig. 6. Query in the running example
Electronic Health Record Data-as-a-Service model We deem appropriate to follow the work of [6,34,10] to formalize the modeling of EHR DaaS as RDF views over a mediated ontology. Definition 4. EHR DaaS Sj is described as view in a Datalog-like notation over a GDO and SDO thus Sj model is : Sj ($Xj , ?Yj ) : − < Gj (Xj , Yj , Zj ), Coj > |αXj , αYj where: – Xj and Yj are the sets of input and output variables of Sj , respectively, – Gj represents the functionality of the EHR DaaS which is described as a semantic relationship between input and output variables. – Zj is the set of existential variables relating Xj and Yj . – Coj = {Coj1 , ..., Cojn } is a set of constraints expressed on Xj , Yj or Zj variables like xθconstant and yθconstant where θ ∈ {<, >≤, ≥}. – αXj and αYj , named adornment, are a set of RDF triplets describing the semantic (ontological reference, unit...etc) or domain expression of Xj and Yj respectively. Each adornment α is indicated by the 2-tuple; < Cg , Ci > where : Cg : is an SDO concept that represent the different conflictual aspects Xj and Yj ); Ci : is a concept from SDO inherited from Cg .
An EHR DaaS model is described over the GDO and adorned with entities from the SDO. As an EHR DaaS is modeled uniquely over the entities of the GDO, it does not provide explicit semantics about its input and output parameters, so we extend its description with additional information describing more precisely how the semantics of the GDO concepts are expressed according to the SDO ontology. Then, each EHR DaaS model is expressed as an adorned query [10]. The adornment is an annotation on the variables appearing in the input and output parameters of a given EHR DaaS and is expressed in terms of the SDO. Figure 7 gives the RDF views of the EHR DaaS services S21 and S22 of Table 1, together with their adornments:

S21($z, ?y) :-
(?LO rdf:type GDO:Lab-test-Order)
(?LO GDO:hascode ?z)
(?LO GDO:has-result ?LR)
(?LR rdf:type GDO:Lab-test-Result)
(?LR SDO:hascode ?C)
(?C rdf:type SDO:LabCode)
(?C rdf:type SDO:LOINC)
(?C SDO:codeValue ?y)

S22($z, ?y) :-
(?LO rdf:type GDO:Lab-test-Order)
(?LO GDO:hascode ?z)
(?LO GDO:has-result ?LR)
(?LR rdf:type GDO:Lab-test-Result)
(?LR SDO:hascode ?C)
(?C rdf:type SDO:LabCode)
(?C rdf:type SDO:SNOMED)
(?C SDO:codeValue ?y)
Fig. 7. Electronic Health Record Data-as-a-Service model
Mediation Service Model. Mediation services are also represented with an EHR DaaS model (expressed in terms of the SDO ontology), whereas their adornments are described as a set of RDF triples that define, in a declarative way, the conversion function between pairs of SDO:Ci subconcepts of the same SDO:Cg concept. We remind the reader that the different SDO:Ci are organized as an ordered list, hence a conversion from one to another is always a concatenation of conversion functions.

Definition 5. A mediation service Sj is modeled as follows: Sj($Ij, ?Oj) :- < Gj(Ij, Oj, Zj) > | αFunc where:
– $Ij defines the input parameter required for using the mediation service;
– ?Oj defines the output parameter returned by the mediation service;
– Zj represents variables or constants generated or required during the conversion;
– αFunc represents the conversion function from SDO:Ij to SDO:Oj.

Figure 8 illustrates the RDF view of a mediation DaaS service SLOINC-SNOMED utilized for converting a lab code between S21 and S3.
S21($z, ?y) :-
(?LO rdf:type GDO:Lab-test-Order)
(?LO GDO:hascode ?z)
(?LO GDO:has-result ?LR)
(?LR rdf:type GDO:Lab-test-Result)
(?LR SDO:hascode ?C)
(?C rdf:type SDO:LabCode)
(?C rdf:type SDO:LOINC)
(?C SDO:codeValue ?y)

SLOINC-SNOMED($y, ?x) :-
(?LR SDO:hascode ?C)
(?C rdf:type SDO:LabCode)
(?C rdf:type SDO:LOINC)
(?C SDO:codeValue ?y)
(?LR SDO:hascode $A)
($A rdf:type SDO:LabCode)
($A rdf:type SDO:SNOMED)
(?A SDO:codeValue $x)
(SDO:LOINC SDO:hasfunct loinc-snomed)

S3($x, $y, ?z) :-
(?LR rdf:type GDO:Lab-test-Result)
(?LR GDO:hasminvalue ?y)
(?LR GDO:hasmaxvalue ?z)
(?LR SDO:hascode $A)
($A rdf:type SDO:LabCode)
($A rdf:type SDO:SNOMED)
(?A SDO:codeValue $x)
Fig. 8. Mediation Service model
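To make the role of such a mediation service more concrete, the following is a minimal sketch (not the paper's actual implementation) of the conversion logic behind SLOINC-SNOMED; the code mapping table is purely hypothetical and stands in for a real terminology mapping source.

```python
# Hypothetical LOINC -> SNOMED mapping; a real mediation service would query a
# terminology server or a curated mapping table instead of a hard-coded dict.
LOINC_TO_SNOMED = {
    "loinc-0001": "snomed-0001",
    "loinc-0002": "snomed-0002",
}

def s_loinc_snomed(loinc_code: str) -> str:
    """Mediation DaaS S_LOINC-SNOMED: input $y (a LOINC code), output ?x (a SNOMED code)."""
    try:
        return LOINC_TO_SNOMED[loinc_code]
    except KeyError as exc:
        raise ValueError(f"no SNOMED equivalent known for LOINC code {loinc_code}") from exc

# The output of S21 (a LOINC-coded lab test) is converted before being fed to S3,
# which expects SNOMED-coded input.
print(s_loinc_snomed("loinc-0001"))
```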
5 Query Processing for Electronic Health Record Data-as-a-Services Composition
In this section, we outline the development of query processing for EHR DaaS composition, and we give a detailed description of its key phases, namely query rewriting and conflict detection and resolution.

5.1 Query Processing Phases
The complete query processing steps are depicted in Figure 9. They include four phases: first, query formulation and service discovery; second, query rewriting; third, conflict detection and resolution; and finally, query execution and result restitution.
1. Query formulation and service discovery: in (1) and (2) the user issues SPARQL queries in terms of the mediated ontology. Then, in (3) and (4), the service selector discovers from the service registry the EHR DaaSs that partially or completely match the query entities (class nodes, object property nodes).
2. Query rewriting: in (5), the query posed over the mediated ontology is rewritten into a query that refers directly to the set of discovered EHR DaaSs. The query rewriting module of the EHR DaaS composition component uses an approach in the spirit of the “bucket algorithm” [18], which returns the maximally contained rewritings of a query. The algorithm computes, for each query entity (class node, object property node), called a bucket or subgoal in the bucket algorithm, the EHR DaaSs that are relevant to it. Thus,
(Components of Figure 9: query interface, mediated ontology, service registry, DaaS selector, query rewriting module, conflict detector and resolver, execution engine and result restitution module, organized over the interface, mediated, service and data levels.)
Fig. 9. Electronic Health Record Data-as-a-Services composition process
the rewriting is constructed by combining one element of every bucket. Candidate solutions generated by the query rewriting algorithm must then be validated by testing conjunctive query containment.
3. Conflict detection and resolution: in (6), each combination generated by the query rewriting module may encompass semantic conflicts at the data level (7); these are detected by comparing the outputs and inputs of subsequent EHR DaaSs in each rewriting. A conflict is resolved by inserting a call to a mediation service (8, 9). Thus, besides EHR DaaSs, mediation services are added to each query rewriting combination to resolve conflicts.
4. Query execution and result restitution: in (10, 11, 12), orchestrating the generated conflict-free composite service requires a translation into an execution plan describing the data and control flows. Finally, in (13) and (14), the result restitution module synthesizes the results and returns them to the user through the user interface.
5.2 Query Rewriting
Given a query Q and a set of EHR DaaSs, the query rewriting module rewrites Q as a composition of EHR DaaSs whose union of RDF graphs covers the RDF graph of Q. The query rewriting phase is preceded by a preprocessing step (carried out prior to receiving the composition query) in which the RDF views are extended with the RDFS semantic constraints (i.e., subClassOf, subPropertyOf, domain and range) to obtain a better matching with the composition queries.
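The intuition behind this preprocessing step can be illustrated with a small sketch that avoids a full RDFS reasoner: each view's triple set is extended with the typings entailed by a (here hand-written) subClassOf map, so that a view exposing a subclass can also match a query asking for the superclass. The class names below follow the running example; a real implementation would read the subclass facts from the GDO/SDO ontologies.

```python
# Hand-written rdfs:subClassOf facts (illustrative; normally read from the ontologies).
SUBCLASS_OF = {
    "SDO:LOINC": "SDO:LabCode",
    "SDO:SNOMED": "SDO:LabCode",
}

def extend_view(triples):
    """Add the rdf:type triples entailed by SUBCLASS_OF to an RDF view."""
    extended = set(triples)
    for s, p, o in triples:
        if p == "rdf:type" and o in SUBCLASS_OF:
            extended.add((s, "rdf:type", SUBCLASS_OF[o]))
    return extended

# Fragment of the S21 view: even if the view only declared (?C rdf:type SDO:LOINC),
# the typing (?C rdf:type SDO:LabCode) required by the query is now derived.
s21_fragment = {("?C", "rdf:type", "SDO:LOINC"), ("?C", "SDO:codeValue", "?y")}
print(sorted(extend_view(s21_fragment)))
```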
Our composition query rewriting algorithm [6] has two main phases, detailed as follows:
(1) Finding the Covered Query's Sub-graphs: in this phase the query is compared to the different RDF views to determine the class nodes and the object properties that are covered by the views. The term “covers” means that there is a containment mapping between the class nodes and object properties of Q and those of the views. The result of this step is a table summarizing, for each EHR DaaS, the covered class nodes and object properties. For example, Table 2 shows the covered classes and properties for the services S11, S21, S3, S4.

Table 2. The covered class nodes and object properties for the services S11, S21, S3, S4

Service              Covered classes and properties
S11($w1, ?y)         Patient(w1), has-order(Patient, LabOrder), LabOrder(y)
S21($y, ?y1, ?z1)    LabOrder(y), hasResult(LabOrder, LabResult), LabResult(y1), hasCode(LabResult, LabCode), LabCode(z1)
S3($z1, ?vi, ?vm)    LabResult(vi, vm), hasCode(LabResult, LabCode), LabCode(z1)
S4($z1, ?z2, ?w2)    LabCode(z1), hasCode(LabResult, LabCode), indicate(LabResult, Disease), Disease(z2), hasCode(Disease, DiseaseCode), DiseaseCode(w2)
(2) Composition Generation: in the second phase, the query rewriting module combines the different lines of the generated table in such a way that all the class nodes and object properties of the query are covered by the combination and the combination is executable. A combination is executable if the inputs of the involved services are bound or can be made bound by other services (whose inputs are bound). For example, the combination of the services S11, S21, S3, S4 covers the whole set of class nodes and properties, and all the inputs of these services are bound or can be made bound. Therefore the set S11, S21, S3, S4 is considered a valid composition (see Figure 10).
Composition 1: S11 → S21 → S3 → S4. Composition 2: S12 → S22 → S3 → S4.
Fig. 10. Composition generation
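The composition generation step can be pictured with the following sketch (a simplification of the actual algorithm of [6]): starting from the coverage table of Table 2, it enumerates combinations of services, keeps those whose union covers all query entities, and checks executability by propagating bound inputs. Entity and parameter names follow the running example; the real module works on RDF graphs rather than on plain strings.

```python
from itertools import combinations

# Coverage table (cf. Table 2): service -> (input params, output params, covered entities)
SERVICES = {
    "S11": ({"w1"}, {"y"},        {"Patient", "has-order", "LabOrder"}),
    "S21": ({"y"},  {"y1", "z1"}, {"LabOrder", "hasResult", "LabResult", "hasCode", "LabCode"}),
    "S3":  ({"z1"}, {"vi", "vm"}, {"LabResult", "hasCode", "LabCode"}),
    "S4":  ({"z1"}, {"z2", "w2"}, {"LabCode", "hasCode", "indicate", "Disease", "DiseaseCode"}),
}

QUERY_ENTITIES = {"Patient", "has-order", "LabOrder", "hasResult", "LabResult",
                  "hasCode", "LabCode", "indicate", "Disease", "DiseaseCode"}
QUERY_INPUTS = {"w1"}   # $w1 is bound by the user (the patient's SSN)

def is_executable(services):
    """A combination is executable if every input can eventually be bound."""
    bound, remaining = set(QUERY_INPUTS), set(services)
    while remaining:
        ready = [s for s in remaining if SERVICES[s][0] <= bound]
        if not ready:
            return False
        for s in ready:
            bound |= SERVICES[s][1]
            remaining.remove(s)
    return True

def valid_compositions():
    for size in range(1, len(SERVICES) + 1):
        for combo in combinations(SERVICES, size):
            covered = set().union(*(SERVICES[s][2] for s in combo))
            if QUERY_ENTITIES <= covered and is_executable(combo):
                yield combo

print(list(valid_compositions()))   # e.g. [('S11', 'S21', 'S3', 'S4')]
```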
5.3 Conflict Detection and Resolution
In this phase we detect and resolve the semantic conflicts in the EHR DaaS compositions generated in the previous phase. This phase includes the following steps:
(1) The localization of conflicts between interconnected EHR DaaSs. Conflicts arise when data elements that have to be exchanged between two interconnected EHR DaaSs are interpreted differently by these services. Conflict localization is accomplished by the conflict detection module. This kind of conflict is named attribute-level incompatibility in the classification of structural and semantic message-level heterogeneities proposed in [29]. According to that work, attribute-level incompatibilities arise when semantically similar attributes are modeled using different descriptions. To detect the conflicts we define rules that express conflicts between EHR DaaSs in RDF, as indicated below:
– let SDO:Ei and SDO:Ri be subclasses of the same conflictual class SDO:Cg, such as SDO:lab-code; thus:
SDO:Ei rdfs:subClassOf SDO:Cg
SDO:Ri rdfs:subClassOf SDO:Cg
– Then, if we have two EHR DaaSs Si and Sj including the concepts SDO:Ei and SDO:Ri, respectively, in their RDF descriptions as adornments, expressed by the following triples:
LR SDO:hascode ?N
?N rdf:type SDO:Ei
?N rdf:type SDO:LabCode
LR SDO:hascode ?A
?A rdf:type SDO:Ri
?A rdf:type SDO:LabCode
then we have a laboratory test code conflict. In sum, the set of conflict types identified in our solution is the set of conflictual concepts SDO:Cg, for instance disease code conflicts or unit conflicts. Other conflict types (e.g., data representation, data precision, etc.) can be added to the SDO in order to resolve more semantic conflicts at the data level. As a consequence, this step provides a set of conflict objects, where each conflict object is identified by the 3-tuple < O(Si), I(Sj), Cg >, in which O(Si) is an adorned output parameter of a source EHR DaaS Si, I(Sj) is an adorned input parameter of a target EHR DaaS Sj, and the conflict type is a member of the set of conflictual concepts Cg = {unit, labCode, ...}. To detect the conflicts, Algorithm 1 takes each composition (represented as a directed acyclic graph) and iteratively verifies the rules expressed previously for each parameter (adornment only) exchanged between interconnected services, so as to find all possible conflicts, which are stored in the set of conflict objects.
(2) The conflict objects detected previously are resolved by the automatic invocation of an appropriate mediation service. The latter is identified through:
– its input parameter SDO:O(Si), which is the output of Si;
– its output parameter SDO:I(Sj), which is the input of Sj;
– the conversion function, given as an adornment, defined as a property of SDO:O(Si) and targeting SDO:I(Sj).
Algorithm 1. Semantic conflict detection and resolution
Require: M(i,j), the adjacency matrix of an EHR DaaS combination (possibly with conflicts); i, j, z ∈ N; CO, a set of conflict objects
Ensure: M(i,j), the adjacency matrix of the EHR DaaS combination without conflicts
1: {Conflict detection step}
2: for i = 1 to n do
3:   for j = i + 1 to n do
4:     if M[i][j] = 1 then
5:       if Output.Si and Input.Sj have the same conflictual concept as type but different SDO subclasses then
6:         COz = new conflict object(Output.Si, Input.Sj, conflictual concept)
7:         Add(CO, COz)
8:       end if
9:     end if
10:   end for
11: end for
12: {Conflict resolution step}
13: for each COz in CO do
14:   {according to the conflict object, identify the mapping function (Output.Si, Input.Sj) from the SDO ontology}
15:   M[i][j] = 0 {delete the arc between Si and Sj}
16:   add SK {add the mediation DaaS service SK between Si and Sj}
17: end for
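For concreteness, the following is one possible (simplified) rendering of Algorithm 1 in Python. The adorned parameter types, the conflictual-concept lookup and the choice of the mediation service are represented by plain dictionaries here; in the actual system they come from the SDO ontology.

```python
from dataclasses import dataclass

@dataclass
class ConflictObject:
    source_output: tuple   # (service S_i, SDO subclass of its adorned output)
    target_input: tuple    # (service S_j, SDO subclass of its adorned input)
    conflict_type: str     # conflictual concept C_g (e.g. "labCode", "unit")

def detect_and_resolve(services, M, adornments, mediators):
    """services: ordered service names; M: adjacency matrix of the composition DAG;
    adornments: (service, "in"/"out") -> (conflictual concept C_g, SDO subclass C_i);
    mediators: (C_i_out, C_i_in) -> name of the mediation service to insert."""
    conflicts, plan = [], {s: [] for s in services}
    n = len(services)
    # Conflict detection step
    for i in range(n):
        for j in range(i + 1, n):
            if M[i][j] == 1:
                out_ad = adornments.get((services[i], "out"))
                in_ad = adornments.get((services[j], "in"))
                if out_ad and in_ad and out_ad[0] == in_ad[0] and out_ad[1] != in_ad[1]:
                    conflicts.append(ConflictObject((services[i], out_ad[1]),
                                                    (services[j], in_ad[1]), out_ad[0]))
    # Conflict resolution step: replace the arc S_i -> S_j with S_i -> S_k -> S_j
    for co in conflicts:
        (si, out_ci), (sj, in_ci) = co.source_output, co.target_input
        M[services.index(si)][services.index(sj)] = 0
        plan[si].append(mediators[(out_ci, in_ci)])
    return conflicts, plan

# Running example: a LOINC/SNOMED lab-code mismatch between S21 and S3.
services = ["S11", "S21", "S3", "S4"]
M = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
adornments = {("S21", "out"): ("labCode", "LOINC"), ("S3", "in"): ("labCode", "SNOMED")}
mediators = {("LOINC", "SNOMED"): "S_LOINC-SNOMED"}
print(detect_and_resolve(services, M, adornments, mediators))
```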
As a consequence of this phase, the mediation services SLOINC-SNOMED and Smmol/l-mg/l are added to the first and second EHR DaaS compositions, respectively, to resolve the detected conflicts, as depicted in Figure 11. Afterwards, the obtained conflict-free compositions are translated into execution plans (i.e., orchestrations) describing the data and control flows, as shown in the same figure. For space limitations, we do not detail this step in the paper.
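Although the translation into execution plans is not detailed here, the data flow of the conflict-free Composition 1 can be sketched as a simple sequential orchestration. The `invoke` callback and the shapes of the returned values are placeholders for the real Web service calls.

```python
def orchestrate_composition_1(ssn, invoke):
    """Sequential execution plan for S11 -> S21 -> S_LOINC-SNOMED -> S3 -> S4.
    `invoke(service, **inputs)` stands in for the actual Web service invocation."""
    order_code = invoke("S11", w1=ssn)                          # ?y: the patient's lab-test order code
    value, loinc_code = invoke("S21", y=order_code)             # ?y1, ?z1: result value and LOINC code
    snomed_code = invoke("S_LOINC-SNOMED", y=loinc_code)        # mediation: LOINC -> SNOMED
    vmin, vmax = invoke("S3", z1=snomed_code)                   # ?vi, ?vm: reference interval
    disease_name, disease_code = invoke("S4", z1=snomed_code)   # ?z2, ?w2: indicated disease
    return {"value": value, "range": (vmin, vmax),
            "disease": disease_name, "disease_code": disease_code}
```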
6 Implementation and Evaluation
To illustrate the viability of our approach to EHR DaaS composition, we implemented 411 EHR DaaS Web services on top of a set of medical data sources containing synthetic data about patients, including information such as diseases, medical tests, allergies, medication lists, vaccination records, ongoing treatments, consultations, personal information (e.g., date of birth, sex, etc.), etc. All of these data are usually represented by the commonly used types of the EHR information model. We built a medical ontology based on the building blocks and the data types defined in the HL7 and openEHR standards. The ontology included 81 ontological concepts and 413 properties (i.e., both datatype and object properties). We modeled all services as RDF views over that ontology. These views were used to annotate the description files (WSDLs) of the corresponding DaaS services. We also implemented a set of mediation services; these services were used to convert the values of exchanged data from HL7 to
(Figure 11 summarizes the process on the running example. Composition 1: S11 → S21 → S(LOINC-SNOMED) → S3 → S4, where the mediation service resolves the lab code conflict between S21 and S3. Composition 2: S12 → S22 → S(mmol/l-mg/l) → S3 → S4, where the mediation service resolves the unit conflict between S22 and S3.)
Fig. 11. Conflict detection and resolution
openEHR-specific datatypes (and vice versa). These services allowed, for example, converting between different measurement units, precisions, etc. of medical data. All services were implemented in Java and hosted on a GlassFish Web server. Figure 12 depicts our implemented evaluation system. In that figure, the DaaS Web services layer plays the role of an abstraction layer on top of heterogeneous medical data sources; i.e., medical data located in heterogeneous data sources (e.g., relational databases, silos of data-centric homegrown or packaged applications such as SAP, PeopleSoft or Siebel, XML files, etc.) are all accessed through the same interface, the Web service interface. These services can be composed by the Web Service Management System. The system's users are assisted in formulating their queries (i.e., SPARQL queries) over the ontology.

We tested our system with a set of real-life queries (including that of the running example); examples included: “Q1: check whether the medication ABC, identified by the code “801”, to be prescribed to patient John Doe interacts with the ones currently taken by that patient”; “Q2: for any given social security number X of a patient and a medication code Y representing the medication to be prescribed, verify whether the medications taken by the patient may interact with Y”; “Q3: what are the tests performed by patients that have been administered a given medication?”; etc.

Throughout our tests, we made the following observations: (i) the system was able to process hundreds of services in a reasonable time (411 services in less than one second); the reported time is the time to create the composite services (this involves parsing the WSDL files of the services, determining the relevant ones and building the composition); the created compositions included both the EHR DaaS services and the necessary mediation services. (ii) In all of the considered queries (20 queries), the system was able to insert the necessary data mediation services to transform data between heterogeneous component services in a composition. The system users (mainly physicians in our tested examples) expressed their satisfaction, as they were able to answer their queries on the fly without any programming involved. Users used the created compositions in their daily clinical scenarios (e.g., prescribing a medication, studying the risks of a medication, etc.).
(Components of Figure 12: the Web Service Management System — interactive query formulator, RDF query rewriter, composition plan generator, conflict detector and resolver, execution engine, service locator and up-cast/down-cast message transformer — operates over UDDI registries, SOAP messages and WSDL-S files; the DaaS Web services, annotated with RDF views over the RDFS domain ontology, sit on top of proprietary applications and medical databases.)
Fig. 12. The implemented system for evaluating our approach
7 Related Works
In this section, we give an overview of the main research works related to our subject. We have categorized these works into the following areas: EHR data integration, EHR Web service composition, and query rewriting approaches for automatic Web service composition.

7.1 Electronic Health Record Data Integration
Data integration is the problem of combining data residing at different sources to provide the user with a unified view of the data. There is a large body of research work on data integration, not only in the area of computer science but also in health and biomedical informatics [1,11]. Broadly speaking, data integration approaches can be classified into two main families: the “data warehouse” and the “mediation” approaches. The mediation approach, unlike the data warehouse one, does not materialize data; it integrates data at query processing time. In the e-health area most data integration projects have adopted this approach, like, to name a few, Synapses [16], Synex [16] and Pangea-LE [1]. Most of these projects provide a global XML schema over structured XML views of the EHR documents stored for a patient in existing health data repositories. The mediated system accepts requests for data from clients, decomposes them into queries against the connected data sources, and integrates the responses dynamically. In contrast with these projects, we
adopt a service-oriented data mediation architecture and a service composition approach to integrate data on the fly by composing autonomous EHR DaaSs. Also, unlike these projects, we handle the semantic conflicts at the data value level when data are exchanged among healthcare systems.

7.2 Medical Web Services Composition
A large number of research works have addressed the problem of WS composition in the healthcare application domain [2,28,39,21]. However, the bulk of these works have focused only on workflow-oriented compositions, i.e., compositions that implement the different steps involved in a given business task (e.g., patient admission in a hospital, patient discharge, etc.). We review some of these works in the following.

The authors of [2] defined a model-driven approach for semi-automatic Web service orchestration with run-time binding in the healthcare domain. Information related to medical Web services can be found in the corresponding standardization documents, for instance HL7, DICOM and IHE. Unfortunately, WS composition in that work does not allow integrating heterogeneous medical data sources.

Budgen et al. [9] propose the Integration Broker for Heterogeneous Information Sources (IBHIS). The proposed broker adopts a service-based model to query data at autonomous healthcare agencies. The broker achieves its goals through the use of semantic data descriptions, a semantic registry and a query engine. The semantic registry is based upon an extended form of UDDI, incorporating a matchmaker to match the OWL-S data descriptions from the registry to the users' queries. The query engine formulates the users' queries, interacts with the matchmaker to answer the query, and displays the final results to the users. Unlike our work, in that work there is no way to compose services to address users' complex queries (i.e., the work assumes that a query can always be resolved by one service). Furthermore, the use of the OWL-S language alone does not allow specifying explicitly the relationship between the inputs and outputs of a DaaS service, which may lead to errors in the service matching phase. Further, IBHIS relies on SNOMED as a mediated ontology, which is a restriction, as in our vision we should remain independent from a specific ontology.

Hristoskova et al. [20] present an implementation of a dynamic and automatic composer for medical support services in the ICU (Intensive Care Unit). The composition is achieved by semantically described Web services in order to provide automatic WS-BPEL composition. In comparison with our work, this solution does not address EHR data integration using WS composition. Also, it is restricted to the intensive care unit of a hospital, where data heterogeneity is not a real concern.

The ARTEMIS project [12,8] ensures the semantic interoperability of Electronic Health Records and medical applications through the use of Web service technology. It presents a mechanism for publishing, discovering and invoking semantically enriched Web services in peer-to-peer medical data sharing environments. Web services are annotated with OWL medical ontologies. However,
the ARTEMIS project does not provide means to compose individual medical data services to answer users' complex queries. Furthermore, heterogeneities at the data value level were not addressed in that work.

7.3 Query Rewriting Approach for Web Service Composition
Automatic Web service composition approaches can be classified, according to the techniques adopted to solve the composition problem, into AI-planning-based approaches, workflow-based approaches and query-rewriting-based approaches [6,40,27,5,36,38,32]. We review in the following some works in the last category, as they relate to our approach.

Lu et al. [27] provide a framework for answering queries with a conjunctive plan that includes the inputs and outputs of the participating Web services, annotated with Datalog expressions. In [36], a combination of the inverse-rules query reformulation algorithm and tuple filtering is used to generate a universal integration plan, or composition, to answer user queries. However, those works do not take into account the semantics of the services during the matching phase; i.e., services are matched based on type matching. In order to address this problem, Bao et al. [5] proposed a semantic query rewriting approach using the CARIN [15] language. However, unlike our work, the semantic query rewriting in that work is limited to one-to-one matching; i.e., the work assumes that a query can be resolved by one service and there is no need for service combination.

Also, Zhou et al. [40] introduce an ontology-based approach to publishing and composing data-intensive Web services. They propose an extension to the description capability of OWL-S. This extension has the form of a SPARQL query defining the semantic content of, and the constraints on, the data published by the service. An algorithm for generating service compositions based on an ontology language and graph-based planning is also outlined. However, our work relies on a more efficient RDF query rewriting algorithm [6] that uses many optimization heuristics to outperform their proposed algorithm. Furthermore, we address data value heterogeneities in the obtained compositions.

Finally, Vaculin et al. [38] describe mechanisms for the specification of generic data providing services using RDF views. They provide a characterization of matching conditions for DaaS services and develop an algorithm for matching, with the calculation of a matching degree between service requests and service advertisements. However, that work overlooks the relationships between matched outputs and inputs in the RDF graph of the DaaS service, which may lead to erroneous matching results.
8 Perspectives and Conclusions
In this paper, we proposed an approach to automatically compose EHR DaaS Web services published by heterogeneous health information systems that employ different EHR data standards. The proposed approach follows a local-as-view paradigm by explicitly requiring a two-level mediated ontology. The first
level models the generic data concepts and their inter-relationships, while the second allows specifying in a declarative way how a concept of the first level is represented in the different health ontologies and data standards. These ontologies are utilized to annotate EHR DaaS and mediation services on the one hand, and to specify user queries on the other hand. User queries are rewritten in terms of EHR DaaS services using an efficient query rewriting algorithm. Furthermore, our approach makes use of mediation services to handle the semantic heterogeneity of the exchanged data. As future work, we intend to improve our mediation approach in order to address structural-level incompatibilities, as well as complex data transformations between input and output parameters of EHR DaaSs. In order to realize this objective, we plan to study different solutions for the composition of mediation services to ensure a complete mediation in EHR DaaS compositions. In addition, we intend to include the data quality aspects (e.g., data trustworthiness, provenance, etc.) of the data provided by the EHR DaaSs in our composition approach. We will investigate the use of RDF reification and named graphs in that respect.
References 1. Angulo, C., Crespo, P., Maldonado, J.A., Moner, D., P´erez, D., Abad, I., Mandingorra, J., Robles, M.: Non-invasive lightweight integration engine for building ehr from autonomous distributed systems. International Journal of Medical Informatics 76(suppl.3), 417–424 (2007) 2. Anzb¨ ock, R., Dustdar, S.: Modeling and implementing medical web services. Data & Knowledge Engineering 55(2), 203–236 (2005) 3. Arguello, M., Des, J., Perez, R., Fernandez-Prieto, M., Paniagua, H.: Electronic health records (ehrs) standards and the semantic edge: A case study of visualising clinical information from ehrs. In: International Conference on Computer Modeling and Simulation, pp. 485–490 (2009) 4. Austin, T., Kalra, D., Tapuria, A., Lea, N., Ingram, D.: Implementation of a query interface for a generic record server. I. J. Medical Informatics 77(11), 754–764 (2008) 5. Bao, S., Zhang, L., Lin, C., Yu, Y.: A semantic rewriting approach to automatic information providing web service composition. In: Mizoguchi, R., Shi, Z.-Z., Giunchiglia, F. (eds.) ASWC 2006. LNCS, vol. 4185, pp. 488–500. Springer, Heidelberg (2006) 6. Barhamgi, M., Benslimane, D., Medjahed, B.: A query rewriting approach for web service composition. IEEE Transactions on Services Computing 3, 206–222 (2010) 7. Beale, T.: Archetypes: Constraint-based Domain Models for Futureproof Information Systems 8. Bicer, V., Laleci, G.B., Dogac, A., Kabak, Y.: Artemis message exchange framework: semantic interoperability of exchanged messages in the healthcare domain. SIGMOD Rec. 34, 71–76 (2005) 9. Budgen, D., Rigby, M., Brereton, P., Turner, M.: A data integration broker for healthcare systems. Computer 40, 34–41 (2007)
10. Calvanese, D., Giacomo, G.D., Lenzerini, M., Nardi, D., Rosati, R.: A principled approach to data integration and reconciliation in data warehousing. In: Proceedings of the International Workshop on Design and Management of Data Warehouses, DMDW 1999 (1999) 11. Cruz-Correia, R., Vieira-Marques, P., Ferreira, A., Almeida, F., Wyatt, J., CostaPereira, A.: Reviewing the integration of patient data: how systems are evolving in practice to meet patient needs. BMC Medical Informatics and Decision Making 7(1), 14 (2007) 12. Dogac, A., Laleci, G.B., Kirbas, S., Kabak, Y., Sinir, S.S., Yildiz, A., Gurcan, Y.: Artemis: Deploying semantically enriched web services in the healthcare domain. Information Systems 31(4-5), 321–339 (2006); The Semantic Web and Web Services 13. Dolin, R., Alschuler, L., Beebe, C., Biron, P., Boyer, S., Essin, D., Kimber, E., Lincoln, T., Mattison, J.: The HL7 clinical document architecture. Journal of the American Medical Informatics Association 8(6), 552 (2001) 14. Eichelberg, M., Aden, T., Riesmeier, J., Dogac, A., Laleci, G.B.: A survey and analysis of electronic healthcare record standards. ACM Comput. Surv. 37(4), 277– 315 (2005) 15. Goasdou´e, F., Latt`es, V., Rousset, M.-C.: The use of carin language and algorithms for information integration: The picsel system. Int. J. Cooperative Inf. Syst. 9(4), 383–401 (2000) 16. Grimson, J., Stephens, G., Jung, B., Grimson, W., Berry, D., Pardon, S.: Sharing health-care records over the internet. IEEE Internet Computing 5, 49–58 (2001) 17. Halevy, A.Y.: Theory of answering queries using views. SIGMOD Rec. 29(4), 40–47 (2000) 18. Halevy, A.Y.: Answering queries using views: A survey. The VLDB Journal 10, 270–294 (2001) 19. Hori, M., Ohashi, M.: Applying xml web services into health care management. In: Hawaii International Conference on System Sciences, vol. 6, p. 155a (2005) 20. Hristoskova, A., Moeyersoon, D., Hoecke, S.V., Verstichel, S., Decruyenaere, J., Turck, F.D.: Dynamic composition of medical support services in the icu: Platform and algorithm design details. Computer Methods and Programs in Biomedicine 100(3), 248–264 (2010) 21. Kart, F., Miao, G., Moser, L., Melliar-Smith, P.: A distributed e-healthcare system based on the service oriented architecture. In: IEEE International Conference on Services Computing, SCC 2007, pp. 652–659 (2007) 22. Katehakis, D.G., Sfakianakis, S., Kavlentakis, G., Anthoulakis, D.N., Tsiknakis, M.: Delivering a lifelong integrated electronic health record based on a service oriented architecture. IEEE Transactions on Information Technology in Biomedicine 11(6), 639–650 (2007) 23. Lee, Y., Patel, C., Chun, S.A., Geller, J.: Towards intelligent web services for automating medical service composition. In: Proceedings of the IEEE International Conference on Web Services, ICWS 2004, p. 384. IEEE Computer Society Press, Washington, DC, USA (2004) 24. Lenz, R., Beyer, M., Kuhn, K.A.: Semantic integration in healthcare networks. International Journal of Medical Informatics 76(2-3), 201–207 (2005); Connecting Medical Informatics and Bio-Informatics - MIE 2005 25. Liu, Q., Huang, T., Liu, S.-H., Zhong, H.: An ontology-based approach for semantic conflict resolution in database integration. Journal of Computer Science and Technology 22, 218–227 (2007), doi:10.1007/s11390-007-9028-4 26. Liu, S., Ni, Y., Mei, J., Li, H., Xie, G., Hu, G., Liu, H., Hou, X., Pan, Y.: ismart: Ontology-based semantic query of cda documents
27. Lu, J., Yu, Y., Mylopoulos, J.: A lightweight approach to semantic web service synthesis. In: WIRI 2005, Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration, pp. 240–247 (2005) 28. Mykk¨ anen, J., Riekkinen, A., Sormunen, M., Karhunen, H., Laitinen, P.: Designing web services in health information systems: From process to application level. International Journal of Medical Informatics 76(2-3), 89–95 (2005) 29. Nagarajan, M., Verma, K., Sheth, A., Miller, J.: Ontology driven data mediation in web services. International Journal of Web Services Research 4(4), 104–126 (2007) 30. Orgun, B., Vu, J.: Hl7 ontology and mobile agents for interoperability in heterogeneous medical information systems. Computers in Biology and Medicine 36(7-8), 817–836 (2006); Special Issue on Medical Ontologies 31. Ram, S., Park, J.: Semantic conflict resolution ontology (scrol): an ontology for detecting and resolving data and schema-level semantic conflicts. IEEE Transactions on Knowledge and Data Engineering 16(2), 189–202 (2004) 32. Rao, J., Su, X.: A survey of automated web service composition methods. In: Cardoso, J., Sheth, A.P. (eds.) SWSWPC 2004. LNCS, vol. 3387, pp. 43–54. Springer, Heidelberg (2005) 33. Sachdeva, S., Bhalla, S.: Semantic Interoperability in Healthcare Information for EHR Databases. In: Kikuchi, S., Sachdeva, S., Bhalla, S. (eds.) DNIS 2010. LNCS, vol. 5999, pp. 157–173. Springer, Heidelberg (2010) 34. Saleh, I., Kulczycki, G., Blake, M.B.: Demystifying data-centric web services. IEEE Internet Computing 13, 86–90 (2009) 35. Stroetmann, K., Stroetmann, V.: Towards an Interoperability Framework for a European e-Health Research Area–Locating the Semantic Interoperability Domain. In: EC Workshop on semantic interoperability, Brussels, pp. 14–15 (February 2005) 36. Thakkar, S., Ambite, J.L., Knoblock, C.A.: A data integration approach to automatically composing and optimizing web services. In: Proceedings of the ICAPS Workshop on Planning and Scheduling for Web and Grid Services (2004) 37. Truong, H.L., Dustdar, S.: On analyzing and specifying concerns for data as a service. In: APSCC, pp. 87–94 (2009) 38. Vacul´ın, R., Chen, H., Neruda, R., Sycara, K.: Modeling and discovery of data providing services. In: Proceedings of the 2008 IEEE International Conference on Web Services, pp. 54–61. IEEE Computer Society, Washington, DC, USA (2008) 39. Wright, A., Sittig, D.F.: Sands: A service-oriented architecture for clinical decision support in a national health information network. J. of Biomedical Informatics 41, 962–981 (2008) 40. Zhou, L., Chen, H., Wang, J., Zhang, Y.: Semantic web-based data service discovery and composition. In: International Conference on Semantics, Knowledge and Grid, pp. 213–219 (2008)
A Modular Database Architecture Enabled to Comparative Sequence Analysis

Paola Bonfante1, Francesca Cordero2,3, Stefano Ghignone1, Dino Ienco2, Luisa Lanfranco1, Giorgio Leonardi4, Rosa Meo2, Stefania Montani4, Luca Roversi2, and Alessia Visconti2

1 Dipartimento di Biologia Vegetale, Università di Torino, Italy
2 Dipartimento di Informatica, Università di Torino, Italy
3 Dipartimento di Scienze Cliniche e Biologiche, Università di Torino, Italy
4 Dipartimento di Informatica, Università del Piemonte Orientale, Italy
{fcordero,ienco,meo,roversi,visconti}@di.unito.it, giorgio.leonardi@mfn.unipmn.it, stefania.montani@unipmn.it, {pbonfant,sghignon,llanfran}@unito.it
Abstract. The beginning of the post-genomic era is characterized by a rising number of publicly collected genomes. The evolutionary relationships among these genomes may be captured by means of the comparative analysis of sequences, in order to identify both homologous and non-coding functional elements. In this paper we report on the on-going BIOBITS project. It is focused on studies concerning bacterial endosymbionts, since they offer an excellent model to investigate important biological events, such as organelle evolution, genome reduction, and transfer of genetic information among host lineages. The BIOBITS goal is two-sided: on the one hand, it pursues a logical data representation of genomic and proteomic components; on the other hand, it aims at the development of software modules allowing the user to retrieve and analyze data in a flexible way.
1 Introduction
Genomics and post-genomics studies, which have bloomed in the last decade, are offering new tools for applied biotechnological research in several fields, from the medical and pharmaceutical to the industrial and environmental ones. Sequencing of the human genome has generated a great deal of interest in the diagnosis and treatment of diseases using genomic medicines. Structural genomics approaches covering topologically similar proteins or gene families are great assets for progress in the development of novel therapeutics. In addition, the genomic analysis of microbial communities in a culture-independent manner (metagenomics) has also given the opportunity to probe and exploit the enormous resource represented by the still undescribed microbial diversity.
This paper is an extension of a work already published [12]. It describes the on-going project BIOBITS1, which aims at performing extensive comparative genomic studies in order to answer fundamental questions concerning biology, ecology and evolutionary history. The specific goal of BIOBITS is to get insights into the tri-partite system constituted by (i) a bacterial endosymbiont of an arbuscular mycorrhizal (AM) fungus, (ii) AM fungi living in plant roots, and (iii) plant roots. Bacterial endosymbionts are widespread in the animal kingdom, where they offer excellent models for investigating important biological events such as organelle evolution, genome reduction, and transfer of genetic information among host lineages [30]. By contrast, examples of endobacteria living in fungi are limited [26], and those best investigated live in the cytoplasm of AM fungi [9]. AM fungi are themselves obligate symbionts since, to complete their life cycle, they must enter into association with the roots of land plants. AM species belonging to the Gigasporaceae family harbour a homogeneous population of endobacteria which have recently been grouped into a new taxon named Candidatus Glomeribacter gigasporarum [7]. The AM fungus and its endobacterium Ca. Glomeribacter gigasporarum are currently used as a model system to investigate endobacteria-AM fungi interactions. The project takes advantage of the employment of a massive large-scale analysis and genomic comparison study of phylogenetically related free-living bacteria. Moreover, the comparison with the genomes of other endosymbiont species will provide insights into the reasons for the strict endosymbiotic life-style of this bacterium. Another aspect taken into account is the analysis of metabolic pathways. A strong reason of interest in this project is based on the assumption that the symbiotic consortia may lead to the discovery of molecules of interest for the development of novel therapies and other applications in biotech.

In this paper we report specifically on a step of BIOBITS whose goal, roughly, is the development of a modular database which allows importing, storing, and analyzing massive genomic data. Later in BIOBITS we will extensively develop a computational genomic comparison focused on the above bacterium and fungi genomes. BIOBITS deploys a data warehouse that stores in a multi-dimensional model the interesting components of the project. Such a component should have the following characteristics: i) being able to store genomic data from multiple organisms, possibly taken from different public database sources; ii) annotating the genomic data making use of the alignment between the given sequences and the genomic sequences of other similar organisms; iii) annotating the genomic sequences and the protein transcript products by the full use of ontologies developed by the biology and bioinformatics communities; iv) comparing and visually presenting the results of the genomic alignment; v) being able to cluster genomic or proteomic data coming from different organisms. The aim is at finding easily
1 BIOBITS is a project funded by Regione Piemonte under the Converging Technologies Call. BIOBITS involves Università di Torino, Università del Piemonte Orientale, CNR and the companies ISAGRO Ricerca s.r.l., GEOL Sas, Etica s.r.l.
increasing levels of similarity, so as to induce, on the one side, the steps of the phylogenetic evolution and, on the other side, to investigate the metabolic pathways. As a matter of fact, we wish to take advantage of the possibilities offered by computer science technology and its methodologies to analyse the genomic data the project will produce. The analysis of genomic data requires computational tools that allow “navigating” the data flexibly, from arbitrary (at least in principle) user-defined perspectives and under different degrees of approximation.

In this paper we describe the BIOBITS system architecture in terms of the BIOBITS Data Mart and the BIOBITS modules. With respect to the previous publication [12], we report a detailed description of two modules, namely the Case-Based Reasoning and Co-clustering modules, that have been developed to perform a comparative genomic analysis. Moreover, we show the results obtained in a case study by the use of the system. The case study shows the utility and flexibility of an integrated system whose modules allow retrieving and analyzing different portions of the data, at the granularity level that is needed by the user. This flexibility eliminates the need to perform any pre-processing of the data in order to adapt it to the analysis algorithm and to the user's goal. In the presented case study we extracted a set of biological sequences belonging to the organism under investigation by following the BIOBITS Data Mart star schema. The BIOBITS project focuses on the identification of the evolutionary relationships among the species most similar to Ca. G. gigasporarum. Using the Case-Based Reasoning module, we retrieved sequences that are similar to the given organism. Retrieval is performed according to a suitable abstraction level over the data, given by a taxonomy of granularities. Finally, we applied the Co-clustering module to the resulting sequences and were able to identify protein domains common among the sequences.
2 Related Works
There is a wide variety of approaches to designing tools for the analysis of biological data. Experience suggests that the best way to data analysis is to set up a database. A 'historical' example is ACeDB (A C. elegans Database [1]), one of the first hierarchical, rather than relational, model organism databases. Another example is ArkDB [21], a schema that was created to serve the needs of the subset of the model organism community interested in agriculturally important animals. ArkDB has been successfully used across different species by different communities, but is rarely used outside the agricultural community. On top of the databases a great variety of applications is available, ranging from those for the annotation community to molecular pathway visualization, and from work-flow management to comparative genome visualization. Currently, there is a rich community and many available software tools built around MAGE [27] and GMOD [33]. GMOD stands for the Generic Model Organism Database project, which led to the development of a whole collection of software tools for creating and managing genome-scale biological databases, as described in the following. In the BIOBITS project, GMOD and its database Chado have been selected as the data elaboration and management centre.
2.1 GMOD and Chado Database
The BIOBITS software architecture is built upon a layer provided by the GMOD system. We report here the main motivations that led to this choice. The design and implementation of database applications is time consuming and labor-intensive. When database applications are constructed to work with a particular schema, changes to the database schema require in turn changes to the software. Unfortunately, these changes are frequent in real projects due to changes in requirements, and they are particularly frequent in bioinformatics. Most critical are the changes in the nature of the underlying data, which follow the current understanding of the natural world. Additional requirements are placed by the rapid technological changes in experimental methods and materials. Finally, the wide variety of biological properties across organism species has always made it difficult to create a unique model schema valid for all species. All the motivations outlined above led to the design of the Chado database model, which is generic and extensible and whose software is available under an open source delivery policy. The Chado schema can be employed as the core schema of any model organism data repository. This common schema increases the interoperability between the software modules that operate on it. Chado data population is driven by ontologies, i.e., controlled vocabularies. Ontologies give a typing to the entities, with the result of partitioning the whole schema into subschemas, called modules. Each module encapsulates a different biological domain and uses an appropriate ontology. An ontology characterizes the different types of entities that exist in a world under consideration by means of primitive relations. These primitives are easy to understand and to use, they are expressive and consistent, and they allow reasoning about the concepts under representation. Typical examples of ontological relations are: (i) is a, which expresses when a class of entities is a subclass of another class, and (ii) part of, which expresses when a component constitutes a composite. Many other relation types are discussed in [15]. Concerning the schema of Chado, it is worth remarking on the feature and sequence entities. feature holds both data and metadata; it can be populated by instances, each determining the type of every other instance in the schema, in accordance with the ontology SO [15]. sequence contains biological sequence features, which include genetically encoded entities like genes, their products, exons, regulatory regions, etc. feature and sequence are further described by properties.
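As an illustration of how this ontology-driven typing shows up in practice, the following minimal sketch queries a Chado instance for the gene features of an organism. It assumes a standard Chado installation on PostgreSQL and uses the psycopg2 driver; the table and column names are the standard Chado ones, while the connection parameters and the organism values are placeholders.

```python
import psycopg2

conn = psycopg2.connect(dbname="chado", user="biobits", host="localhost")  # placeholders

# Every row of `feature` is typed by a cvterm (e.g. the Sequence Ontology term
# "gene"): this is exactly the ontology-driven typing described above.
SQL = """
SELECT f.uniquename, f.seqlen
FROM   feature  f
JOIN   cvterm   t ON t.cvterm_id   = f.type_id
JOIN   organism o ON o.organism_id = f.organism_id
WHERE  t.name  = 'gene'
AND    o.genus = %s AND o.species = %s;
"""

with conn, conn.cursor() as cur:
    # Illustrative organism values; they depend on how the genome was loaded.
    cur.execute(SQL, ("Candidatus Glomeribacter", "gigasporarum"))
    for uniquename, seqlen in cur.fetchall():
        print(uniquename, seqlen)
```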
3 BIOBITS System Architecture
Here we deepen the description of the system, which is designed to manage all the information and all the in-silico activities in the context of the BIOBITS project. The system is implemented through a modular architecture, described in detail in Section 3.2. The system architecture permits (1) storing and accessing locally all the information regarding the organisms to be studied, and (2) providing algorithms and user interfaces to support the researchers' activities, like: (i) searching and retrieving genomes, (ii) comparing and aligning with a genome
of reference, (iii) investigating syntenies, and (iv) locally storing potentially new annotations. The system architecture has been engineered exploiting the standard modules and interfaces offered by the GMOD project [33], and completed with custom modules that provide new functionalities. The main module of the system contains the database, which provides all the data needed to perform the in-silico activities related to the project. Thanks to the adoption of the Chado database schema, on the one hand we take advantage of its support for controlled vocabularies and ontologies; on the other hand, Chado is the standard database for most of the GMOD modules, therefore we can reuse these modules to support the main activities of the project and extend the system incrementally as the researchers' needs evolve. An example is the possibility to use Chado's BioMart module, which helps the user to identify the relevant dimensions of the problem and their hierarchies, and to transform and import input data into the data warehouse, conforming them to a typical star schema.

3.1 Star Schema in BIOBITS Data Mart
Essential in the data warehouse is the logical star schema of the stored data. The star schema defines the dimensions of the problem. Often, each dimension of the star schema can be viewed at different abstraction levels; the levels are organized in a hierarchy. Finally, the central entity in the star schema collects the main facts or events of interest. In the case of the BIOBITS project, there are two star schemas.
1. The star built around the genome composition facts. It represents the composition of each genome in terms of genes and chromosomes and with reference to the belonging organism.
2. The star schema around the protein facts. It describes the proteins in terms of PROSITE domains and with respect to the dimensions of phylogenetic classification and metabolic pathways.
The gene and protein facts are linked by the relationship representing the encoding. For most of the dimensions, such as genes and phylogenetic classification, the scientific literature has already provided ontologies (e.g., the Gene Ontology, GO) and controlled vocabularies (Clusters of Orthologous Groups, COG) that are available in public domain databases and are imported into the system. Another example of an available hierarchy on genes and proteins is their organization into families. In the following we describe the BIOBITS Data Mart schema (shown in Figure 1) in detail.

Genome Composition. It includes all the relevant information about a genome fragment. Considering a fragment view of the genome, genome composition includes all the known fragments composing a genome: it reports the precise boundaries of the fragments (which depend on the user experience and
Fig. 1. Star schema of BIOBITS Data Mart
discoveries), the start position and the fragment order with respect to the genome, its nucleotide sequence and its strand.

Chromosome/Plasmid DNA. It specifies the localization of the fragment, expressed by the number or the name of the corresponding chromosome/plasmid location. Indeed, the genome could be inserted either in a chromosome sequence or in a plasmid sequence.

Organism. It specifies both endosymbiotic and ectosymbiotic bacteria. An organism is identified by the specified identifier, and includes the organism's scientific name and its classifications in the taxonomy database.

Gene Annotation. It consists of a short report of gene-specific information (identifier and name), comprehensive of a brief description of the gene products using both the information reported in the Gene Ontology and the main references stored in PubMed.

Gene Synonymous. It contains all the synonymous names associated with each gene. Genes and proteins are often associated with multiple names; additional names are included as new functional or structural information is discovered. Since authors often alternate between synonyms, computational analysis benefits from collecting synonymous names.

Gene Family. Following the classification of genes into families, consistent with the genes' biochemical similarity, it reports the family identifiers.
Product. It is a class of the products that genes codify. Products are categorized into three classes: transfer RNA (tRNA), ribosomal RNA (rRNA) and proteins. Moreover, it reports a pseudogene indication if the gene has lost its coding ability.

tRNA. Transfer RNA is a small RNA molecule that transfers a specific active amino acid to a growing polypeptide chain.

rRNA. Ribosomal RNA is the central component of the ribosome. The ribosome is a complex of ribosomal RNA and ribonucleoproteins.

Metabolic Pathways. It represents pathways, which are composed of sets of biochemical reactions. Each pathway represents the knowledge on the molecular interaction and reaction network.

Protein. It refers to protein-specific information (protein identifier and name). A protein is a set of organic compounds (polypeptides) obtained by transcription and translation of a DNA sequence.

Phylogenetic Classification. It consists of the Clusters of Orthologous Groups (COG) of the protein sequences encoded in a complete genome.

Domain Name. It reports the domains, extracted from the PROSITE database [22], characterizing the protein sequence. PROSITE consists of documentation entries describing protein domains, families and functional sites. The relationship among proteins and domains is characterized by the attribute order, describing how the domains that compose a specific protein are sorted.
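To make the star schema concrete, the following is a minimal sketch (not the project's actual DDL) of how the protein-centred star could be queried, for example to count the proteins of an organism that contain a given PROSITE domain. The table and column names are invented for illustration and simply mirror the dimensions listed above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE organism     (organism_id INTEGER PRIMARY KEY, scientific_name TEXT);
CREATE TABLE domain       (domain_id   INTEGER PRIMARY KEY, prosite_id TEXT, name TEXT);
CREATE TABLE protein_fact (protein_id  INTEGER, organism_id INTEGER,
                           domain_id   INTEGER, domain_order INTEGER);
""")

QUERY = """
SELECT o.scientific_name, d.name, COUNT(DISTINCT p.protein_id) AS n_proteins
FROM   protein_fact p
JOIN   organism o ON o.organism_id = p.organism_id
JOIN   domain   d ON d.domain_id   = p.domain_id
WHERE  d.prosite_id = ?
GROUP  BY o.scientific_name, d.name;
"""
for row in con.execute(QUERY, ("PS-EXAMPLE",)):   # placeholder PROSITE accession
    print(row)
```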
3.2 System Architecture
Figure 2 summarizes the main architecture of the BIOBITS system. In the following we focus on objectives and features of the BIOBITS system. Local and global access to data. The instance of Chado we want to set up will contain both data on genome we shall explicitly produce as part of the project BIOBITS and data retrieved from the biological databases accessible through the Internet. The Import modules in Figure 2 will accomplish such a requirements. Concerning the retrieval from Internet, RRE - Queries is a GUI wizard, built on the basis of a previously published tool [24], able to query different biological databases like for example GenBank [19] and able to convert the results of the queries into standard formats. Alternatively, we can convert the format of data retrieved from Internet thanks to the scripts available as part of the GMOD project. A remarkable example are those scripts that convert GenBank genes annotations into the Generic Feature Format (GFF), adopted as a standard in the GMOD project. Of course, once data have been retrieved, Import Modules update Chado, either on-demand, or automatically, possibly on a regular basis. An On Line Architecture Mining architecture. One of the advantages of a data warehouse is the ready availability of clean, integrated and consolidated data represented by a multiplicity of dimensions. Once that data are stored in the data warehouse, elementary statistics can be computed on the available facts
Fig. 2. The architecture of BIOBITS system
and aggregations of measures and frequencies of facts can be immediately computed. The results can be browsed and compared by OLAP primitives and tools. Finally, on these statistics the power of data mining algorithms can be further exploited. This is the On-Line Analytical Mining (OLAM) view of a software architecture [20]. OLAM is composed of a suite of data mining algorithms that receive from the client a query for a knowledge discovery task. The request can be answered by the predictive and semi-automatic capabilities of the data mining algorithms. In turn, these work on the results of an underlying OLAP server that receives the input data from the underlying data warehouse. For the transformation of the data stored in Chado into the star schema of Figure 1 we exploit BioMart [8], which is a software package available inside GMOD.

Services on Chado and the Star Schema. In Figure 2, associated with both the Chado instance and the BIOBITS Data Mart, we plan to offer two types of services. The first type is implemented on the basis of existing GMOD modules; Figure 2 highlights them in the uppermost dashed box, named GMOD GUI Modules. The second type of services is internal to the BIOBITS system proper: they are shown in Figure 2 inside the central dashed box, named BIOBITS system. Now, we discuss the latter components in detail, putting much emphasis on the features of the software modules that we specifically develop in support of the realization of the goals of the project.

GMOD Graphical User Interface Modules. These modules exploit the available GMOD modules using the Chado database to provide the researchers with the tools for comparative genomics needed by the BIOBITS project. GUI modules have
also a graphical user interface and allow the user to interact with the system. In particular:
– CMap allows users to explore comparisons of genetic and physical maps. The package also includes tools for maintaining map data;
– GBrowse is a genome viewer, and also permits the manipulation and the display of annotations on genomes;
– GBrowse_syn is a GBrowse-based synteny browser designed to display multiple genomes, with a central reference species compared to two or more additional species;
– Sybil is a system for comparative genomics visualizations;
– MartExplorer and MartView are two user interfaces allowing the user to explore and visualize the stored experimental results and the database content.

BIOBITS system specific modules. The goal of these modules is to allow data analysis under two perspectives that complement each other and serve for validation. The first perspective is the one offered by the Case-Based Reasoning module. It supports efficient retrieval strategies in the context of the search for genomic similarity and syntenies, directly operating on our implementation of the star schema inside BioMart. The other perspective exploits tools from data mining. We shall use them to perform advanced elaborations on the genomic data. Among the data mining modules we foresee modules for classification, for feature selection and for clustering. The latter will be discussed in more detail in this paper, since it has been the first to be integrated into the BIOBITS system. Indeed, one of the main goals of the whole BIOBITS project is to provide the results of fragment alignment tools. Since clustering provides a specifically useful service for the exploration and elaboration of the similarities among genes and proteins, its results could provide the synteny tools with additional information that would enhance the fragment elaboration. As a concluding remark, the plan is to develop the BIOBITS system specific modules as web-based GUIs in order to gain user-friendliness and a good degree of interoperability, similarly to the current GMOD modules that are able to connect to other modules through standard interfaces. Of course we shall adhere to the open source philosophy, so any BIOBITS system specific module will be made available as part of the whole GMOD project.
4
Software Modules to Support Researchers’ Activities
The main contribution of the BIOBITS project is the development of two GMOD modules to analyse the knowledge stored in the data warehouse. The following section describes the details of these new modules, based on Case-Based Reasoning and clustering.
4.1
Case-Based Reasoning
Within the BIOBITS architecture, we worked on the design and implementation of an intelligent retrieval module, which implements the retrieval step of the Case-Based Reasoning (CBR) [2] cycle. CBR is a reasoning paradigm that exploits the knowledge collected on previously experienced situations, known as cases. The CBR cycle operates by (1) retrieving past cases that are similar to the current one and (2) reusing past successful solutions; (3) if necessary, past solutions are properly adapted to the new context in which they have to be used; finally, (4) the current case can be retained and put into the system knowledge base, called the case base. It is worth noting that purely retrieval-oriented systems, which leave the completion of the reasoning cycle (steps 2 to 4) to the user, are very valuable decision support tools [38], especially when automated adaptation strategies can hardly be identified, as in biology and medicine [28]. This is exactly the strategy we follow in the current approach. Our retrieval module is meant to support comparative genomics studies, which represent a key instrument to: (1) discover or validate phylogenetic relationships, (2) give insights into genome evolution, and (3) infer metabolic functions of a particular organism. In the module, cases are genomes, each one taken from a different organism and properly aligned with the same reference organism. Since the alignment task is a prerequisite in our library, we first describe the selected sequence alignment strategy and then detail the case representation and retrieval steps of our module.
Sequence Alignment. To deal with the alignment task we rely on BLAST [3], a state-of-the-art local alignment algorithm specifically designed for bioinformatics applications. It takes as input a nucleotide sequence and aligns it against a database of sequences belonging to (different) organisms of interest. From a typical BLAST output (Figure 3) one can extract basic information (the percentage of identical positions and the length of the alignment) that can be easily plotted, as shown in Figure 4.
Case Representation. From an application viewpoint, it makes sense to convert the quantitative similarity values in Figure 4 into a set of qualitative levels (e.g. low, medium, high similarity). This provides a "higher level" view of the information, able to abstract from unnecessary details. To perform the conversion, we exploit a semantic-based abstraction process, similar to the Temporal Abstraction (TA) techniques described in [40,5]. Indeed, in our domain, we consider the symbol position in the aligned strings, instead of time, as the independent variable. As in TA, we move from a point-based to an interval-based representation of the data, where the input points are the symbol positions, and the output intervals (episodes) aggregate adjacent points sharing a common behavior, persistent over the sequence. In particular, we rely on state abstractions [5] to extract episodes associated with qualitative levels of similarity between the two aligned strings, where the mapping between qualitative abstractions
Fig. 3. BLAST sequence alignment
and quantitative similarity values has to be parametrized on the basis of semantic domain knowledge. Semantic knowledge can also support a further refinement of the state abstraction symbols, according to a taxonomy like the one described in Figure 5. Obviously, the taxonomy can be properly modified depending on specific domain needs. Moreover, our tool allows the representation of the available sequences at any level of detail, according to a taxonomy of granularities, like the one depicted in Figure 6. This granularity change makes sense from a biological point of view: consider, e.g., that a region may be conserved among related organisms, while a specific gene within the region may not. Thus, a high similarity at the region level may be hard to identify at the level of single genes (as will be shown in the example discussed in Subsection 5.1). Notice that the definition of the taxonomy of granularities is strongly influenced by domain semantics. For instance, the number of nucleotides composing a gene depends on the specific organism and on the specific gene. Domain knowledge also strongly influences the conversion of a string of symbols from a given granularity to a different one, as required for flexible retrieval. To summarize, the case representation is obtained as follows. First, an optimal alignment of two nucleotide strings is calculated by BLAST. In particular, for each subsequence of nucleotides, a percentage of similarity with the aligned nucleotides in the paired string is provided. Abstractions of these quantitative levels are then calculated, converting the values into qualitative ones, expressed as strings of symbols. Abstractions are calculated at the ground level in the symbol taxonomy (and operate also at the ground level in the granularity taxonomy, since they work on nucleotides, see Figure 6). The resulting string of symbols is then stored in the case library as a case. Although cases are stored as abstractions at the ground level, they can easily be converted to
Fig. 4. A graphical visualization of sequence alignment
Fig. 5. An example taxonomy of state abstraction symbols; for instance, the high (H) symbol specializes into very high (Hv) and moderately high (Hm)
coarser levels in both dimensions (i.e. the dimension of the taxonomy of symbols, and the one of granularities). Such conversion is the means by which we support flexible case retrieval and will be described below.
Fig. 6. A taxonomy of sequence granularities
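As an illustration of the abstraction step just described, the short sketch below maps per-position identity percentages (as extracted from a BLAST alignment) to qualitative symbols and merges adjacent positions sharing the same symbol into episodes. The cut-off values and the three-symbol alphabet are illustrative assumptions; in BIOBITS the mapping is parametrized on domain knowledge.

```python
from itertools import groupby

# Illustrative thresholds only: the real cut-offs come from domain experts.
THRESHOLDS = ((90, "H"), (60, "M"), (0, "L"))

def to_symbol(identity):
    """Map a quantitative identity percentage to a qualitative symbol."""
    for cutoff, symbol in THRESHOLDS:
        if identity >= cutoff:
            return symbol
    return THRESHOLDS[-1][1]

def state_abstraction(identities):
    """Turn a point-based series (one identity value per position) into
    interval-based episodes (symbol, start_position, end_position)."""
    symbols = [to_symbol(v) for v in identities]
    episodes, pos = [], 0
    for symbol, run in groupby(symbols):
        length = len(list(run))
        episodes.append((symbol, pos, pos + length - 1))
        pos += length
    return episodes

# Per-nucleotide identity along a toy aligned fragment.
print(state_abstraction([95, 97, 93, 55, 52, 91, 94, 96]))
# -> [('H', 0, 2), ('L', 3, 4), ('H', 5, 7)]
```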
Case Retrieval (query answering). Taking advantage of the multi-level abstraction representation introduced above, we support flexible retrieval. In particular, we allow users to express their queries for case retrieval at any level of detail, both in the dimension of the data descriptions (i.e. at any level in the taxonomy of symbols) and in the dimension of granularity. Obviously, since cases are stored at the ground level in both dimensions, in order to identify the cases that match a specific query, the analyst must provide a function for scaling up (up henceforth) two or more symbols expressed at a specific granularity level to a single symbol expressed at a coarser one. Moreover, a proper distance function must be defined. The data structures described above, as well as the up and the distance functions, have to be detailed on the basis of the semantics of the specific application domain. However, we have identified a set of general "consistency" constraints that any meaningful choice must satisfy, in order to avoid ambiguous or meaningless situations. For instance, we enforce the fact that the distance monotonically increases with the ordering in the symbol domain. Moreover, the distance "preserves" the ordering also when is-a relationships between symbols are involved. For example, the distance between L (low) and M (medium) is smaller than the distance between L (low) and Hv (very high). The exhaustive presentation of such constraints is outside the scope of this paper, but can be found in [29]. In order to increase efficiency, our framework also takes advantage of multi-dimensional orthogonal index structures, which allow for early pruning and focusing in query answering. Indexes are built on the basis of the data structures previously described. The root node of each index is a string of symbols defined at the highest level in the symbol taxonomy (i.e. the children of "Any", as shown in Figure 5) and in the granularity taxonomy. A (possibly incomplete) index stems from each root, describing refinements along the granularity and/or the symbol dimension. An example multi-dimensional index, rooted in the H symbol, is represented in Figure 7. Note that, in the figure, granularity has been chosen as the leading dimension, i.e. the root symbol is first specialized in the granularity dimension. From each node of the resulting index, the sequence of symbols of the node itself is then orthogonally specialized in the secondary (i.e. the symbol) dimension, while keeping granularity fixed. However, the opposite choice for instantiating the leading and the secondary dimensions would also be possible. Each node in each index structure is itself an index, and can be seen as a generalized case, in the sense that it summarizes (i.e. it indexes) a set of cases. This means that the same case is typically indexed by different nodes in one index (and in the other available indexes). This supports flexible querying, since, depending on the level at which the query is issued according to the two taxonomies, one of the nodes can be better suited to provide a quick answer. To answer a query and enter the most appropriate index structure, we first progressively generalize the query itself in the secondary dimension (i.e. the symbol taxonomy in the example), while keeping the leading dimension (i.e. granularity
Fig. 7. An example of a multi-dimensional orthogonal index (rooted in the H symbol; granularity levels: region, gene, nucleotide; orthogonal refinements along the state taxonomy)
in the example) fixed. Then, we generalize the query in the other dimension as well. Following the generalization steps backwards, we can enter the index from its root and descend along it, until we reach the node which fits the leading-dimension level of the original query. If an orthogonal index stems from this node, we can descend along it, again following the query generalization steps backwards. We stop when we reach the same level of detail in the secondary dimension as in the original query. If the query detail level is not represented in the index, because the index is not complete, we stop at the most detailed level available. We then return all the cases indexed by the selected node. It is worth noting that indexes may be incomplete with respect to the taxonomies. Index refinement can be triggered automatically by the storage of new cases in the case base, and by the types of queries which have been issued so far. In particular, if queries have often involved, e.g., a symbol taxonomy level which is not yet represented in the index(es), the corresponding level can be created. A proper frequency threshold for counting the queries has to be set to this end. This policy augments the discriminating power of the indexes only when it is needed, while keeping the memory occupancy of the index structures as limited as possible. As a last remark, a number of tools to support comparative genomics studies are already available. For example, the VISTA tool (http://genome.lbl.gov/vista/index.shtml) allows the visualization of pre-computed pairwise and multiple alignments of whole genome assemblies. Besides alignment visualization, our tool also allows genomes to be mined at multiple levels: customized searches can be performed to retrieve genomes and/or genomic segments matching specific features, as described by the query at the desired granularity. Furthermore, thanks to this tool, queries can be performed efficiently, potentially on very large databases. The novelties introduced are exemplified in Section 5, together with a performance study.
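To make the interplay of the two taxonomies concrete, the sketch below performs the two generalization steps used to enter an index from its root: one step along the symbol dimension (e.g. Hv, Hm to H) and one step along the granularity dimension through an up function (genes to region). The taxonomy table and the majority-vote up function are illustrative assumptions, not the domain definitions actually adopted in BIOBITS.

```python
# Illustrative child -> parent table for the symbol taxonomy of Fig. 5.
SYMBOL_PARENT = {"Hv": "H", "Hm": "H", "Lv": "L", "Lm": "L",
                 "H": "Any", "M": "Any", "L": "Any"}

def generalize_symbols(case):
    """One generalization step in the secondary (symbol) dimension."""
    return [SYMBOL_PARENT.get(s, s) for s in case]

def up(symbols):
    """Scale a string of gene-level symbols up to a single region-level symbol
    (leading/granularity dimension); majority vote is just one possible choice."""
    return max(set(symbols), key=symbols.count)

query = ["Hv", "Hv", "Lv"] + ["Hv"] * 11        # a gene-level query, cf. Sect. 5.1
step1 = generalize_symbols(query)                # -> H H L H H H H H H H H H H H
step2 = up(step1)                                # -> 'H' at the region level
print("".join(step1), step2)
```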
4.2
Clustering Modules
In this paper we do not describe in detail all the predictive and exploratory capabilities offered by data mining algorithms. The aim of this section is rather to sketch a portrait built on a single example: clustering. This example makes it possible to show the benefits, in terms of interoperability,
extensibility, and flexibility, offered by a modular system built upon a data warehouse in which a multi-dimensional representation of a ground set of facts is stored. On these data, whenever needed, a query can be issued by the user in order to retrieve from the data warehouse the values of the subset of dimensions of interest. On this initial set of values, multi-level reasoning is possible by exploiting the relationships between facts in the knowledge network. One of the classical aims of clustering is to provide a description of the data by means of an abstraction process. In many applications, the end-user studies natural phenomena through the relative proximity relationships existing among the analyzed objects; for instance, he/she compares organisms by their relative similarity in terms of the features they share with a common reference example. Many Hierarchical Clustering (HC) algorithms have the advantage of producing a dendrogram which stores the history of the merge (or split) operations between clusters. Moreover, the dendrogram produced by a hierarchical clustering algorithm constitutes a useful, immediate and semantically rich conceptual organization of the object space. As a result, HC algorithms produce a hierarchy of clusters, and the relative position of clusters in this hierarchy is meaningful because it implicitly tells the user about the relative similarity between the cluster elements. HC approaches thus help the experts to explore and understand a new problem domain. As regards the exploitation of object distances, clustering algorithms offer immediate and valuable tools to the end-user for biological analysis.
Co-clustering. A kind of clustering algorithm particularly useful in biological domains is co-clustering [14], whose solution simultaneously provides a clustering of the objects and a clustering of the attributes. Furthermore, co-clustering algorithms often exploit similarity measures on the clusters in the other dimension of the problem: that is, clusters of objects are evaluated by means of the clusters on the features, and vice versa. They simultaneously produce a hierarchical organization in two of the problem dimensions: the objects and the features that describe the objects themselves. In many applications both hierarchies are extremely useful and are searched for. In a more formalized view, a co-clustering algorithm is an unsupervised data mining method that computes a bi-partition of a dataset X ∈ R^{n×m}. A bi-partition of a dataset is a triple (R, C, ψ), where R is a partition of the rows (object instances) into |R| subsets, C is a partition of the columns (object attributes) into |C| subsets, and ψ is a relation that associates elements of R with elements of C. An extension of the co-clustering algorithm has been obtained by the introduction of constraints. Constraints are very effective in many applications, including gene expression analysis [34] and sequence analysis [13], since the user can express which type of biological knowledge leads to the association among the clusters of genes (the objects) and the clusters of biological conditions (the attributes). The goal of the constrained co-clustering algorithm is to find a bi-partition such that a given objective function is optimized and a set of user-defined constraints is satisfied. Two kinds of constraints, must-link and cannot-link,
can be exploited. A must-link constraint specifies that two rows (respectively, columns) of X must belong to the same cluster. Conversely, a cannot-link constraint specifies that two rows (respectively, columns) of X cannot belong to the same cluster. In general, satisfying the constraints may decrease the theoretical optimum of the objective function. Notice also that satisfying a conjunction of constraints is not always feasible. A constrained co-clustering algorithm works as follows. During each iteration, it assigns each row to the nearest row cluster that does not violate any cannot-link constraint. If a row is involved in a must-link constraint, the algorithm assigns the whole set of rows involved in this constraint to the selected row cluster, again checking that no cannot-link constraint is violated. This process is iterated until the objective function reaches a desired value, i.e. its decrease is smaller than a user-defined threshold τ. The same process is simultaneously performed over the columns of the matrix.
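The row-assignment step just described can be illustrated with a minimal sketch. It uses a plain Euclidean centroid distance and performs a single assignment pass, so it is only a toy version under stated assumptions (non-empty initial row clusters, symmetric constraints) and not the algorithm of [34]; the same pass would then be applied to the columns of X.

```python
import numpy as np

def assign_rows(X, row_labels, k, must_link, cannot_link):
    """One constrained assignment pass: each must-link group of rows is moved,
    as a whole, to the nearest row-cluster centroid that violates no
    cannot-link constraint. Assumes every cluster 0..k-1 currently has rows."""
    centroids = np.array([X[row_labels == c].mean(axis=0) for c in range(k)])
    cannot = {frozenset(p) for p in cannot_link}

    # Merge rows that are transitively connected by must-link constraints.
    groups = {i: {i} for i in range(len(X))}
    for a, b in must_link:
        merged = groups[a] | groups[b]
        for i in merged:
            groups[i] = merged

    for members in {frozenset(g) for g in groups.values()}:
        profile = X[list(members)].mean(axis=0)
        for c in np.argsort(np.linalg.norm(centroids - profile, axis=1)):
            others = [j for j in range(len(X)) if j not in members and row_labels[j] == c]
            if not any(frozenset((i, j)) in cannot for i in members for j in others):
                row_labels[list(members)] = c
                break
    return row_labels

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 1, 0, 1])
print(assign_rows(X, labels, k=2, must_link=[(0, 1)], cannot_link=[(0, 2)]))
```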
5
Case Study
The recent efforts of several sequencing projects to explore the genomes of organisms from various lineages have provided great resources for comparative genomics. Since the beginning of the postgenomic era, investigators have faced the problem of managing the rising number of public collections of genomes in novel ways [16]. Besides the public databases where sequences are deposited, more specific data warehouses have been developed [23], where the incorporated data types include annotation of (both protein-coding and non-protein-coding) genes, cross references to external resources, and high-throughput experimental data (e.g. data from large-scale studies of gene expression and polymorphism visualised in their genomic context). Additionally, on such platforms, extensive comparative analyses can be performed, both within defined clades and across the wider taxonomy. Furthermore, sequence alignments and gene trees resulting from the comparative analysis can be accessed. Computational challenges in the field of comparative analyses have been overcome [39]. The developed tools have helped in elucidating the genomic structures of prokaryotes at multiple levels [6], leading to a much improved understanding of why a bacterial genome is organized in the way it is. A number of comparative analyses closer to our field of investigation have already shed light on the characterization of the genomes of host-associated and free-living bacteria [41,4,32,11,10]. Novel computational approaches on large-scale datasets provide a new viewpoint for whole-genome analysis and bacterial characterization. For example, the self-attraction clustering approach allowed the classification of Proteobacteria, Bacilli, and other species belonging to Firmicutes [35], whereas the search for protein [18] or genomic [37] signatures has been useful to elucidate the evolutionary relationships among the Gammaproteobacteria and to provide new insights into the evolution of symbiotic diversity, microbial metabolism and host-microbe interactions in sponges. One major focus of comparative sequence analysis is the search for syntenies. The term synteny is used to mean a set of genes that share the same relative
ordering on the genomes of different species. In the BIOBITS project we are interested in synteny between several species, in order to recognize which species are most similar to Ca. G. gigasporarum. The evolutionary relationships of these genomes may allow the identification of homologous genes and non-coding functional elements, such as regulatory elements and protein domains. To this purpose we exploit the BIOBITS system architecture (shown in Section 3) and the Chado modules described in this paper (see Section 4). To show the reliability of our approach we perform a sequence analysis on a well-known bacterial genus.
5.1
Querying for Syntenies in the DCW Region
Following the Data Mart star schema reported in Figure 1, the data related to a bacterium belonging to the genus Burkholderia (i.e. Burkholderia xenovorans) have been extracted. In detail, four tables of the Chado database (i.e. Gene family, Gene annotation, Genome composition, and Organism) are exploited to extract the genes belonging to a specific region called Division Cell Wall (DCW). This region is involved in the synthesis of peptidoglycan precursors and in cell division. The DCW cluster is composed of 14 genes: FtsA, FtsI, FtsL, FtsQ, FtsW, FtsZ, mraW, mraY, mraZ, murC, murD, murE, murF, murG. The prominent feature of the DCW cluster is that it is conserved with a high (H) similarity in many bacterial genomes over a broad taxonomic range. Specifically, even though some bacteria related to Burkholderia xenovorans simply miss one of the 14 genes, all of them maintain a high similarity at the DCW region level with their relatives. Suppose that a user, interested in comparing bacteria on the basis of the DCW cluster content, asks the flexible retrieval system (see Section 4) the following query: Hv Hv Lv Hv Hv Hv Hv Hv Hv Hv Hv Hv Hv Hv, looking for the specific bacteria missing the third gene but very similar to the reference one as regards the other genes. The flexible retrieval system will first generalize the query in the symbol taxonomy dimension (see Figure 5), providing the string HHLHHHHHHHHHHH, and then in the granularity dimension, providing the query H at the region level. Quite naturally, we define the up function as up(HHLHHHHHHHHHHH) = H. This allows the system to enter the index in Figure 7 from its root. Then, following the generalization steps backwards, a node identical to the query can be found, and the ground cases indexed by it can be retrieved. Interactive and progressive query relaxation (or refinement) is supported as well in our framework. In this situation, the distance between the original query and the cases indexed by the other children of the node can be calculated by any distance function satisfying the constraints illustrated in [29] and briefly described above. Query relaxation or refinement can be repeated several times, until the user is satisfied with the size of the retrieval set. In the Burkholderia example, the user may generalize the initial query as an H at the region level, and
also retrieve the cases indexed by HHHHHHHHHHHHHH at the gene level (the other siblings of HHLHHHHHHHHHHH do not index any real case in this specific situation). The cases indexed by HHHHHHHHHHHHHH can thus be listed, clarifying that their distance from the original query is greater than zero. Considering the performance of the Case-Based Reasoning module, tests have been conducted on databases containing different numbers of cases. On the left side of Table 1, we report the time elapsed to generate the multi-dimensional indexing structure from the similarity levels generated by BLAST and properly abstracted. The creation times span from 39 seconds to index 2000 cases to 163 seconds to index 8000 cases. Even if the creation of the structure takes some time, this operation is necessary only when a new database is installed (or when a significant number of new cases is stored); then the flexible and efficient query mechanism can start running. The right side of Table 1 shows the time elapsed to perform a query, which spans from about a tenth of a second on 2000 cases to less than one second on 8000 cases. These experiments were conducted on an Intel Core 2 Duo T9400 processor running at 2.53 GHz, equipped with 4 GB of DDR2 RAM.

Table 1. Execution times to build the multi-dimensional orthogonal index (left) and to execute a query with the index (right)

  Index structure generation from BLAST      Query execution with the index
  N. of cases   Generation time (s)          N. of cases   Query time (s)
  2000          38.969                       2000          0.138
  4000          80.667                       4000          0.333
  6000          121.618                      6000          0.650
  8000          162.241                      8000          0.905
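The query relaxation step discussed above requires a distance that respects the ordering of the symbol taxonomy (e.g. d(L, M) < d(L, Hv)). One simple choice, sketched below, places the symbols on an ordinal scale and sums the per-position gaps; it is only one possible instance of the constraints formalized in [29], not the distance actually used in BIOBITS.

```python
# Ordinal positions for the qualitative symbols (illustrative assumption):
# children of the same parent are adjacent, so the "is-a" ordering is preserved.
RANK = {"Lv": 0, "L": 0.5, "Lm": 1, "M": 2, "Hm": 3, "H": 3.5, "Hv": 4}

def distance(query, case):
    """Sum of per-position ordinal gaps between two symbol strings of equal length."""
    return sum(abs(RANK[q] - RANK[c]) for q, c in zip(query, case))

query   = ["Hv", "Hv", "Lv"] + ["Hv"] * 11   # the DCW query of Sect. 5.1
exact   = ["Hv", "Hv", "Lv"] + ["Hv"] * 11   # a case matching the query exactly
relaxed = ["H"] * 14                          # the sibling case HHHHHHHHHHHHHH
print(distance(query, exact), distance(query, relaxed))   # 0.0 vs 10.0 (> 0)

# Sanity check of the ordering constraint: d(L, M) < d(L, Hv).
print(distance(["L"], ["M"]) < distance(["L"], ["Hv"]))    # True
```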
Protein Domain Mining. Besides the investigation of the biological connections at the gene level using the indexing approach, we are able to exploit the cases deriving from the case representation to extract new analogies among nucleotide sequences. In detail, we query the Chado database to extract all the protein sequences from the obtained cases. Then, we use the co-clustering modules to study the domain/motif composition of the protein sequences. As is well known, the modular nature of proteins offers many advantages: it provides increased stability and new cooperative functions. The use of protein domains in the determination of protein functions has become essential. Several web applications (e.g. Pfam [17], SMART [25], InterPro [31]) are available to provide an overview of the domain architecture of a polypeptide sequence and of the functions that these domains are likely to perform. Even though the cited tools allow one to submit a set of protein sequences as input, they perform the domain analysis considering each sequence as a single entity. As a consequence, the user can obtain only a local view of the domain
composition, instead of a global view that may emphasize the domains characterizing the entire protein set. This fact suggests the need for an automatic tool that manages the results so as to highlight the associations between domains and proteins. For this purpose the BIOBITS system includes a de novo algorithm [13]. It allows the simultaneous association between protein sequences and domains/motifs. In this way we are able to identify a richer set of motifs, each one possibly characterizing only some of the sequences in the whole dataset. The algorithm relies on three steps. First, we generate a prefix tree starting from the sequences in the input dataset. This data structure enables the fast extraction of all the frequent domains of length up to a fixed value w. Then, we exploit a constrained co-clustering algorithm [34] in order to find protein domain classes and the associated protein groups. Finally, we associate the obtained clusters by means of a statistical measure. This measure identifies, for each domain cluster, the corresponding protein cluster containing it; it may leave some protein clusters without any associated domain cluster, or associate a protein cluster with more than one domain cluster. In the presented case study we consider a domain as frequent if it is found in at least 10% of the input sequences, and we set the maximum domain length w equal to 15. The dataset matrix X (defined in Section 4.2) is built using the frequency values stored in the prefix tree. In the definition of the co-clustering constraints, we exploit the Levenshtein distance between two strings. Specifically, we set a must-link constraint on every pair of domains having a distance lower than 2. With this limitation, we consider a must-link between two motifs that require only two string operations (i.e. insertion, deletion or substitution) to transform one motif into the other. Conversely, all the pairs that match by at most two characters are subject to a cannot-link constraint. The stop condition of the co-clustering algorithm is set to τ = 10^{-3}. With the experimental setting described above, we performed two types of experiments. In the first experiment we compose the set of input sequences by combining all the protein sequences of the Burkholderia xenovorans genes belonging to the DCW cluster, stored in the Protein table of the Chado database. The aim of this experiment is the identification of the protein domains common to a DCW cluster gene subfamily. We obtain the six motifs reported in logo representation in Figure 8: panel (a) shows the sequence logo representation of the two domains associated with the fts gene family, while panel (b) reports the sequence logo representation of the four domains associated with the mur gene family. In order to validate the reliability of our approach we compare our results with the biological knowledge reported in the review by Clyde A. Smith [36]. Smith describes the three domain architectures characterizing the mur ligases. Two of these domains have an essentially conserved topology. The author studied in depth the motif composition of one domain, the ATPase domain, which is characterized by a small number of essential structural motifs that include the P-loop motif. The sequence comparisons reported by Smith show the strong conservation of the P-loop motif in all four mur ligases. From our analysis we obtain two motifs strictly
Fig. 8. Sequence logo representation of the motifs obtained by the co-clustering module on Burkholderia xenovorans's DCW cluster protein sequences: panel (a) shows the two motifs associated with the fts gene family, panel (b) the four motifs associated with the mur gene family
related to the mur subfamily: in Figure 8(b) we highlight the residues common to Smith's consensus sequences. In the second experiment, we exploit the Phylogenetic classification table joined to the Protein table stored in Chado. The purpose of this second experiment is to extend our analysis to other species of Burkholderia. In detail, we single out 13 species: Burkholderia cepacia, Burkholderia ambifaria, Burkholderia cenocepacia, Burkholderia multivorans, Burkholderia phytofirmans, Burkholderia vietnamiensis, Burkholderia glumae, Burkholderia xenovorans, Burkholderia dolosa, Burkholderia graminis, Burkholderia phymatum, Burkholderia rhizoxinica and Burkholderia ubonensis. The focus of this task is to extend the previous association between motifs and gene subfamilies to the whole Burkholderia genus. This kind of analysis is linked to the possibility of understanding whether there are one or more domains shared between different species. Figure 9 shows the co-clusters obtained for the genes murC, murD, and murE. Our findings confirm that the gene subfamilies are associated with at least one motif and that this association is shared by all the orthologous sequences in the Burkholderia species. The domains found lead to the identification of homologous genes, which may capture the evolutionary relationships within a genus. The new pieces of information are then stored in the Chado database.
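The constraint-generation rule described above (must-link for motif pairs within a small edit distance, cannot-link for pairs sharing at most two matching characters) can be sketched as follows. The Levenshtein routine is a plain dynamic-programming implementation, the two thresholds are kept as parameters because the exact settings are domain choices, and the motif strings are toy examples.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertion, deletion, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def build_constraints(motifs, ml_threshold=2, max_matches=2):
    """Derive must-link / cannot-link pairs from pairwise motif comparisons."""
    must, cannot = [], []
    for i in range(len(motifs)):
        for j in range(i + 1, len(motifs)):
            a, b = motifs[i], motifs[j]
            if levenshtein(a, b) <= ml_threshold:
                must.append((i, j))
            elif sum(x == y for x, y in zip(a, b)) <= max_matches:
                cannot.append((i, j))
    return must, cannot

motifs = ["AGTHGKTTTT", "TGTNGKTTTT", "QDDHLKGI"]      # toy motif strings
print(build_constraints(motifs))                        # ([(0, 1)], [(0, 2), (1, 2)])
```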
Fig. 9. Co-clusters obtained by performing the co-clustering module on a set of Burkholderia species' DCW cluster protein sequences (one column of species and associated motif for each of the genes murC, murD, and murE)
6
Conclusions
In this paper we reported on the ongoing BIOBITS project, whose goal is to extensively develop a computational genomic comparison (synteny analysis) focused on the genome of the bacterium Ca. Glomeribacter gigasporarum and on arbuscular mycorrhizal fungi genomes. We presented the software architecture, essentially developed on top of an existing software layer provided by the GMOD community. The GMOD system offers powerful data visualization and analysis tools, data warehouse modules such as BioMart, and import modules for the inclusion of data from external, public resources. Furthermore, it contains the Chado database, which provides an extensible and flexible model for any organism species, built upon the generic concept of feature, which can be customized by the use of types and ontologies. We presented the logical data representation of the genomic and proteomic components of the biological problem: it has the form of a double star schema, the first one centered on the genetic fragments composing the genome and the second one on the proteins encoded by the genes.
We then described the main software blocks of the BIOBITS system: a Case-Based Reasoning module and a co-clustering module, which allow the user to retrieve and analyse, in a flexible and intelligent way, the data coming from the multi-dimensional star schema. The two modules complement each other: Case-Based Reasoning, combined with abstraction techniques, retrieves the information at the different abstraction levels needed by the analyst, while co-clustering attaches novel information to genetic sequences by means of computational data mining algorithms. In the last part of the paper, we described a case study showing how these modules inter-operate to provide new information. Interesting results have been obtained and confirmed by other research studies. The confirmed reliability of our approach encourages us to continue our research on the endosymbiont bacterium Candidatus Glomeribacter gigasporarum.
References
1. Acedb, http://www.acedb.org/ 2. Aamodt, A., Plaza, E.: Case-Based Reasoning: foundational issues, methodological variations and systems approaches. AI Communications 7, 39–59 (1994) 3. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990) 4. Bakker, H., Cummings, C., Ferreira, V., Vatta, P., Orsi, R., Degoricija, L., Barker, M., Petrauskene, O., Furtado, M., Wiedmann, M.: Comparative genomics of the bacterial genus Listeria: Genome evolution is characterized by limited gene acquisition and limited gene loss. BMC Genomics 11 (2010) 5. Bellazzi, R., Larizza, C., Riva, A.: Temporal abstractions for interpreting diabetic patients monitoring data. Intelligent Data Analysis 2, 97–122 (1998) 6. Bentley, S., Parkhill, J.: Comparative genomic structure of prokaryotes. Annual Review of Genetics 38, 771–792 (2004) 7. Bianciotto, V., Lumini, E., Bonfante, P., Vandamme, P.: Candidatus Glomeribacter gigasporarum, an endosymbiont of arbuscular mycorrhizal fungi. Int. J. Syst. Evol. Microbiol. 53, 121–124 (2003) 8. BioMart (2003), http://www.biomart.org/ 9. Bonfante, P., Anca, I.: Plants, Mycorrhizal Fungi, and Bacteria: A Network of Interactions. Annu. Rev. Microbiol. 63, 363–383 (2009) 10. Carvalho, F., Souza, R., Barcellos, F., Hungria, M., Vasconcelos, A.: Genomic and evolutionary comparisons of diazotrophic and pathogenic bacteria of the order Rhizobiales. BMC Microbiology 10, 1–12 (2010) 11. Commins, J., Toft, C., Fares, M.: Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects. Biomedical Procedures Online 11, 52–78 (2009) 12. Cordero, F., Ghignone, S., Lanfranco, L., Leonardi, G., Meo, R., Montani, S., Roversi, L.: BIOBITS: A Study on Candidatus Glomeribacter Gigasporarum with a Data Warehouse. In: Plant, C., Böhm, C. (eds.) Database Technology for Life Sciences and Medicine, ch. 10, pp. 203–220 (2011) 13. Cordero, F., Visconti, A., Botta, M.: A new protein motif extraction framework based on constrained co-clustering. In: Proceedings of the 24th Annual ACM Symposium on Applied Computing, pp. 776–781 (2009) 14. Dhillon, I., Mallela, S., Modha, D.: Information-theoretic co-clustering. In: Proceedings ACM SIGKDD 2003, pp. 89–98 (2003)
15. Eilbeck, K., Lewis, S.: Sequence Ontology Annotation Guide. Computational Functional Genomics 5(8), 642–647 (2004) 16. Field, D., Wilson, G., van der Gast, C.: How do we compare hundreds of bacterial genomes? Current Opinion in Microbiology 9, 499–504 (2006) 17. Finn, R., Mistry, J., Schuster-Bckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S., Sonnhammer, E., Bateman, A.: Pfam: clans, web tools and services. Nucleic Acids Res. 34, 247–251 (2006) 18. Gao, B., Mohan, R., Gupta, R.: Phylogenomics and protein signatures elucidating the evolutionary relationships among the Gammaproteobacteria. International Journal of Systematic and Evolutionary Microbiology 59, 234–247 (2009) 19. GenBank (2000), http://www.ncbi.nlm.nih.gov/Genbank/ 20. Han, J., Kamber, M.: Data Mining, Concepts and techniques. Academic press, London (2001) 21. Hu, J., et al.: The ARKdb: genome databases for farmed and other animals. Nucleic Acids Res. 29, 106–110 (2001) 22. Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Castro, E.D., Langendijk-genevaux, P., Pagni, M., Sigrist, C.: The prosite database. Nucleic Acids Res. 34, 227–230 (2006) 23. Kersey, P.J., Lawson, D., Birney, E., Derwent, P.S., Haimel, M., Herrero, J., Keenan, S., Kerhornou, A., Koscielny, G., Kahari, A., Kinsella, R.J., Kulesha, E., Maheswari, U., Megy, K., Nuhn, M., Proctor, G., Staines, D., Valentin, F., Vilella, A.J., Yates, A.: Ensembl genomes: Extending ensembl across the taxonomic space. Nucleic Acids Research (November 2009), http://dx.doi.org/10.1093/nar/gkp871 24. Lazzarato, F., Franceschinis, G., Botta, M., Cordero, F., Calogero, R.: RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets. Bioinformatics 20, 2848–2850 (2004) 25. Letunic, I., Copley, R., Pils, B., Pinkert, S., Schultz, J., Bork, P.: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34, 257–260 (2006) 26. Lumini, E., Ghignone, S., Bianciotto, V., Bonfante, P.: Endobacteria or bacterial endosymbionts? To be or not to be. New Phytol. 170, 205–208 (2006) 27. MAGE Community, MGED Group: MicroArray Gene Expression (MAGE) Project (2000), http://scgap.systemsbiology.net/standards/mage_miame.php 28. Montani, S.: Exploring new roles for case-based reasoning in heterogeneous AI systems for medical decision support. Applied Intelligence 28, 275–285 (2008) 29. Montani, S., Bottrighi, A., Leonardi, G., Portinale, L., Terenziani, P.: Multi-level abstractions and multi-dimensional retrieval of cases with time series features. In: McGinty, L., Wilson, D.C. (eds.) ICCBR 2009. LNCS, vol. 5650, pp. 225–239. Springer, Heidelberg (2009) 30. Moran, N., McCutcheon, A., Nakabachi, P.: Genomics and evolution of heritable bacterial symbionts. Annu. Rev. Genet. 42, 165–190 (2008) 31. Mulder, N., Apweiler, R., Attwood, T., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A., Orchard, S., Orengo, C., Petryszak, R., Selengut, J., Sigrist, C., Thomas, P., Valentin, F., Wilson, D., Wu, C., Yeats, C.: New developments in the InterPro database. Nucleic Acids Res. 35, 224–228 (2007)
32. Ogier, J., Calteau, A., Forst, S., Goodrich-Blair, H., Roche, D., Rouy, Z., Suen, G., Zumbihl, R., Givaudan, A., Tailliez, P., Medigue, C., Gaudriault, S.: Units of plasticity in bacterial genomes: new insight from the comparative genomics of two bacteria interacting with invertebrates, Photorhabdus and Xenorhabdus. BMC Genomics 11, 1–10 (2010) 33. Osborne, B.: GMOD Community: GMOD (2000), http://gmod.org/wiki/Main_Page 34. Pensa, R., Boulicaut, J.F., Cordero, F., Atzori, M.: Co-clustering Numerical Data under User-defined Constraints. Statistical Analysis and Data Mining (2010) 35. Santoni, D., Romano-Spica, V.: Comparative genomic analysis by microbial COGs self-attraction rate. Journal of Theoretical Biology 258, 513–520 (2009) 36. Smith, C.A.: Structure, Function and Dynamics in the mur Family of Bacterial Cell Wall Ligases. Journal of Molecular Biology 362, 640–655 (2006) 37. Thomas, T., Rusch, D., DeMaere, M., Yung, P., Lewis, M., Halpern, A., Heidelberg, K., Egan, S., Steinberg, P., Kjelleberg, S.: Functional genomic signatures of sponge bacteria reveal unique and shared features of symbiosis. ISME Journal 4, 1557–1567 (2010) 38. Watson, I.: Applying Case-Based Reasoning: techniques for enterprise systems. Morgan Kaufmann, San Francisco (1997) 39. Xu, Y.: Computational Challenges in Deciphering Genomic Structures of Bacteria. Journal of Computer Science and Technology 25, 53–73 (2009) 40. Shahar, Y.: A framework for knowledge-based temporal abstractions. Artificial Intelligence 90, 79–133 (1997) 41. Zucko, J., Dunlap, W., Shick, J., Cullum, J., Cercelet, F., Amin, B., Hammen, L., Lau, T., Williams, J., Hranueli, D., Long, P.: Global genome analysis of the shikimic acid pathway reveals greater gene loss in host-associated than in free-living bacteria. BMC Genomics 11 (2010)
[KD3] A Workflow-Based Application for Exploration of Biomedical Data Sets

Andreas Dander1,2,3,*,**, Michael Handler4,**, Michael Netzer4, Bernhard Pfeifer4, Michael Seger4, and Christian Baumgartner4

1 Institute for Bioinformatics and Translational Research, UMIT, Hall in Tirol, Austria
2 Oncotyrol, Center for Personalized Cancer Medicine, Innsbruck, Austria
3 Biocenter, Division for Bioinformatics, Innsbruck Medical University, Innsbruck, Austria
[email protected]
4 Institute of Electrical, Electronic and Bioengineering, UMIT, Hall in Tirol, Austria
{michael.handler,michael.netzer,bernhard.pfeifer,michael.seger,christian.baumgartner}@umit.at
Abstract. Based on the biotechnological revolution of the past years, molecular biology has become increasingly data-driven. Knowledge Discovery in Databases, a well-known process in the field of bioinformatics, supports the biological research process from data integration and knowledge mining to data interpretation. This work proposes a new software suite, termed Knowledge Discovery in Databases Designer (KD3), covering the complete Knowledge Discovery in Databases process using a workflow-oriented architecture. Three different application-oriented modules are implemented in KD3: first, the Designer for designing specific workflows. These workflows can be used by the Interpreter, which allows users to load and parameterize existing workflows. The Launcher encapsulates one dedicated workflow into an independent application to answer one specific biomedical question. KD3 offers a variety of implemented methods, which can be easily extended with new customized components using functional objects. All components can be connected into workflows, which may contain elements of other applications. Keywords: Knowledge Discovery in Databases, Data Mining, Biomedical Data Exploration, Statistical Analysis, Workflow, Bioinformatics, Software.
1
Introduction
1.1
Motivation
In the past years, huge advances in high-throughput technologies, such as microarrays, mass-spectrometry or high-throughput sequencing have led to a rapid
* Corresponding author. ** These authors contributed equally to this work.
growth in the amount of data. Furthermore, publicly available databases are collecting data generated in biological or clinical studies and partly provide free access to this data pool. Examples of such databases are ArrayExpress [27], Gene Expression Omnibus (GEO) [6], or the Sequence Read Archive (SRA) [22]. Knowledge Discovery in Databases (KDD) is a crucial process to support biomedical research and provides a way to manage these enormous data sets, beginning with data integration, followed by the analysis of the data and the biological interpretation of the findings. One possible application of KDD is in the field of biomarker discovery. Disease biomarkers are indicators which facilitate diagnosis, aid in determining the severity of disease, or allow clinicians to assess the response to therapy. As standard spreadsheet applications are not able to deal with the complexity of this process and are not suited to handle the massive amount of data, new tools need to be developed. A new workflow-based knowledge discovery tool, termed Knowledge Discovery in Databases Designer (KD3), is proposed here. KD3 supports the KDD process including data integration, data preprocessing and visualization, and can readily be extended by bioinformaticians and software developers. Three different application-oriented modules of KD3 are introduced: (i) the Designer, which allows the development and implementation of new algorithms and the modeling of new, sophisticated data analysis workflows; (ii) the Interpreter, to address a variety of research questions employing pre-defined workflows with the possibility to change parameters; and (iii) the Launcher, an automatically generated, tailored stand-alone application, which is used for executing a dedicated workflow to answer one specific question. The purpose of the Launcher is to easily distribute designed workflow solutions to research partners.
1.2
Knowledge Discovery in Databases
Fayyad et al. [11] defined the interactive and iterative KDD process as follows: “KDD is the nontrivial process of identifying valid, novel, potentially useful, and understandable patterns in data”. This process involves numerous steps with many decisions to be made by the user. In the following these steps are briefly summarized. 1. Research Question: Researchers need to develop a basic understanding of the application domain and identify the aim of the KDD process regarding their research question. 2. Selection of Data of Interest: In this step researchers need to select a target data set from a study, which contains the data of interest for the discovery process. The subsequent analysis is carried out by focusing on a subset of variables or data samples. 3. Preprocessing and Cleaning: Operations such as noise reduction, handling of missing values and outliers, data transformation, homogenization of different domains, or the removal of redundant attributes are performed in this step.
4. Data Mining: A key task of the knowledge discovery process is data mining. In this step researchers need to choose between different data mining tasks, such as feature selection, classification (supervised mining), cluster analysis (unsupervised mining), regression analysis or the use of meta models for identifying or classifying novel patterns in the data. 5. Interpretation/Evaluation: In the final step researchers need to evaluate and interpret the mined patterns or possibly reiterate previous steps if the results are not plausible. Subsequently, the new knowledge is documented, reported, or published to interested parties, including a comparison with established knowledge. KD3 [30] has been developed to provide researchers with a tool that covers the entire KDD process including data selection, preprocessing, transformation, data mining, and visualization of findings. Alternative tools that support the entire KDD process, or at least parts of it, comprise different packages in R [13], the open source tools Orange [9], Weka [15], KNIME [8], RapidMiner [24], and Tanagra [32], as well as commercial products such as PASW Modeler [1] from SPSS and Spotfire Miner [3] from TIBCO. In contrast to all previously mentioned tools, KD3 uses three modules (Designer, Interpreter and Launcher) for different user groups, facilitating cooperative work and collaborations. Once a workflow has been designed and validated, a stand-alone application for the defined workflow, called Launcher, can be generated using the Designer. After the Launcher is generated, it can be distributed to project partners as a single executable Java Archive (jar) file.
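The way these five steps chain together in a workflow tool can be pictured with a small sketch; the function bodies, the column name and the file name below are placeholder assumptions, not part of KD3.

```python
import csv
from statistics import mean, stdev

def select(path, columns):
    """Step 2: load the target data set and keep only the variables of interest."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return [{c: float(r[c]) for c in columns} for r in rows]

def preprocess(data, column):
    """Step 3: a single cleaning operation, here a z-transformation."""
    mu, sigma = mean(r[column] for r in data), stdev(r[column] for r in data)
    for r in data:
        r[column] = (r[column] - mu) / sigma
    return data

def mine(data, column):
    """Step 4: a deliberately trivial 'pattern': the share of above-average samples."""
    return sum(r[column] > 0 for r in data) / len(data)

# Steps 1 (research question) and 5 (interpretation) remain human activities;
# the pipeline only automates the middle steps, e.g.:
# result = mine(preprocess(select("breath_gas.csv", ["CH4"]), "CH4"), "CH4")
```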
2
Implementation
As platform independence was an essential prerequisite for developing the KD3 application, it has been developed using the programming language Java 1.6, with Swing selected for designing the components of the graphical user interface (GUI). KD3 allows software engineers to extend the functionality of the framework simply by encapsulating new algorithms in functional objects (FOs) (e.g. the Load CSV FO for accessing the data of a CSV file, shown in both workflows of Fig. 2). Newly developed FOs just need to be copied into a specific directory, and KD3 integrates them like plug-ins by using the Java Reflection API. Additionally, the GUI for each integrated FO is generated automatically and can be configured with annotations used in the integrated FO. Due to the modular design of KD3, other software tools such as Weka or R can be easily encapsulated within FOs and incorporated into workflows. Based on this feature, different software products can be connected with newly developed algorithms, well-established methods, or other applications within a workflow.
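KD3 performs this plug-in loading with the Java Reflection API; purely as a language-agnostic illustration of the same pattern, and not of KD3's actual API, the Python sketch below scans a plug-in directory and registers every class that exposes a run method.

```python
import importlib.util
import pathlib

def discover_functional_objects(plugin_dir):
    """Load every *.py file in plugin_dir and collect the classes exposing run().
    This mimics, in Python, the reflection-based plug-in loading described above."""
    registry = {}
    for path in pathlib.Path(plugin_dir).glob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name in dir(module):
            obj = getattr(module, name)
            if isinstance(obj, type) and callable(getattr(obj, "run", None)):
                registry[name] = obj
    return registry

# Hypothetical usage: the plug-in directory, FO name and file name are assumptions.
# fos = discover_functional_objects("plugins/")
# table = fos["LoadCSV"]().run("data.csv")
```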
2.1
Workflow Execution
A workflow is defined as a network of connected FOs; in order to be valid and executable it has to fulfill the following two criteria: first, the workflow must include start points, which are FOs without any preliminary dependency, and second, the workflow must not contain cycles, i.e. when traversing the connected FOs, no element may be passed more than once (see Fig. 1).
Fig. 1. Cycle in KD3 workflow. The execution of this workflow is not possible due to a cycle between Functional Object 2 and Functional Object 3.
If a workflow does not fulfill these criteria, its execution is not initiated. Otherwise, launching the valid workflow triggers all start points. As each FO runs in its own thread, workflow paths can be executed in parallel and are independent of each other.
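The two validity criteria amount to requiring at least one node without incoming edges and the absence of cycles, which can be checked with a topological traversal (Kahn's algorithm). The sketch below does exactly that; the edge-list representation and node names are assumptions for illustration, not KD3's internal data structures.

```python
from collections import deque

def is_valid_workflow(nodes, edges):
    """edges: (upstream_fo, downstream_fo) dependencies. The workflow is valid
    iff at least one start point exists and a topological order covers all nodes."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    if not queue:                       # no start point at all
        return False
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return visited == len(nodes)        # False if a cycle left nodes unvisited

# The cyclic example of Fig. 1 (FO2 -> FO3 -> FO2) is rejected:
print(is_valid_workflow(["FO1", "FO2", "FO3"],
                        [("FO1", "FO2"), ("FO2", "FO3"), ("FO3", "FO2")]))  # False
```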
2.2
Automatically Generated Tailored Stand-Alone Applications
For each workflow a new, independent stand-alone application, called Launcher, can be built using the Designer. As a result, a tailored Java archive (jar) file for executing a specific workflow is available. This jar file contains all necessary Java classes and libraries for the specified workflow and is executable on any operating system with Sun Java 1.6 or newer installed. With these customized applications, users are able to provide the community with data analysis workflows for specific biomedical questions.
2.3
Implemented Functional Objects in KD3
This section provides an overview of key FOs used in the mentioned steps of the KDD process. As the selection step accesses data from different sources, a number of FOs have been developed to assist this step. Various file formats, such as CSV (Comma-Separated Values), XLS (Microsoft Excel File), or ARFF (Attribute-Relation File Format), can be opened, read and saved. A Querybuilder FO is available for connecting to different types of relational databases, such as MySQL or PostgreSQL.
A library for basic data table operations has been developed. It supports sorting and merging data tables, and includes FOs for basic transformation processes like transposing or splitting tables, and removing columns and lines of a table. For the preprocessing step, various FOs have been developed to support data transformation by methods such as the z-transformation based on mean and standard deviation, the logarithm, and other mathematical scaling methods. In addition, KD3 provides several FOs for descriptive statistics, outlier detection, hypothesis testing, and normalization. A suite of FOs is available for the data mining step. For feature selection, algorithms such as the Information Gain [31], ReliefF [20], the Biomarker Identifier [7], and Stacked Feature Ranking (SFR) [26] have been implemented. In addition, the KD3 library provides FOs for statistical analyses, such as ANOVA [4], the Kruskal-Wallis test [21], Student's t-test [4], and many more. Furthermore, there are FOs that support clustering analysis, such as KMeans [23], hierarchical clustering [18], OPTICS [5], and others. Moreover, correlation analysis using the JSC library [2] is supported as well. KD3 supports users in the evaluation step with different charts for the subsequent visualization of these data mining results, which enables users to find and interpret interesting patterns. This step is therefore assisted by diverse visualization techniques [14] such as scatter plots [4], histograms [19], box plots [33], or ROC plots [4].
3
Results and Discussion
A new workflow-based knowledge discovery tool (KD3) has been developed. KD3 supports users from different research fields, including data warehouse developers, statisticians, data miners, software engineers, biologists, and physicians. The software package consists of three modules for different application areas: firstly, the Designer, which allows an easy development, implementation, and testing of new algorithms, workflows and procedures; secondly, the Interpreter, to address specific questions by loading and executing defined workflows; the third module, called Launcher, is used for executing a dedicated workflow on a specific problem and can be generated automatically using the Designer. To this end, the Launcher incorporates all necessary resources in one executable jar file. This application can simply be shared with the research community, and can also be added as supplementary material to a publication. New algorithms can be integrated into the system by encapsulating them within Java classes. Using defined annotations, the GUI for each FO can be generated automatically, so there is no need to develop a special GUI for each algorithm. Nevertheless, developers with experience in the development of user interfaces using the Java API can create their own customized GUIs for FOs. FOs are provided for a variety of tasks: for example, different external types of data (flat files and databases) can be loaded and connected. The user is able to select from a collection of preprocessing and transformation procedures,
data mining algorithms, and various visualization methods. An example workflow and an introduction on how to design new workflows are available at http://kd3.umit.at/. Additionally, a guideline for developing new FOs ("Hitchhiker's Guide for Generating FunctionalObjects") is provided on the same webpage.
3.1
Application Example
This section summarizes a recently published study using KD3 . The aim of this study, published by Netzer et al. [26], was to find breath gas biomarkers that can be used to distinguish between patients with liver disease and healthy controls. Data was generated by analyzing breath gas using ion molecule reaction mass spectrometry (IMR-MS) [17]. Fig. 2 depicts a screenshot of a typical KD3 application, where both shown workflows were used for identifying breath gas marker candidates in liver disease. The left panel shows a list of available algorithms, either KD3 specific or integrated from external applications. One visualization window and two different exemplary data analysis workflows are shown in the middle of the screen. The selected FO can be parameterized in the right panel, and after executing the workflow a small preview at the bottom of the right panel displays the result of the selected FO.
Fig. 2. Screenshot of a typical workflow in KD3 . A tree of available FOs is displayed in the left panel. Information about the currently selected FO (Detect IQR Outlier) is shown in the right panel (top: location, middle: configuration, bottom: result visualization). Two workflows and two plots as a result of the FO ViewPlotGenerator are depicted in the center.
The workflow shown in the left panel reads the data from a CSV file and removes outliers in the next step. In the selected FO Detect IQR Outlier, outliers are defined as observations outside the range [Q1 − k · IQR; Q3 + k · IQR], where Q1 and Q3 are the first and third quartiles, and IQR = Q3 − Q1 is the interquartile range. The parameter k configures the threshold for defining outliers. This parameter was set to 3 in order to remove, by definition, "strong" outliers [26]. The next step in the search for clinically relevant biomarker candidates was to rank 114 features (volatile organic compounds in the breath gas) using the Stacked Feature Ranking (SFR) algorithm. It revealed that CH4 is the top-ranked marker candidate for distinguishing between non-alcoholic fatty liver disease (NAFLD) and controls. In parallel to SFR, the linear correlations between gas compounds were calculated. Note that high correlations between breath gas compounds may indicate unexpected and previously unrecognized relations between chemical reactions, which is useful for interpreting findings. Finally, the two different results for SFR and linear correlation are exported to independent CSV files. Note that the two independent paths using the StackedFeatureRanking and the LinearCorrelationFinder are processed simultaneously. The workflow in the right panel has been used for visualization: the left plot shows the ROC curve for CH4 as the feature of interest, and the right plot shows a scatter plot of CH4 together with acetylene. Interestingly, with the given workflows new promising breath gas candidates for liver disease were identified, some of which could be verified in the literature. In this study, SFR was also compared with other popular data mining algorithms for feature selection (i.e. Information Gain, ReliefF, Biomarker Identifier, and statistical testing), all of them available in KD3. The results demonstrated that SFR significantly outperforms the other methods in terms of the area under the ROC curve (AUC), and SFR is proposed as a powerful tool for the search for biomarker candidates in high-throughput mass spectrometry data.
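The outlier rule of the Detect IQR Outlier FO can be reproduced in a few lines; the sketch below uses numpy's percentile function and k = 3, matching the "strong outlier" setting described above, while the toy CH4 values are invented for illustration.

```python
import numpy as np

def remove_iqr_outliers(values, k=3.0):
    """Keep only the observations inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values >= lower) & (values <= upper)]

ch4 = [1.8, 2.1, 2.0, 1.9, 2.2, 14.5]      # toy concentrations with one extreme value
print(remove_iqr_outliers(ch4, k=3.0))      # the extreme value 14.5 is dropped
```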
3.2 Discussion
Because a proper evaluation of biomedical data involves many different steps and possible analysis paths, workflows for performing analytical and computational tasks are important. KD3 has already been successfully used in several research projects dealing with biomarker discovery [7,16,25,26,28]. One of the main advantages is that the various workflow-design modules enable great usability by providing a preview of results for each step of the workflow. All integrated algorithms can be connected and combined within constructed workflows. Additionally, the application supports parallel calculation of workflow paths to allow optimal performance of each KDD workflow in question. A further strength of KD3 is the possibility to develop and implement new algorithms easily by using so-called functional objects (FOs). Newly developed algorithms can easily be compared with already established ones by creating branches in the workflow.
External software tools and packages, as described in Sect. 1.2, can simply be integrated into KD3 as plug-ins using FOs. Therefore, different algorithms implemented in different tools can be compared with each other and with newly developed methods. Several data mining methods that already exist in Weka have been integrated using this methodology. Furthermore, the advantages of different tools can be combined into a single workflow. Since generic types are used in the application for transferring data between different FOs, KD3 can also be applied in other research areas, such as image processing [10], signal processing, chemoinformatics [12], simulation [29], and diverse fields of bioinformatics. At present, more than 100 different FOs supporting the KDD process (i.e., data mining algorithms, graphical representation and data handling tools, tools for executing external processes, query builder components, and others) are available, showing that the development and integration of new FOs is an ongoing process.
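The paper does not show the FO programming interface itself, but the idea of chaining functional objects can be pictured with a minimal, purely hypothetical sketch: each FO consumes a generic input, produces a generic output, and can therefore be composed into a linear workflow path. The class and function names below are illustrative, not KD3's actual API.

```python
from typing import Any, Iterable

class FunctionalObject:
    """Hypothetical stand-in for a KD3 functional object: configure once, then execute."""
    def __init__(self, **params: Any) -> None:
        self.params = params

    def execute(self, data: Any) -> Any:
        raise NotImplementedError  # concrete FOs implement the actual algorithm

def run_workflow(fos: Iterable[FunctionalObject], data: Any) -> Any:
    """Feed the output of each FO into the next one, forming a workflow path."""
    for fo in fos:
        data = fo.execute(data)
    return data
```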
4 Conclusions
The Knowledge Discovery in Databases Designer (KD3) covers the entire KDD process by being able to analyze different types of biomedical data. The proposed concept of three different modules (Designer, Interpreter, Launcher) is one of the key features of KD3, as it provides great flexibility when using the application. Software engineers can leverage already implemented methods and easily integrate new algorithms. As it is straightforward to integrate external applications into KD3, our tool simplifies the validation of new algorithms by comparing them with established ones, including methods of several other applications. As a result, KD3 offers a variety of different methods and applications and allows developers and researchers to encapsulate other software tools. All components can be connected into tailored workflows, which enable users to answer specific biomedical questions, as demonstrated in the example of biomarker discovery in breath gas samples. KD3 is proposed as a powerful software tool for developers and users which greatly supports the development of new methods and the discovery of novel knowledge from biological and clinical data.

Acknowledgements. The authors would like to thank Stephan Pabinger for his constructive comments and proofreading of the final manuscript. This work was supported by the COMET Center ONCOTYROL and funded by the Federal Ministry for Transport, Innovation and Technology (BMVIT) and the Federal Ministry of Economics and Labour (BMWA), the Tiroler Zukunftsstiftung (TZS) and the Styrian Business Promotion Agency (SFG) [and supported by the University for Health Sciences, Medical Informatics and Technology, the Graz University of Technology and Biomax Informatics AG.], and the Austrian Genome Research Program GEN-AU (Bioinformatics Integration Network, BIN III).
References
1. IBM SPSS Modeling Family, http://www.spss.com/software/modeling/
2. Java Statistical Classes, http://www.jsc.nildram.co.uk/
3. TIBCO Spotfire Miner, http://spotfire.tibco.com/products/data-mining-applications.aspx
4. Altman, D.: Practical Statistics for Medical Research. Chapman & Hall/CRC (1991)
5. Ankerst, M., Breunig, M., Kriegel, H.P., Sander, J.: OPTICS: ordering points to identify the clustering structure. SIGMOD Rec. 28(2), 49–60 (1999)
6. Barrett, T., Troup, D., Wilhite, S., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I., Soboleva, A., Tomashevsky, M., Marshall, K., et al.: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Research 37(database issue), D885 (2009)
7. Baumgartner, C., Lewis, G., Netzer, M., Pfeifer, B., Gerszten, R.: A new data mining approach for profiling and categorizing kinetic patterns of metabolic biomarkers after myocardial injury. Bioinformatics 26(14), 1745–1751 (2010)
8. Berthold, M., Cebron, N., Dill, F., Gabriel, T., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B.: KNIME: The Konstanz Information Miner. In: Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, Heidelberg (2007)
9. Demsar, J., Zupan, B., Leban, G.: Orange: From Experimental Machine Learning to Interactive Data Mining. Tech. rep., Faculty of Computer and Information Science, University of Ljubljana (2004)
10. Dougherty, G.: Digital Image Processing for Medical Applications. Cambridge University Press, New York (2009)
11. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Magazine 17, 37–54 (1996)
12. Gasteiger, J., Engel, T.: Chemoinformatics: A Textbook, 1st edn. WILEY-VCH, Chichester (2003)
13. Gentleman, R.: R Programming for Bioinformatics. Chapman & Hall/CRC Computer Science and Data Analysis (2008)
14. Grinstein, G., Ward, M.: Introduction to data visualization. In: Fayyad, U., Grinstein, G., Wierse, A. (eds.) Information Visualization in Data Mining and Knowledge Discovery, vol. 1, pp. 21–45. Morgan Kaufmann Publishers Inc., San Francisco (2002)
15. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter 11(1), 10–18 (2009)
16. Herbig, J., Seger, M., Kohl, I., Mayramhof, G., Titzmann, T., Preinfalk, A., Winkler, K., Dunkl, J., Pfeifer, B., Baumgartner, C., Hansel, A.: Online breath sampling with PTR-MS - A setup for large screening studies. In: Proc. 4th Int. Conf. on Proton Transfer Reaction Mass Spectrometry and Its Applications (2009)
17. Hornuss, C., Praun, S., Villinger, J., Dornauer, A., Moehnle, P., Dolch, M., Weninger, E., Chouker, A., Feil, C., Briegel, J., et al.: Real-time monitoring of propofol in expired air in humans undergoing total intravenous anesthesia. Anesthesiology 106(4), 665 (2007)
18. Johnson, S.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254 (1967)
19. Kaminsky, F., Benneyan, J., Davis, R., Burke, R.: Statistical control charts based on a geometric distribution. Journal of Quality Technology 24(2), 63–69 (1992)
20. Kononenko, I.: Estimating attributes: analysis and extensions of RELIEF. In: Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 171–182. Springer, Heidelberg (1994)
21. Kruskal, W., Wallis, W.: Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association 47, 583–621 (1952)
22. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. [Epub ahead of print] (2010)
23. MacKay, D.: An Example Inference Task: Clustering. In: MacKay, D. (ed.) Information Theory, Inference, and Learning Algorithms, vol. 1, pp. 284–292. Cambridge University Press, Cambridge (2003)
24. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: Rapid Prototyping for Complex Data Mining Tasks. In: Ungar, L., Craven, M., Gunopulos, D., Eliassi-Rad, T. (eds.) KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940. ACM, New York (2006)
25. Millonig, G., Praun, S., Netzer, M., Baumgartner, C., Dornauer, A., Mueller, S., Villinger, J., Vogel, W.: Non-invasive diagnosis of liver diseases by breath analysis using an optimized ion-molecule reaction-mass spectrometry approach: a pilot study. Biomarkers 15(4), 297–306 (2010)
26. Netzer, M., Millonig, G., Osl, M., Pfeifer, B., Praun, S., Villinger, J., Vogel, W., Baumgartner, C.: A new ensemble-based algorithm for identifying breath gas marker candidates in liver disease using ion molecule reaction mass spectrometry. Bioinformatics 25(7), 941–947 (2009)
27. Parkinson, H., Kapushesky, M., Kolesnikov, N., Rustici, G., Shojatalab, M., Abeygunawardena, N., Berube, H., Dylag, M., Emam, I., Farne, A., et al.: ArrayExpress update - from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Research 37(database issue), D868 (2009)
28. Pfeifer, B., Aschaber, J., Baumgartner, C., Modre, R., Dreiseitl, S., Schreier, G., Tilg, B.: A data warehouse for prostate cancer biomarker discovery. In: Cohen-Boulakia, S., Tannen, V. (eds.) DILS 2007. LNCS (LNBI), vol. 4544, p. 9. Springer, Heidelberg (2007)
29. Pfeifer, B., Kugler, K., Tejada, M., Baumgartner, C., Seger, M., Osl, M., Netzer, M., Handler, M., Dander, A., Wurz, M., Graber, A., Tilg, B.: A Cellular Automaton Framework for Infectious Disease Spread Simulation. The Open Medical Informatics Journal 2, 58–69 (2008)
30. Pfeifer, B., Tejada, M., Kugler, K., Osl, M., Netzer, M., Seger, M., Modre-Osprian, R., Schreier, G., Tilg, B.: A Biomedical Knowledge Discovery in Databases Design Tool - Turning Data into Information. In: eHealth (2008)
31. Quinlan, R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
32. Rakotomalala, R.: TANAGRA: un logiciel gratuit pour l'enseignement et la recherche. In: Sloot, P.M.A., Hoekstra, A.G., Priol, T., Reinefeld, A., Bubak, M. (eds.) EGC 2005. LNCS, vol. 3470, pp. 697–702. Springer, Heidelberg (2005)
33. Williamson, D., Parker, R., Kendrick, J.: The Box Plot: A Simple Visual Method to Interpret Data. Annals of Internal Medicine 110(11), 916–921 (1989)
A Secured Collaborative Model for Data Integration in Life Sciences
Hasan Jamil
Department of Computer Science, Wayne State University, USA
[email protected]
Abstract. Life Sciences research extensively and routinely uses external online databases, tools and applications for the implementation of computational pipelines. These applications are among the truly distributed and highly collaborative global systems in existence. Since the resources these applications use are designed to serve individual users, they adopt an all-or-nothing model in which users necessarily have to accept the entire response even though only a fraction of the response is relevant. In computational pipelines involving several databases and complex repeated operations, the costs due to unnecessary data transmissions and computations can be significant enough to reduce productivity and make the applications sluggish. Since these resources are autonomous and do not accept user instructions or queries, users are not able to customize their behavior in order to reduce network latency and wasteful computation or data transmission. Obviously, such a resource utilization and sharing model is wasteful and expensive. In this paper, our goal is to propose a new collaborative data integration and computational pipeline execution model for systems biology research. We show that in our envisioned model, arbitrary sites are able to accept user constraints and limited processing instructions to avoid wasteful computation, resulting in improved overall efficiency. We also demonstrate that the proposed collaborative model does not breach site security or infringe upon its autonomy.
1 Introduction
Systems biology computational pipelines most often use complicated workflows or application logic that access many distributed databases, online tools and data analysis software in local machines. The distributed nature of these pipelines poses significant data integration challenges that the community has been trying to address for quite some time. The capabilities and the flexibility available to a computational pipeline largely depend upon the sophistication of the adopted model. The adopted model also determines the overall cost of data analysis for an application, which includes the cost of maintenance in addition to the run time query execution cost. Traditionally, most systems biology applications warehouse remote data and needed tools on local machines by manually reconciling the schema heterogeneity
and establishing schema correspondence through schema mapping. Additionally, format transformation functions are implemented to adjust to the formatting needs of various analysis tools and software residing in local machines. Since the data and tools reside in local machines, the user can exercise total control and is able to design applications that are computationally optimized for performance. From a purely computational standpoint, this warehousing model delivers the best run time efficiency among all possible data integration models. However, the run time performance comes at a huge cost of maintenance, update latency and flexibility. Since data and tools are maintained locally, the user must assume responsibility for ensuring that the warehoused data and tool suite used are current. This maintenance, known as view maintenance, is extremely complicated and expensive when updates are frequent at the source. Needless to say, using obsolete versions may lead to incorrect computation and a significant waste of resources, since the findings will have to be discarded once the version mismatch is discovered. Furthermore, depending on how the source communicates updates to its users, discovering the changes may mean replacing the entire warehouse¹. Since the warehouse often needs to employ sophisticated update discovery and maintenance systems to stay current, usually large databases are updated less frequently as it involves downloading a substantial amount of remote data over the network. The warehousing approach also suffers from a lack of flexibility in choices and scope. Since all data and tools have to be downloaded onto local machines, substantial hardware, software and human resources become necessary. This limits what analysis can be supported by the warehouse, because any analysis or study that requires resources not maintained in the warehouse becomes impossible. An alternative has been to write applications by dedicated communication with the data sources, again manually mediating the schema. While this approach removes the physical downloading of the source contents and offers currency, it still requires manual mediation, coping with changes in the source, and writing source-specific glue code that cannot be reused. The basic assumption here is that the sources are autonomous and offer a "use as you see" and hands-off support, i.e., the application writer receives no support in any form from the sources other than access. There has been a significant effort to alleviate the burden on the application writers for this alternative approach by developing libraries in popular scripting languages such as Perl and PHP for accessing and using popular resources such as GenBank [4], UCSC [23], PDB [5], etc. These sources can change their structures without invalidating these libraries, and we have to necessarily write applications using the sources for which tested scripts are available. Consequently, applications that demand change, access to new resources, are transient or ad hoc, and not ready to commit to significant maintenance overhead (as in the former approach) remain ill served.
¹ This is possible when the source does not identify which data item has been updated and how.
A slightly more secure approach is to allow access to the database, called the hidden web, through predefined web forms so that users can access the content in a predefined way. Usually the response to such form submissions is returned in HTML. In this environment, users are not only required to use sophisticated applications to fill out forms, gather responses and extract the results in meaningful ways, but also to transport the extracted data to databases in their local machines for secondary use. The interaction follows a client-server model in which the client application interacts with the remote sites individually such that the sites are unaware of any other interactions. Figure 1 shows the data integration, site interaction or computational pipeline model of the two approaches discussed above. As shown, hidden web sites serve as "black-box" service providers without any form of collaboration. User applications are responsible for reconciling schema and data heterogeneity, and applications must accept the entire responses these sites generate for queries submitted through the forms. Hence, filtering and preprocessing data for onward submission to another site in the subsequent stage is the application's responsibility. Such a model ignores the cost of computation and transmission of irrelevant data, which is often significant.
Fig. 1. Schematic view of traditional computational pipelines using hidden web
In this paper, we propose a new collaborative approach to on-the-fly autonomous information integration that removes several of the hurdles in accessing Life Sciences resources, at a throwaway cost and without any need for strict coupling or dependence among the sources and applications. Our goal is to preserve site autonomy and maintain the site's existing security policies, yet offer protocols to reduce unwanted computation and eliminate irrelevant data transmission to improve efficiency. We exploit ideas from query plan generation and distributed transaction processing in databases to develop a collaborative data integration model that is sensitive to site security vulnerabilities. The strategy we adopt is to require sites to accept additional information and instructions relevant to a web interface and to provide assistance in carrying out client instructions without compromising established security policies, toward improving the overall performance of user applications.
2 A Novel On-the-fly Data Integration Model
Traditional data integration approaches based on global-as-view (GAV) or local-as-view (LAV) often become limiting or inapplicable in many emerging applications, especially where conventional assumptions do not apply. For example, to follow either a GAV or LAV based integration model, it is essential that the mapping between the source and global schema is well defined, possibly well in advance. Examples of applications based on LAV and GAV models include BioGuideSRS [7] and MetaQuerier [17] respectively. But many applications, especially in Life Sciences, do not require such a stringent view of sources, or even meet the assumptions of GAV and LAV based integration models. Data integration in this domain predominantly takes the form of data heterogeneity as opposed to schema heterogeneity, and a computational pipeline where data moves from one site to the next, which possibly includes reconciliation of schema heterogeneity. We explain these issues in the following section using a real-life application in Life Sciences in the context of the data integration system LifeDB [6] and its query language BioFlow [21] developed in our laboratory. LifeDB stands apart from other integration systems such as BioGuideSRS and MetaQuerier in the sense that it does not actually advocate a GAV or LAV type schema integration. Instead, LifeDB views data integration from the standpoint of on-the-fly schema mediation, black-box information access through web forms and computational pipeline implementation where no global a priori scheme exists.
2.1 LifeDB Data Integration System: BioFlow by Example
To illustrate the capabilities of LifeDB, we adapt a real-life Life Sciences application discussed in [16], which has been used as a use case for many other systems and as such can be considered a benchmark application for data integration. A substantial amount of glue code was written to implement the application in [16] by manually reconciling the source schema to filter and extract the information of interest. Our goal in this section is to show how simply and efficiently this application can be developed in LifeDB. The query, or workflow, the user wants to submit is the hypothesis: "the human p63 transcription factor indirectly regulates certain target mRNAs via direct regulation of miRNAs". If positive, the user also wants to know the list of miRNAs that indirectly regulate other target mRNAs with a high enough confidence score (i.e., pValue ≤ 0.006 and targetSites ≥ 2), and so he proceeds as follows. He collects 52 genes along with their chromosomal locations (shown partially in figure 2(a) as the table genes) from a wet lab experiment using the host miRNA genes and maps at or near genomic p63 binding sites in the human cervical carcinoma cell line ME180. He also has a set of several thousand direct and indirect protein-coding genes (shown partially in figure 2(d) as the table proteinCodingGenes) which are the targets of p63 in ME180 as candidates. The rest of the exploration thus proceeds as follows. He first collects a set of genes (geneIDs) for each of the miRNAs in the table genes, from the web site www.microrna.org by submitting one miRNA at a time
(a) genes
miRNA         chromosome
hsa-mir-10a   ch 17
hsa-mir-205   ch 1

(b) sangerRegulation
microRNA      geneName   pValue
hsa-mir-10a   FLJ36874   0.004
hsa-miR-196b  MYO16      0.009

(c) micrornaRegulation
geneID    miRNA        targetSites
FLJ36874  hsa-mir-10a  10
FLJ36874  hsa-mir-10b  3
RUNDC2C   hsa-mir-205  8

(d) proteinCodingGene
Gene      p63Binding
FLJ36874  Y
RUNDC2C   Y
MYO16     N

(e) regulation
geneID    miRNA         targetSites  pValue
FLJ36874  hsa-mir-10a   10           0.004
FLJ36874  hsa-mir-10b   3            null
RUNDC2C   hsa-mir-205   8            null
MYO16     hsa-miR-196b  null         0.009

(f) proteinCodingGeneRegulation
geneID    miRNA         targetSites  pValue  p63Binding
FLJ36874  hsa-mir-10a   10           0.004   Y
FLJ36874  hsa-mir-10b   3            null    Y
RUNDC2C   hsa-mir-205   8            null    Y
MYO16     hsa-miR-196b  null         0.009   N
Fig. 2. User tables and data collected from microRNA.org and microrna.sanger.ac.uk
in the window shown in figure 3(a), that returns for each such gene, a set of gene names that are known to be targets for that miRNA. The site returns the response as shown in figure 3(b), from which the user collects the targetSites along with the gene name partially shown as the table micrornaRegulation in figure 2(c). To be certain, he also collects the set of gene names for each miRNA in table genes from microrna.sanger.ac.uk in a similar fashion, partially shown in table sangerRegulation in figure 2(b). This time the column targetSites is not available, so he collects the pValue values. Note that the scheme for each of the tables are syntactically heterogeneous, but semantically similar (i.e., miRNA ≡ microRNA, geneName ≡ geneID, and so on). He does so because the data in the two databases are not identical, and there is a chance that querying only one site may not return all possible responses. Once these two tables are collected, he takes a union of these two sets of gene names (in micrornaRegulation and sangerRegulation), and finally selects the genes from the intersection of the tables proteinCodingGene (that bind to p63, i.e., p63Binding = ‘N’) and micrornaRegulation ∪ sangerRegulation as his response.
Fig. 3. Typical user interaction interface at the microRNA.org site: (a) input form, (b) returned page
To compute his answers in BioFlow using LifeDB, all he will need to do is execute the script in figure 4 to fully implement the application. In a recent work [19], we have shown how a visual interface called VizBuilder for BioFlow can be used to generate this script by an end user without any knowledge of BioFlow in a very short amount of time. It is also interesting to note that in this application, the total number of data manipulation statements used are only seven (statements (2) through (8)). The rest are usual data definition statements. We refer the readers to [6,21] for the meanings of these statements and a discussion on the semantics of this script. But we would like to mention here that ad hoc integration, information aggregation, and workflow design using BioFlow in LifeDB is very simple compared to leading contemporary systems such as Taverna [20] and Kepler [2], and data management systems such as BioMediator [29], BioGuideSRS [7], and MetaQuerier [17]. In the script in figure 4, the statements numbered (1) through (7) are unique to BioFlow and thus of the most importance. The define function statements (2) and (4) essentially declare an interface to the web sites at URLs in the respective from clauses, i.e., microrna.org and microrna.sanger.ac.uk. The extract clause specifies what columns are of interest when the results of computation from the sites are available, whereas the submit clauses say what inputs need to be submitted. In these statements, it is not necessary that the users supply the exact variable names at the web site, or in the database. The wrapper (FastWrap) and the matcher (OntoMatch) named in the using clause and available in the named ontology mirnaOntology, actually establish the needed schema correspondence and the extraction rules needed to identify the results in the response page. Essentially, the define function statement acts as an interface between LifeDB and the web sites used in the applications. This statement was first introduced in [11] as the remote user defined function for databases where the input to the function is a set of tuples to which the function returns a table. However, the construct in [11] was too rigid and mechanistic, with the user needing to supply all the integration instructions. Actually, it could not use a wrapper or a schema matcher. The user needed to supply the exact scheme and exact data extraction rules. In BioFlow, it is more declarative and intuitive. To invoke the form functions and compute queries at these sites, we use call statements at (3) and (5). The first statement calls getMiRNA for every tuple in table genes, while the second call only sends one tuple to getMiRNASanger to collect the results in tables micrornaRegulation and sangerRegulation. Statements (6) and (7) are also new in BioFlow, and capture, respectively, the concepts of vertical and horizontal integration in the literature2 . The combine statement collects objects from multiple tables, potentially with conflicting schemes into one table. To do so, it also uses a key identifier (such as gordian [27]) to recognize objects across tables. Such concepts have been investigated in the literature 2
² In vertical integration, we collect similar objects from different sources into one set to enlarge the collection (cardinality of a relation), whereas in horizontal integration we extend the information content of objects by acquiring more data from other sources (increase the degree of a relation).
process compute_mirna (1)
{
  open database bioflow_mirna;
  drop table if exists genes;
  create datatable genes {
    chromosome varchar(20), start int, end int, miRNA varchar(20)
  };
  load data local infile '/genes.txt' into table genes
    fields terminated by '\t' lines terminated by '\r\n';
  drop table if exists proteinCodingGene;
  create datatable proteinCodingGene {
    Gene varchar(200), p63binding varchar(20)
  };
  load data local infile '/proteinCodingGene.txt' into table proteinCodingGenes
    fields terminated by '\t' lines terminated by '\r\n';
  drop table if exists micrornaRegulation;
  create datatable micrornaRegulation {
    mirna varchar(200), targetsites varchar(200), geneID varchar(300)
  };
  define function getMiRNA
    extract mirna varchar(100), targetsites varchar(200), geneID varchar(300)
    using wrapper mirnaWrapper in ontology mirnaOntology
    from "http://www.microrna.org/microrna/getTargets.do"
    submit( matureName varchar(100), organism varchar(300) );                  (2)
  insert into micrornaRegulation
    call getMiRNA select miRNA, '9606' from genes;                             (3)
  drop table if exists sangerRegulation;
  create datatable sangerRegulation {
    microRNA varchar(200), geneName varchar(200), pvalue varchar(200)
  };
  define function getMiRNASanger
    extract microRNA varchar(200), geneName varchar(200), pvalue varchar(30)
    using wrapper mirnaWrapper in ontology mirnaOntology
    from "http://microrna.sanger.ac.uk/cgi-bin/targets/v5/hit_list.pl/"
    submit( mirna_id varchar(300), genome_id varchar(100) );                   (4)
  insert into sangerRegulation
    call getMiRNASanger select miRNA, '2964' from genes;                       (5)
  create view regulation as
    combine micrornaRegulation, sangerRegulation
    using matcher OntoMatch identifier gordian;                                (6)
  create view proteinCodingGeneRegulation as
    link regulation, proteinCodingGene
    using matcher OntoMatch identifier gordian;                                (7)
  select * from proteinCodingGeneRegulation
    where pValue <= 0.006 and targetSites >= 2 and p63binding='N';             (8)
  close database bioflow_mirna;
}
Fig. 4. BioFlow script implementing the process
under the titles record linkage [30] or object identification [24]. For the purpose of this example, we adapted GORDIAN [27] as one of the key identifiers in BioFlow. The purpose of using a key identifier is to recognize the fields in the constituent relations that essentially make up the object key³, so that we can avoid collecting non-unique objects in the result. The link statement, on the other hand, extends an object in a way similar to the join operation in relational algebra. Here too, the schema matcher and the key identifier play an important role. Finally, the whole script can be stored as a named process and reused using BioFlow's perform statement. In this example, line (1) shows that this process is named compute mirna and can be stored as such for later use.
2.2 LifeDB Data Integration Model
It may be fairly obvious from the script above that LifeDB adopted a model similar to the one shown in figure 1, in which the application assumes the responsibility of coordinating the query execution plan by individually interacting with the participating sites that respond only to predefined queries by computing one of many possible views. Although we have collected two similar sets of objects from sanger.org and microrna.org, the global schema is not predefined. Finally, we have accessed these two sites to create a unified view of their contents based on their local scheme. In other words, we blurred the distinction between GAV and LAV. Interestingly, the combine statement in (6) actually removes the restrictions imposed by GAV and LAV because it retains all possible information under these two integration models individually, whereas the link statement in (7) extends objects in source databases in a way not possible under GAV and LAV as shown in figure 5.
Fig. 5. Vertical (left) and horizontal (right) data integration in LifeDB 3
³ Note that the object key in this case is not necessarily the primary key of the participating relations.
Unlike GAV and LAV, the model in figure 5 uses the queries as the driver of integration without a required global scheme, using a "best effort" approach similar to [8]. In this approach, the input/output behavior of the web forms (the interface to the hidden web) is matched with the query scheme (the scheme of the input relation and the variable list in the select clause) to the extent possible. In the example in figure 5, the input scheme {A, B, C, D} is mapped to the form entries {W, X, Y, Z} of the first hidden web form. The site responds with a table with scheme {E, F, G, H}, which in turn is mapped to the select variables {P, Q, R}. Once the schema correspondence is established (shown with arrows), and query selection conditions are applied, only a subset of the rows and columns are selected for the combined table (shown with arrows). From the second hidden web, we follow a similar procedure. The only difference is that we are now only able to map a subset of our input values and our select clause variables to the web form. Under the best effort assumption, we currently require that all the input values be mapped to input form entries, but not all the select clause variables need to be found in the site response. Consequently, the table extracted from a site response may not have all the columns, and often needs to be padded with null values. In this example, it is shown that the table extracted from site two did not have an identifiable scheme either. The scheme {I, J, K} has been assigned first using a best effort annotation algorithm and then mapped to the overall scheme {P, Q, R}. We are able to map only {I, K} to {P, R} respectively, and pad Q with nulls. This approach requires that we "identify" objects, not tuples, and collect the objects with as much information as possible. We have adopted the traditional approach to object recognition [28,15,30] and apply a procedure to extract objects as opposed to tuples in a relation. In this sense, our approach to the combine operation is entirely different from the traditional union or outer union operation [18], but resembles the fusion of outer union with entity resolution. The link operation, shown on the right of figure 5, has a similar approach to schema mapping and table extraction. However, this operation is distinct from a natural join or theta join operation, as it uses discovered objects as join conditions, which is potentially different from the joining columns (same-name attributes).
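A rough Python sketch of these two operations as described here (not LifeDB's implementation) may help: combine acts as an outer union that fuses rows believed to describe the same object, and link widens objects with attributes found elsewhere for the same object key. Schema matching and key discovery (the roles of OntoMatch and GORDIAN) are assumed to have already reconciled attribute names and chosen the key column.

```python
def combine(rows_a, rows_b, key):
    """Outer union with object fusion: rows (dicts) from both sources that share
    the discovered object key are merged; missing attributes are padded with None."""
    merged = {}
    for row in list(rows_a) + list(rows_b):
        obj = merged.setdefault(row[key], {})
        obj.update({attr: val for attr, val in row.items() if val is not None})
    columns = {attr for obj in merged.values() for attr in obj}
    return [{attr: obj.get(attr) for attr in columns} for obj in merged.values()]

def link(rows_a, rows_b, key):
    """Horizontal extension: widen each object in rows_a with the attributes
    recorded for the same object key in rows_b (a join on the object key)."""
    index = {row[key]: row for row in rows_b}
    return [{**row, **index.get(row[key], {})} for row in rows_a]
```

In the running example, combine would fuse the two regulation tables into one set of gene objects, and link would then attach the p63 binding information to each of them.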
2.3 Computational Pipeline Coordination and Cost Model
In most scientific computational pipelines, as well as in LifeDB, the coordination model is relatively rudimentary and straightforward. In these models, the application directs the control flow, communicates with the sites in series, collects the responses, and moves from stage to stage as shown in figure 1, and abstracted in figure 6 in which the application does not expect site cooperation beyond the services they already provide. The application assumes the responsibility for schema mapping, table extraction, missing column name annotation, object recognition, table consolidation and filtration. Therefore, network transmission time needed to send and receive result or intermediate relations can be considered the dominant cost of computation as communication speed is significantly
Fig. 6. LifeDB computational pipeline coordination model
lower than local computation⁴. This cost can be measured as the number of tuple bytes transmitted in either direction. Previous research has developed sophisticated transaction and coordination models for distributed query processing, where data integration is not a serious consideration. In other words, the distributed scheme is well understood and a tight coupling of the database engines is assumed. In our approach, these assumptions do not hold, and databases are fully autonomous and mostly opaque to the outside world even though they cooperate in a client-server type model. Our goal in this paper is to preserve the opacity of sites, yet make them accept instructions from clients toward reducing computational cost at the site, as well as reduce network communication cost by eliminating unnecessary data transmission without compromising the site security model. Although the sites will now assume some of the functions of the client, or the application, the overall computational efficiency will be improved.
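Under this simplified cost model, the benefit of filtering at the source can be estimated directly from tuple sizes and selectivities. The small sketch below is illustrative only; these functions and numbers are not defined in the paper.

```python
def transmission_cost(num_tuples, bytes_per_tuple):
    """Bytes shipped for one response, the dominant cost in this model."""
    return num_tuples * bytes_per_tuple

def savings_from_pushdown(num_tuples, bytes_per_tuple, selectivity):
    """Bytes saved when the site applies the filter (keeping a fraction
    `selectivity` of the tuples) before transmitting the result."""
    full = transmission_cost(num_tuples, bytes_per_tuple)
    filtered = transmission_cost(int(num_tuples * selectivity), bytes_per_tuple)
    return full - filtered

# Example: 100,000 result tuples of 200 bytes each, of which only 2% are relevant:
# savings_from_pushdown(100_000, 200, 0.02) -> 19,600,000 bytes not transmitted.
```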
3 A Collaboration Model for Computational Pipelines
To curtail the cost of unnecessary network transmission during a computational pipeline execution, we must receive some form of cooperation from the participating hidden web sites that traditionally do not accept anything but form submissions. Forms typically accept user input data and query conditions. These inputs are then converted into a well-structured query – usually an SQL query when the underlying database management system uses a relational engine. To improve application performance, there are two levels of cooperation a site can potentially provide without compromising its security. To retain autonomy and independence, all sites essentially adopt the black-box behavior. In other words, it still remains the responsibility of the application to decide how to use the site resources and what semantics to assign to the responses the site generates. In the next two sections we discuss how the hidden web sites could help improve application efficiency by accepting data filtration and process coordination instructions from applications toward reducing data transmission cost and client-side post-processing cost without compromising server-side performance.
⁴ Traditional distributed database models also use this simplified cost model for query cost evaluation.
3.1 Reduced Data Transmission through Collaboration
First, the site can assist by returning a table with no redundant tuples from an application's viewpoint. The basic idea is depicted in figure 7, in which we show that the hidden web site now becomes responsible for the functions of the application in figure 1. In other words, the site now sends one optimized response as opposed to a series of large tables corresponding to each form submission. That means a site must accept two pieces of information – a mapping from the application query scheme to the scheme of the response table the site will generate, and a filter condition that the application would like to enforce.
Fig. 7. Modified computational pipeline using cooperative hidden web resources
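One way to picture the two extra pieces of information is as a small structured payload accompanying the form submission. The field names below are purely illustrative assumptions, not part of any protocol defined in this paper; they simply make concrete what a cooperating site would need to receive.

```python
# Hypothetical payload a cooperating site could accept alongside the usual form data.
collaboration_request = {
    "form_inputs": {"matureName": "hsa-mir-10a", "organism": "9606"},
    "schema_mapping": {            # application scheme -> site response scheme
        "geneID": "target_gene",
        "targetSites": "sites",
    },
    "matcher": "OntoMatch",        # mapping algorithm to fetch from the trusted repository
    "filter": "targetSites >= 2",  # selection condition the site may push into its view
}
```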
There are at least two barriers to this collaboration scheme. Ideally, an application does not know the scheme of the returned table before submitting the form as it is following an on-the-fly integration protocol. Therefore, it cannot send the mapping information for the table to the site, and must request the site to carry out the mapping. Since mapping is often an application specific feature, and potentially can determine the application semantics, the choice of the mapping algorithm must reside with the application. Unfortunately, since a site cannot possibly host all possible mapping algorithms nor can it accept an executable for security reasons, the application must either send it, or help the site find it. To avoid a potential security breach, we believe that a trusted site can host possible mapping algorithms that a server site can download from to carry out the mapping, using the application’s mapping of choice. The resulting model will have the appearance of the schematic diagram in figure 8, called synchronous coordination. In this coordination model, the application acts as the central communicator and query execution manager. Alternately, the application or client site could host a mapping service of its own to assist the server site to compute the mapping. This approach is interesting when the application is using exotic schema matchers specific to the application and most other application will not benefit from it. Thereby depositing the mapper at any trusted site will result in little or no benefit. In this paper, however, we will follow the first approach and assume that a trusted site exists from which a site can download the schema matching algorithm.
Fig. 8. Synchronous coordination
To reduce the size of the final result, the simplest solution is to apply the selection condition on the view created by the site and project out the unneeded columns after schema mapping. Although this optimally reduces the size of the table to be transmitted, it misses the opportunity to help the site reduce its computational cost. A better approach would be to push the application constraints down into the site query so that unnecessary tuples are not generated in the first place. Such cost reduction will help the sites balance the additional cost they incur to compute application queries. We will revisit this issue in section 4 to discuss secured collaboration.
3.2 Coordination Models for Collaboration
The synchronous coordination model in figure 8 still requires the site to forward the optimized table to the coordinating site, i.e., the application. If the application simply needs to forward the table returned by site 1 to site 2, as shown in figure 8, then we can further improve the performance by requesting site 1 to simply forward or submit the table to site 2 directly. That way we are able to save one complete transmission. Again for this to be successful, we will need a mapping from the table scheme to the site 2 web form scheme, and communicate a possible mapping information to site 2 as we did in the case of site 1. The collaboration model under this scenario will be significantly different than the previous approach. The idea of the new coordination model can be described using the diagram in figure 9 in which we show a distributed coordination. In the distributed coordination model, called the asynchronous coordination model, an application sends not only mapping information, input data and selection condition to the site, it also sends the query plan, and other control information. The site, upon completion of its portion of the computation in the query plan, forwards the required portion of the remaining query plan and control information to the next site alongwith the table it produced as input. Optionally, the site may inform the application the status of the computation to improve coordination. The complexity and sophistication of such an approach is not the subject of this article. Instead we are merely interested in proposing the model, and developing the detailed protocols as a complete research to be reported elsewhere. The basic and abstract model of coordination is shown in
Fig. 9. Asynchronous collaboration model. Solid arrows show actual data transfers and dashed arrows show control information transfers.
Fig. 10. Asynchronous coordination
figure 10, which differs from the synchronous model in that each site now shares part of the coordination and query plan, and the application no longer has full knowledge of the state of the computation. A hybrid of the synchronous and asynchronous models may become essential when the application needs to augment the pipeline with site-specific additional information (in the form of data or control information), needs to have a monitoring or supervisory role, or if it is simply expensive to send the whole or part of the information down the chain of controls at each site. The longer or more complex the pipeline, the more control information will need to be transmitted. A more sophisticated coordination scheme, similar to distributed transaction processing and coordination, may become necessary to achieve optimality.
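To make the asynchronous hand-off concrete, the following sketch models the remaining query plan as a list of steps that each site pops, executes, and forwards. It is a schematic outline of the idea under assumed field names, not a protocol specification from this paper.

```python
def handle_plan(plan, incoming_table, execute_step, send):
    """Run the head step of the forwarded plan locally, then pass the rest downstream.

    plan           -- list of dicts such as {"site": url, "mapping": ..., "filter": ...}
    incoming_table -- rows produced by the previous site (or the client's input)
    execute_step   -- callable that applies the mapping/filter and the local query
    send           -- callable used to forward data plus the remaining plan
    """
    step, rest = plan[0], plan[1:]
    result = execute_step(step, incoming_table)
    if rest:
        send(rest[0]["site"], {"plan": rest, "table": result})   # forward downstream
    else:
        send(step["return_to"], {"table": result})               # final answer to the client
```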
4 Reducing Query Cost Using Parameterized Views
Usually each site responds to a form submitted by a client by computing a view over its database back-end. Even though it uses a structured database, it often exposes the computed view as a semi-structured document, usually in HTML, over the internet so that the content is viewable based on a style sheet. The scheme of the exposed view is often transformed as well for presentation purposes, or to protect the structure of the internal database scheme. Many hidden web sites often remove the scheme of the view because the data is expected to be consumed by a human end user who is able to gather the meaning naturally. The
diagram in figure 11 explains this black-box interaction model of the sites, where arrows show the scheme mappings. This layered separation of the structure of the database makes it difficult to design an autonomous system to manipulate the view definition without site cooperation. In this section, our goal is to devise a mechanism to allow some degree of view manipulation without security breaches.
(Figure 11 depicts the input form at the site, the table r returned by the site, the table computed by the application, and the application query select I, J, K from r where J < 0.2 and L = 28 posed against r.)
Fig. 11. Hidden web site and application interaction model
Let us assume that the site uses a query Q over the database tables s and t, as shown below, to compute the view presented in figure 11 that the application calls table r.

  select X, P, W from s, t
  where s.X = t.X and U < 17 and Y = d and V = b and Q = c

To be able to safely push a selection condition θ = J < 0.2 ∧ L = 28, as shown in the SQL query in figure 11, we must rewrite query Q as Q′ as follows:

  select X, P, W from s, t
  where s.X = t.X and U < 17 and Y = d and V = b and Q = c and W < 0.2 and P = 28

Rewriting Q into Q′ is relatively straightforward when the mapping among the variables is known and the query condition θ is given. In fact, server interfaces routinely rewrite queries based on form input entries and execute the modified queries. In our case, the base query Qb is perhaps of the form (using the syntax and convention of Roichman and Gudes [26])

  select X, P, W from s, t
  where s.X = t.X and U < 17 and Y = :d and V = :b and Q = :c

which is rewritten into Q by replacing the :x variables with the corresponding values x from the interface slots. Roichman and Gudes [26] observed that uncontrolled
use of such interface parameters to rewrite queries could lead to a security breach called SQL injection. To illustrate this vulnerability, consider the SQL query built as a string

  strSQL = "select Salary from Salary Table where Employee No = 123 and Salary Date = '" + dateParam + "'"

where dateParam is a user-supplied string that is concatenated to form the final SQL query. Now if the user enters 01/2011' or '1' = '1, then we get:

  select Salary from Salary Table where Employee No = 123 and Salary Date = '01/2011' or '1' = '1'

This expression will now return the salary information of all employees because the disjunct '1' = '1' is a tautology. Therefore, there is significant risk in allowing even apparently harmless rewriting of the view definition. Roichman and Gudes suggested a security protocol using a passkey to deal with this breach. Unfortunately, a passkey is not an acceptable solution in scientific databases where a user could land, without notice, on a site from virtually anywhere on the globe, and maintaining any form of passkey or adopting a grant/revoke type of authorization is not feasible. Therefore, we must devise a non-intrusive yet effective method for accepting user constraints without any security breach.
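A common way to close this particular injection vector at the client-server boundary is to bind user-supplied values as query parameters rather than splicing them into the SQL text. The sketch below uses Python's sqlite3 driver purely for illustration (table and column names are adapted with underscores); it is not part of the protocol proposed in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Salary_Table (Employee_No INTEGER, Salary_Date TEXT, Salary REAL)")
conn.execute("INSERT INTO Salary_Table VALUES (123, '01/2011', 5000.0)")

date_param = "01/2011' or '1' = '1"   # the malicious input from the example above

# Vulnerable: the input becomes part of the SQL text, turning the WHERE clause into a tautology.
unsafe_sql = ("SELECT Salary FROM Salary_Table WHERE Employee_No = 123 "
              "AND Salary_Date = '" + date_param + "'")

# Safe: the driver binds the value, so it can only ever be compared as data.
safe_rows = conn.execute(
    "SELECT Salary FROM Salary_Table WHERE Employee_No = ? AND Salary_Date = ?",
    (123, date_param),
).fetchall()   # returns [] because no row has that literal date string
```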
4.1 Passing Parameters to Views
Deductive languages such as Datalog [10] are arguably richer and more powerful in the way views can be manipulated as part of queries. Traditionally, SQL views do not accept parameters, and hence any filtration must be carried as a remedial operation. This is one of the reasons why in practical web interfaces and other user directed query evaluation engines, a literal query composition based approach is adopted to accept user constraints in remote databases at significant risk for security breaches. In the next section, we review the limitations of relational views as opposed to views in deductive languages on intuitive grounds. Inadequacy of Relational Views. SQL views are traditionally used to create a restricted subset (called relational restrictions) of a set of database relations that possibly involve complex operations such as aggregation and other specialized functions. Views are useful for numerous important data processing activities. For example, views can be used to create the user’s universe of discourse, enforce controlled access or authorization policies, implement predefined queries, pose complex queries in a stepwise fashion, and implement stored procedures. However, views are essentially static and cannot be used for changing needs. That is, views are defined with specific applications in mind and hence, are not customizable. More specifically, the conditions in the where clause are fixed and cannot be modified without changing the program. This rigidity can be considered a severe limitation of SQL in comparison to other procedural languages
such as C and PL/SQL, and declarative languages such as Datalog. In the remainder of this section, we discuss the insufficiency of SQL views as an advanced database programming construct in an attempt to justify the development of the concept of parameterized views and relational unification. Consider a simple student database schema as shown in figure 12, and consider two derived or virtual attributes income defined as shown, using dotted lines in the entity sets employee and student, and bonus defined only in employee. The idea here is that the Income attribute in employee is more specialized and hence overrides the definition in student, in a way similar to object-oriented systems. However, since grad and employee are specializations of student, they inherit all the attributes – static (shown using solid lines) and virtual, of student. Assume also that the SQL DDL statements in figure 13 are used to define the schema of these relations.
Fig. 12. Student database ER schema
There could be several implementation strategies to map this ER-like schema into a relational schema. One possible representation is shown in figure 14 where tuples are placed at the level where they actually belong, instead of splitting them in general and specialized entity sets. Notice that the income and bonus attributes are not shown as they are virtual and could be implemented using SQL query or view definitions. Also, Joe and Marni are both graduate students and employed as teaching assistants, and so they appear in both grad and employee tables with appropriate attributes.
create table student ( Id integer, Name char(10), Dept char(10), Stipend integer )
create table employee ( Id integer, Name char(10), Dept char(10), Salary integer )
create table grad ( Id integer, Name char(10), Dept char(10), Stipend integer, Advisor char(10) )
Fig. 13. Relational schema definition
Now let us consider a view v1 called ta info in figure 15 constructed from the relations grad and employee to respond to queries that include income and bonus
as attributes in addition to the other static attributes of grad and employee. We implement the views v2: ta income and v3: ta bonus as shown in figure 15, corresponding to income and bonus, and use them to define the view ta info. What is worth noting here is that the ta info view is concretely defined for CS teaching assistants only. It is not possible to find information about all teaching assistants, or about teaching assistants from the PH department, using the ta info view because that information has already been filtered away. We could redefine the ta income view as shown as v2 in figure 15 and then execute a query over ta info as follows:

  select * from ta info
or
  select * from ta info where Dept = "PH"

The view ta info now computes information for all departments and we are able to choose. Of note here is that in the latter approach, we did not have the opportunity to push the selection constraint Dept = "PH" into the view definition for ta income, and thus more time was spent computing redundant information that is discarded eventually.
student
Id  Name     Dept  Stipend
27  Oishii   CS    1000
45  Sharaar  EE    800

grad
Id  Name   Dept  Stipend  Advisor
39  Susan  CS    1200     Carter
88  Tony   ME    2800     Miller
12  Joe    PH    1200     Pots
82  Marni  CS    1000     Renee

employee
Id  Name    Dept  Salary
61  Brenda  CS    3600
22  Smith   PR    2800
12  Joe     PH    1600
82  Marni   CS    1800
Fig. 14. Student database instance
On the contrary, systems such as Datalog offer a neater way of managing such scenarios through opportunities to optimize naturally. For example, consider the set of Datalog rules below.

r1: ta info(Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus) ←
      grad(Id, Name, Dept, Stipend, Advisor), ta income(Id, Income, Dept),
      employee(Id, Name, Dept, Salary), ta bonus(Id, Bonus).
r2: ta income(Id, Income, Dept) ←
      grad(Id, , Dept, Stipend, ), employee(Id, , Dept, Salary),
      Income = Salary + Stipend.
r3: ta bonus(Id, Bonus) ← ta income(Id, Income, ), Bonus = Income * .10.

Recall that these intensional database (IDB) rules in Datalog are evaluated on demand following a query, whereas the views in SQL are computed right away
v1: create view ta info as
      select Id, Name, Dept, Stipend, Advisor, Salary, Course, Income, Bonus
      from employee as e, grad as g, ta income as i, ta bonus as b
      where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id

v2: create view ta income as
      select Id, Income = Stipend + Salary, Dept
      from employee as e, grad as g
      where e.Id = g.Id and Dept = "CS"

v3: create view ta bonus as
      select Id, Bonus = Income * .10
      from ta income

v2: create view ta income as
      select Id, Income = Stipend + Salary, Dept
      from employee as e, grad as g
      where e.Id = g.Id
Fig. 15. View definitions on student database
and maintained. In SQL, only the queries get executed at run time in an ad hoc fashion. In this way, Datalog rules (although equivalent to views in SQL) behave like procedures, and the queries take the role of procedure calls that provide appropriate variable adornments. Given the rules r1 through r3, we now have many choices of queries that were not possible so easily with SQL views. Now consider the queries below and their SQL and Datalog representations as shown in figure 16, based on the view definitions v1, v2 and v3, the Datalog rules r1 through r3, and their responses in figure 17.

q1: List the names of all CS teaching assistants and their bonuses.
q2: List the names of all CS teaching assistants, their salaries and bonuses.
q3: List the names of all physics teaching assistants and their bonuses.

What is interesting to note is that when evaluating the Datalog versions of the queries q1 through q3, the constants "CS" and "PH" propagate throughout the rules due to sideways information passing and unification [10,3] and help optimize the query evaluation. A similar optimization is not possible in existing SQL because variable adornments cannot be passed from one view to the next. As a consequence, in SQL, we are forced to write a separate view for every possible optimized invocation, or to write a general enough view and select a desired subset as needed, thus rendering the process restrictive, inefficient, and expensive. The queries in figure 16 also expose several interesting properties of SQL and Datalog that are relevant for our ensuing discussion on parameterized views. Query q1 shows that SQL supports arity and position independence, whereas Datalog supports name independence, while both produce the relation in figure 17(a) as an answer. By arity and position independence we mean that
q1 : SQL     : select Name, Bonus from ta info where Dept = ”CS”
     Datalog : ? ta info( , X, ”CS”, , , , , Y).
q2 : SQL     : select Salary, Name, Bonus from ta info where Dept = ”CS”
     Datalog : ? ta info( , X, ”CS”, , , S, , Y).
q3 : SQL     : select Bonus, Name from ta info where Dept = ”PH”
     Datalog : ? ta info( , X, ”PH”, , , , , Y).

Fig. 16. SQL and Datalog versions of queries q1 through q3
(a)  Name    Bonus
     Marni   280

(b)  Name    Salary   Bonus
     Marni   1800     280

(c)  Name    Bonus
     Joe     280

Fig. 17. (a) and (b) - SQL responses for q1 and q2, and (c) Datalog response for q3
for a given scheme r(R) where R = {A, B, C}, the query select A, B from r where A=a is allowed even though the degree of R is 3, and it will produce a response identical to that of the query select B, A from r where A=a. However, select X, B from r is not allowed and will result in a syntax error because there is no attribute named X. That means that the attribute names used in the query must appear in the scheme, but their listed positions are immaterial. In contrast, in Datalog, ? r(A, B) is not allowed (technically speaking, it will produce nothing – a logical false), but ? r(A, B, C) or ? r(A, B, ) are. That means that the degree or arity of the predicate and the relation must match. However, the variable names (or attribute names) do not matter at all, since the queries ? r(A, B, C), ? r(B, A, C) and ? r(X, Y, Z) will produce identical responses, including the order of the columns. In contrast, although select A, B, C from r and select B, A, C from r will produce mathematically identical relations, the order of the attributes will be different.
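To make this contrast concrete, the following small Python sketch (ours, not part of either formalism) mimics the two access disciplines: dictionaries stand in for SQL's name-based access, and fixed-arity tuples with anonymous positions stand in for Datalog's position-based access.

# Illustrative sketch (not from the paper): contrasting SQL-style name-based
# access with Datalog-style position-based access over the relation r(A, B, C).

rows_named = [{"A": "a", "B": 1, "C": True},   # SQL view: columns addressed by name
              {"A": "x", "B": 2, "C": False}]

rows_positional = [("a", 1, True),             # Datalog view: columns addressed by position
                   ("x", 2, False)]

# SQL-like: any subset of known attribute names, in any order, is fine.
def sql_select(rows, attrs, where=lambda t: True):
    return [tuple(t[a] for a in attrs) for t in rows if where(t)]

print(sql_select(rows_named, ["A", "B"], where=lambda t: t["A"] == "a"))
print(sql_select(rows_named, ["B", "A"], where=lambda t: t["A"] == "a"))
# sql_select(rows_named, ["X", "B"]) would raise KeyError: no attribute named X.

# Datalog-like: the query tuple must match the arity; names are irrelevant,
# and None plays the role of the anonymous "don't care" variable.
def datalog_query(facts, pattern):
    if not all(len(f) == len(pattern) for f in facts):
        return []                      # arity mismatch: logically false
    return [f for f in facts
            if all(p is None or p == v for p, v in zip(pattern, f))]

print(datalog_query(rows_positional, ("a", None, None)))  # arity 3: matches
print(datalog_query(rows_positional, ("a", None)))        # arity 2: no answers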
4.2 Necessity of Parameterized Views
It is our contention that if we could mimic Datalog's sideways information passing [3] to propagate attribute values from one view to another as parameters, we could actually make remote hidden web sites accept user constraints in the form of constants. If the supplied constants can be used to filter unwanted data, we could improve the computational efficiency at the server side.
In other words, we could impose further restrictions on the view implementing a hidden web query form by pushing selection conditions inside. The goal here is to facilitate generic view definitions that can be evaluated on demand in a way similar to Datalog using variable bindings, as shown in example 1. Example 1. Consider simulating the Datalog version of the views ta info and ta income. If we could write these views as shown in figure 18, and supply the constant ”CS” when we reference ta info, we could achieve our goal. Notice the use of a call statement in the from clause of view v4 : ta info that passes the Dept value as D to view v5 : ta income which utilizes it in its where clause for selecting the appropriate subset of tuples. The ta info view is activated or fired through a similar call statement query q4 as shown in figure 18.
v4 :
create view ta info with parameter D as select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus from employee as e, grad as g, call ta income with (D’/D) as i, ta bonus as b where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Dept = D
v5 :
create view ta income with parameter D’ as select Id, Income = Stipend + Salary from employee as e, grad as g where e.Id = g.Id and Dept = D’
q4 :
call ta info with (D/”CS”)
Fig. 18. Simple parameterized views and view queries via call statements
While this approach works fine, it has a problem that we must address. Consider the query List all teaching assistants and their salaries. While the Datalog version of this query, i.e., ? ta info( , X, , , , , , Y), will have no trouble computing the answer, the extended SQL just proposed will fail without further machinery. This is because we now need to invoke the view using call ta info with (), without any variable assignment, which breaks down the syntactic structure and possibly raises a runtime exception since the expression Dept = D is not computable. Ideally, we should still be able to use the view, and the call should succeed just like its Datalog counterpart. To achieve this functionality, we introduce the concepts of relational don’t care variables and relational unification. We will address these issues in the next section and see how these two concepts become useful in developing the framework of parameterized views.
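As a rough illustration of the call semantics just described, the Python sketch below expands a parameterized view template with a (possibly partial) substitution; the template paraphrases v4 of figure 18, the helper names are our own, and the unbound case shows exactly where the non-computable condition Dept = D arises.

# Illustrative sketch (ours): naive expansion of a parameterized view.
# The view body below paraphrases v4/v5 of Figure 18; names are assumptions.

TA_INFO = {
    "params": ["D"],
    "sql": ("select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus "
            "from employee as e, grad as g, "
            "call ta_income with (D2/{D}) as i, ta_bonus as b "
            "where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Dept = {D}"),
}

def call(view, bindings):
    """Expand the view with the supplied (possibly partial) substitution."""
    missing = [p for p in view["params"] if p not in bindings]
    filled = {p: repr(v) for p, v in bindings.items()}
    # Leave unbound parameters in place so the problem is visible.
    filled.update({p: p for p in missing})
    return view["sql"].format(**filled), missing

sql, missing = call(TA_INFO, {"D": "CS"})
print(sql)            # ... and Dept = 'CS'  -- evaluable

sql, missing = call(TA_INFO, {})
print(sql)            # ... and Dept = D     -- D is unbound
print("non-evaluable parameters:", missing)
# Without the "don't care" treatment introduced next, the second expansion
# would have to be rejected or would raise an error at execution time.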
Limitations of Macro Expansion and Datalog. The issue of parameter passing to procedures using macro expansion has been a subject of research for some time, and such techniques have been utilized in several commercial database systems. It can be argued that macro expansion is powerful enough to handle the functionalities discussed in this section, in the context of the notion of parameterized views and unification discussed shortly. We would like to point out that macro expansion cannot propagate arguments from one view to the next in a way similar to Datalog (sideways information passing). Macro expansion is also static in nature and is usually handled at compile time (and thus has limited use). In contrast, the parameterized views presented here are dynamic in nature and are handled at run time. More importantly, macros, or techniques of macro expansion, cannot be effectively used for knowledge processing and reasoning applications because such applications require the power of unification, and macros are not, in any way, comparable to unification. As Freire [13] points out, it is not practical to achieve view parameterization through the integration of deductive engines with SQL, even though an argument for such an approach can seemingly be made quite naturally. The natural question, then, is why not choose Datalog? The obvious reason is that almost no production system uses a deductive engine such as Datalog as a computational platform, and hence developing a collaborative model relying on it would not be practical or useful. The technical reason is far more serious and would hold even if the hidden web systems had used Datalog as their query processing engines, for the following reasons. As discussed in sections 4 and 4.1, the layers of separation in the hidden web implementation require that we use some form of schema mapping and establish attribute correspondences across the layers. Since schema mapping is inherently related to attribute names (or terms in logic programming terminology), Datalog's name independence will interfere, and it will be extremely difficult, if not impossible, to develop a mapping algorithm that will successfully establish the needed schema correspondences. Not only will Datalog's name independence pose a serious problem, so will its reliance on attribute position for the identification of columns in a relation. So, we believe that parameterized views, as we envision them in this paper, are a prudent approach to address the propagation of user constraints to hidden web servers. It is important to note that the parameterized view we introduce here has its own applications in relational databases, especially in terms of its capability of supporting reasoning with SQL.
4.3 Syntax and Semantics of Parameterized Views
In this section, we formally define the syntax and semantics of parameterized views in relational databases based on the following principles.

1. We must preserve SQL's tradition of supporting arity independence in accessing a relation, stored (base relations) or derived (views or procedures). This principle is a carryover from traditional practices in SQL.

2. We should be able to use a view at any point in any SQL expression where a relation can be used. Since our goal is to customize views through parameter
passing, it makes sense that we allow parameterized views or procedures in just such a fashion in SQL expressions.

3. We should be able to adorn the variables in the procedure from outside when we reference it in a dynamic fashion at run time.

Keeping these goals in mind, we identify parameterized views as transient views, very much like queries. But unlike queries, they are not evaluated at the point of definition. Instead, they are evaluated every single time they are referenced in other traditional (long-lived) views or queries, or are called using a call statement (to be defined shortly). We borrow the idea of formal parameter lists from procedural languages, albeit in a unique way, to define the syntax and semantics of parameterized views. We introduce the with parameter clause in the create view statement, which defines a maximal set of allowed input parameters for the view. If we treat the attribute list of the select clause within the create view statement as the maximal output list, then the interface to the view in extended SQL is the union of the variables in the with parameter and select clauses, as formalized in the definition below.

Definition 1. Let D be a database, R be a set of relation names in D, and A be a set of attribute or tag names. Also let v ∈ R, E1, . . . , El be a set of relational expressions involving relations in D, and {A1, . . . , An} be a set of expressions involving attributes of E1, . . . , El and a set of tags P1, . . . , Pm such that {P1, . . . , Pm} ⊆ A. Similarly, let C1, . . . , Ck be Boolean conditions involving attributes of E1, . . . , El and the tags P1, . . . , Pm. Then,

create view v with parameter P1, . . . , Pm as
select A1, . . . , An
from E1, . . . , El
where C1 ∧ . . . ∧ Ck

is a parameterized view.
The parameter list P1, . . . , Pm specifies replaceable tags that are used in the view definition as placeholders. The behavior exhibited by the view depends on these placeholders, as they play a vital role in the restriction of the computed view. To evaluate the view, we use the call statement defined below. The call statement invokes the view with a specified set of bindings for a subset of the tags appearing in the parameter list of the view.

Definition 2. Let R be a set of relation names, A be a set of attributes, v ∈ R be a parameterized view with a parameter list L, {A1, . . . , An} ⊆ L ⊆ A be a set of attributes, and finally c1, . . . , cn be a set of constants. Then, call v with (A1/c1, . . . , An/cn), and call v are procedure calls.
Notice that the set A1 , . . . , An in the with clause of the call statement must be an improper subset of the parameter list of the view v. The second form of the call statement allows the possibility of evaluating a view without any parameter passing – i.e., no variable binding, the case where n = 0. In this way, it allows us to capture arity polymorphism of the view definition and the parameter list of v becomes a maximal set. This means that we are allowed to use all or a subset of parameters. But we cannot call a parameterized view with an argument that is not a member of its parameter list. Unfortunately, if {A1 , . . . , An } ⊂ {P1 , . . . , Pm }, i.e., a strict subset, then there is a possibility that a run-time error may occur at execution time as the expressions specified in the view may become fragile or non-computable, as demonstrated in the following example. Notice that every view is a relation, and a call to a parameterized view produces a relation by computing the view as a query. Hence, a call statement can be used in an expression where a traditional view or a relation can be used with appropriate renaming as necessary (i.e., as clause). In this way, the closure property of declarative views is also preserved. Consider redefining the ta info view as follows and calling it with the series of call statements shown next to it. The first call initializes the variables D and I in ta info with the supplied constants. The computed view is the relation shown in figure 17 (a). Notice that the order of the argument list is immaterial. Current implementation of Oracle supports this form of named assignment of variables in procedure calls when defaults are declared in the procedure for its arguments. Essentially, the argument list in the call statement specifies a set of substitutions for the variables in the view expressions. Once the substitutions are made, the query expressed in the select clause can be executed to compute the answer.
v6 :
create view ta info with parameter I, D as select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus from employee as e, grad as g, call ta income with (D’/D) as i, ta bonus as b where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Dept = D and Id = I
q5 : call ta info with (D/”CS”, I/82)
q6 : call ta info with (I/82)

The first procedure call q5 results in the following query from the parameterized view v6 : ta info. So, basically the query
q7 :
select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus from employee as e, grad as g, call ta income with (D’/D) as i, ta bonus as b where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Dept = ”CS” and Id = 82
is executed, but not the view
v7 :
create view ta info as select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus from employee as e, grad as g, call ta income with (D’/D) as i, ta bonus as b where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Dept = ”CS” and Id = 82
as in traditional SQL. This approach makes the evaluation truly transient and ad hoc, and so the computation does not survive from statement to statement. Hence, a call statement can be used as an expression wherever a query, view or relation can be used, as noted before. We do, however, have a problem with call statements whose substitutions are partial, such as q6, because the where clause in the view ta info becomes fragile due to the presence of non-evaluable conditions, i.e., Dept = D, Id = I, or both. There are two ways to deal with this situation. The first is to disallow evaluation of such views at run time. If needed, we may disallow the call itself to avoid generating a run-time error by requiring that the substitution list of the call statement match the parameter list of the view, i.e., for a view parameter list P1, . . . , Pm and a call substitution list A1/c1, . . . , An/cn, m = n. But this choice restricts the several interesting ways in which the call can be perceived. Another way to deal with this issue is to ignore expressions that do not have a ground substitution. Similar to Datalog, if we take the view that every omitted substitution in the with clause of the call statement is a “don’t care” term, then we would like to think that the queries
q8 :
select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus from employee as e, grad as g, call ta income with (D’/D) as i, ta bonus as b where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Dept = ”CS”
q9 :
select Id, Name, Dept, Stipend, Advisor, Salary, Income, Bonus from employee as e, grad as g, call ta income with (D’/D) as i, ta bonus as b where e.Id = i.Id and i.Id = b.Id and b.Id = g.Id and Id = 82
and
resulting from calls q4 and q6 respectively are the likely scenarios in the relational context. While don’t care variables are well understood and pose no threat in logic programming paradigm, in relational context, their use in parameterized views, however, raises some interesting questions. In relational model, don’t care terms (attributes) are modeled by omitting them in the select clauses. In the context of parameterized views, we are proposing that we treat a predicate involving parameters as don’t care terms if certain conditions are met. To guarantee the production of the queries q8 and q9 above corresponding to the calls q4 and q6 , we proceed as follows to define evaluability of expressions and queries.
Definition 3. Let D be a database, V be a variable name and E be a relational expression over D. Variable V is non-prime if V is not an attribute name of a relation in database D.

Definition 4. Let E be an expression and {A1, . . . , An} be the set of tags appearing in E. Expression E is evaluable if E is free of non-prime variables, i.e., E contains only constants or attributes of some relations in the database. An SQL query is evaluable if all expressions in its select and where clauses are evaluable.

Our goal is to allow parameterized views of the form ta info yet guarantee query safety, and thus ensure that every expression in a query is evaluable at run time even when only a subset of the parameters is used in the call statement. To guarantee query safety, we define relational don’t care expressions as follows.

Definition 5. Let Q be an SQL query and E be an expression in the select or where clause of Q. Then E is a relational don’t care expression if E is not evaluable.

During execution, don’t care expressions in the select clause are treated as null, and those in the where clause as true. Hence, it is safe to remove don’t care expressions from the select list and from the where clause. Under these assumptions, queries q4 and q6 on view v6 would result in the execution of the queries q8 and q9 as intended. Note that the introduction of the concept of relational don’t care expressions gives rise to the following two definitions of safe and unsafe queries and of intended definitions of parameterized views.

Definition 6. Let v be a parameterized view, and L be its parameter list. Let C be a call statement for v with a substitution list S. If T is the set of tags in S then C is safe if T ⊆ L; otherwise it is unsafe.

Definition 7. Let v be a parameterized view, and L be its parameter list. Let D be the set of queries obtained from v by replacing expressions in v involving elements l ∈ L in all possible ways (there are exactly |P(L)| possible ways) as follows:
– replace an expression involving l in the select clause with null.
– replace an expression involving l in the where clause with true.
– replace an expression involving l in the from clause with empty.
Then, the set D is the intended definitions of v.
4.4 Relational Unification Algorithm
Before we proceed to develop a relational unification algorithm to support processing parameterized views, we need to make a distinction between variables that are external to an expression as opposed to those that are internal. Because our goal was to support arity independence following the tradition of SQL, we need to develop a method to simulate name independence so that names in other expressions do not conflict with the variable names in an expression at hand. This will help control accidental instantiation of variables across expressions, and alleviate concern about name duplication in other expressions.
Definition 8. Let E be a relational expression, and V be the set of variables appearing in E. A variable var ∈ V is external to E if it is in a substitution list of a call statement in the form var/e; it is internal otherwise.

The essence of definition 8 is that a variable (i.e., var) that refers to a parameter of a parameterized view in a substitution list of a call statement embedded in an expression E is independent of the expression E. So, any possible substitution in E must not affect var. Recall the view ta info defined in example 1. In that definition, we needed to separate the variable D in ta info from the variable D in view ta income. Otherwise, any uniform replacement of D in ta info would affect ta income, or might even result in meaningless substitutions (i.e., as in this case ”CS”/”CS”). Now, we can use the idea of external variables to our advantage, as shown in example 2 below, yet support name independence from expression to expression. The unification algorithm we present in this section utilizes this notion.

algorithm: unification
input: a parameterized view v and a substitution list L.
output: a select query, q.
begin
  let {A1, . . . , Ak} be the set of external variables in L;
  let {P1, . . . , Pn} be the set of parameters in v;
  if {A1, . . . , Ak} ⊈ {P1, . . . , Pn} then
    report runtime unification error; exit;
  apply the substitution uniformly on the internal variables of view v
    to obtain v′ as v′ = v[{A1/c1, . . . , Ak/ck}];
  extract the select query from v′ as q;
  for every select clause expression e of q
    if e is not evaluable then remove e from q;
  for every where clause condition expression c of q
    if c is not evaluable then replace c with the Boolean constant true;
  return q;
end.

Fig. 19. Relational unification algorithm
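For illustration, the following Python sketch is one possible reading of the algorithm in figure 19 over a simplified, dictionary-based view representation; the data structures and helper functions are our own assumptions rather than a prescribed implementation.

# Illustrative sketch (ours) of the relational unification algorithm of Fig. 19
# over a simplified view representation.

class UnificationError(Exception):
    pass

def variables(expr, params):
    """Parameters (tags) mentioned in a textual expression."""
    tokens = expr.replace("=", " ").replace("*", " ").replace("+", " ").split()
    return {t for t in tokens if t in params}

def unify(view, substitution):
    """view = {'params': [...], 'select': [...], 'from': ..., 'where': [...]}.
    substitution maps a subset of the parameters to constants."""
    params = set(view["params"])
    if not set(substitution) <= params:          # external variables must be parameters
        raise UnificationError("substitution uses non-parameters")

    def ground(expr):
        # Token-wise replacement of bound parameters by their constants.
        return " ".join(repr(substitution[t]) if t in substitution else t
                        for t in expr.split())

    select = [ground(e) for e in view["select"]]
    where = [ground(c) for c in view["where"]]

    # Drop non-evaluable select expressions; turn non-evaluable conditions into true.
    select = [e for e in select if not variables(e, params)]
    where = [c if not variables(c, params) else "true" for c in where]
    return ("select " + ", ".join(select) +
            " from " + view["from"] +
            " where " + " and ".join(where))

ta_info = {
    "params": ["D", "I"],
    "select": ["Id", "Name", "Dept", "Stipend", "Advisor", "Salary", "Income", "Bonus"],
    "from": "employee as e, grad as g, ta_income as i, ta_bonus as b",
    "where": ["e.Id = i.Id", "i.Id = b.Id", "b.Id = g.Id", "Dept = D", "Id = I"],
}

print(unify(ta_info, {"D": "CS"}))   # Id = I becomes true, as in query q8
print(unify(ta_info, {"I": 82}))     # Dept = D becomes true, as in query q9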
Example 2. Consider the views v4 and v6 and the query q4 . The algorithm for relational unification presented in figure 19 will produce the query q8 on both views v4 and v6 . Notice that variable D in the subexpression call ta income with (D/”CS”) as i is untouched as it is external to this expression and the fact that view ta income also has a variable name D is not an issue. In logic programming, this is a by product of unification and substitution. The goal of relational unification is somewhat different from its logic programming counterpart. In logic programming, the goal is to make the target predicate
(consequent of a rule, and thus the body) look as identical as possible to the query predicate. The safety of the evaluation is not an issue. But in relational unification, we make it a point that the target query remains safe, evaluable and intended as defined in definitions 6, 4 and 7 respectively. For this, we rely on the parallel that exists in logic programming, and hence the approach remains grounded on sound foundations. For any extended SQL call statement C, and a parameterized view v, algorithm unification always returns an evaluable query which is intended if, and only if, C is safe and relational unifiable, or else it returns an error message.
5 Constraint Injection into Hidden Web Form Views
We are now equipped to formally define the concept of constraint injection to SQL views. We will utilize the fact that web forms already accept user constraints as a set of attribute-value pairs that they plug into view definitions for onward evaluation, and that we now have a mechanism to influence the restriction operation or the selection condition of views through parameterization. We formally define the concept of constraint injection as follows. Definition 9. Let W be a hidden web form, and Vw be the set of form variables appearing in W . Let V be the parameterized view to compute the query response R such that Vw appears in the Boolean condition of V as CVw . Let Vr be the set of attributes in the computed view R by V . Then, V is a fully parameterized view if the parameter list L of V is equal to Vw ∪ Vr . We assume that applications are already aware of the form variables, and anticipate finding a table in response to the form submission that will have an associated scheme. The application also expects to apply a selection condition θ and project out the unwanted columns. In other words, for a relation r, we have a user supplemental query Q of the form select A1 , . . . , An from r where θ on the view r. Our goal is to transmit this query to W and enforce θ as an additional constraint on a fully parameterized view V and apply projection ΠA1 ,...,An (r). Since the view V is still a black-box, we will apply our best effort model to define an evaluable parameterized view as follows. Definition 10. Let V be a fully parameterized view with defined constraint C, a formal parameter list L, and a select list with Lv . Let Q be a supplemental query on V with projection list P and constraint θ. A constrained view Vc is a rewriting of V such that the select list Lv is replaced with P and the constraint C is replaced with C ∧ (θ). Although we have modified the view to behave as though the original view V has been computed as a table r, and we are now applying the supplemental query on
r, we still need to guarantee that the rewriting is syntactically and semantically correct. This is because the project list P may not be appropriate for V, or the condition θ may not be enforceable because of a schema mismatch. To ensure that we have a best effort view that is computable at site W, we define an admissible view as follows.

Definition 11. Let Vc be a constrained view of a fully parameterized view V, Q be a supplemental query, μ be a binary mapping function that uniquely maps a set of terms to another set of terms, and Vq be the set of all attribute names in the select list of Q and the condition θ. Vc is strictly admissible if ∀t, t ∈ Vq → (∃u, u ∈ Vr ∧ μ(t, u) = 0). It is best effort admissible if ∀t, t ∈ Vq → (∃u, u ∈ Vr ∧ μ(t, u) = 0) holds, or otherwise t (or the condition involving t in θ) is removed from the project list P (or the Boolean term involving t is replaced with the constant true in θ).

Definition 11 makes it possible either to enforce all requirements of the supplemental query Q (strict admissibility), or to enforce them on a best effort basis by removing the projections and selection conditions that are not enforceable. By changing a condition to true for terms that are not computable (because the attribute does not exist in the view), we retain the logical structure of θ unaltered. Once the schema mapping is completed, and we know whether we are able to enforce strict or best effort admissibility, we are in a position to honor the application's request, or reject it. All an application now has to do is send the web form entries as attribute-value pairs, a project list, a selection condition and a choice of mapping function to the site W. Since we apply θ as a conjunct, we can never violate site security, as θ is applied in addition to what V was already applying (an instance of the case σa(σb(r)) ≡ σa∧b(r)). Finally, the technology we have proposed for parameterized views handles the propagation of these constraints in θ, called constraint injection, deeper into the view definition appropriately.
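The rewriting of definitions 10 and 11 can be sketched in a few lines of Python; the representation of the view as a scheme, a from clause and a base condition, and of θ as a list of conjuncts, are our own illustrative assumptions.

# Illustrative sketch (ours): best-effort constraint injection into a view.
# The view is held as its scheme, base condition C, and FROM clause; the
# supplemental query supplies a projection list P and a conjunctive theta.

def inject(view, projection, theta, attr_map=lambda a: a):
    """Return the constrained, best-effort admissible query text.
    attr_map plays the role of the mapping function mu (identity by default)."""
    scheme = set(view["scheme"])

    # Projection: keep only attributes that map onto the view's scheme.
    kept = [attr_map(a) for a in projection if attr_map(a) in scheme]

    # theta: a list of (attribute, operator, constant) conjuncts.
    conjuncts = []
    for attr, op, const in theta:
        a = attr_map(attr)
        conjuncts.append(f"{a} {op} {const!r}" if a in scheme else "true")

    where = view["condition"] + " and (" + " and ".join(conjuncts or ["true"]) + ")"
    return f"select {', '.join(kept)} from {view['from']} where {where}"

result_view = {
    "scheme": ["Id", "Name", "Dept", "Income", "Bonus"],
    "from": "ta_info",
    "condition": "Dept = D",          # the site's own, parameterized condition C
}

# Application-side supplemental query: project two columns, restrict Bonus.
print(inject(result_view,
             projection=["Name", "Bonus", "Advisor"],   # Advisor is not exposed: dropped
             theta=[("Bonus", ">", 200), ("Advisor", "=", "Pots")]))
# -> select Name, Bonus from ta_info where Dept = D and (Bonus > 200 and true)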
6 Summary
Our central focus in this paper was to propose a model of collaboration for Life Sciences databases to improve overall computational efficiency. By introducing the notion of parameterized views and then admissible constrained views, we have demonstrated that user constraints can now be enforced on the server-side view computation without any security breach or SQL injection. We emphasize that parameterized views can be used as a substitute for hidden web interface implementation, a form of module construct that avoids the string-substitution-based interface query rewriting that led to SQL injection breaches in traditional systems. We believe that the techniques presented in this paper, though simple, for the first time make it possible for remote sites to accept user directed computation inside the hidden web in a secure and collaborative manner without the risk of SQL injection. A full treatment of a more formal model for constraint injection was outside the scope of this paper due to space constraints.
A more detailed discussion of this issue may be found in [22], where we introduce the idea of mandatory and optional parameters, avoid using relational don’t care variables (as we did in this paper), and introduce the idea of interface views. Constraint injection is given a formal status by defining a simple notion of query containment along the lines of earlier research in [1,12,9,14,25]. The presented model respects site autonomy and improves overall query efficiency with site participation. What remains to be explored is how the coordination plan can be developed and transmitted to participating sites, and how the overall architecture can be implemented, which we plan to address in a separate paper. We see opportunities to utilize existing research on distributed transaction processing and commit protocols in the database literature toward developing query plans as part of our future research endeavors.
References

1. Afrati, F.N., Damigos, M., Gergatsoulis, M.: Query containment under bag and bag-set semantics. Inf. Process. Lett. 110(10), 360–369 (2010)
2. Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., Mock, S.: Kepler: An extensible system for design and execution of scientific workflows. In: SSDBM, p. 423 (2004)
3. Bancilhon, F., Maier, D., Sagiv, Y., Ullman, J.D.: Magic sets and other strange ways to implement logic programs. In: PODS, pp. 1–15 (1986)
4. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Genbank. Nucleic Acids Res. 36 (database issue) (January 2008)
5. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acids Res. 28(1), 235–242 (2000)
6. Bhattacharjee, A., Islam, A., Amin, M.S., Hossain, S., Hosain, S., Jamil, H., Lipovich, L.: On-the-fly integration and ad hoc querying of life sciences databases using LifeDB. In: 20th International Conference on Database and Expert Systems Applications, Linz, Austria, pp. 561–575 (August 2009)
7. Boulakia, S.C., Biton, O., Davidson, S.B., Froidevaux, C.: Bioguidesrs: querying multiple sources with a user-centric perspective. Bioinformatics 23(10), 1301–1303 (2007)
8. Cafarella, M.J., Halevy, A.Y., Khoussainova, N.: Data integration for the relational web. PVLDB 2(1), 1090–1101 (2009)
9. Calvanese, D., Giacomo, G.D., Lenzerini, M., Vardi, M.Y.: View-based query containment. In: PODS, pp. 56–67 (2003)
10. Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1(1), 146–166 (1989)
11. Chen, L., Jamil, H.M.: On using remote user defined functions as wrappers for biological database interoperability. International Journal on Cooperative Information Systems 12(2), 161–195 (2003)
12. Farré, C., Teniente, E., Urpí, T.: Checking query containment with the cqc method. Data Knowl. Eng. 53(2), 163–223 (2005)
13. Freire, J.: Practical problems in coupling deductive engines with relational databases. In: Proceedings of the 5th KRDB Workshop, Seattle, WA, pp. 11-1–11-7 (May 1998)
14. Grahne, G., Thomo, A.: Query containment and rewriting using views for regular path queries under constraints. In: PODS, pp. 111–122 (2003)
15. Guo, S., Dong, X., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. PVLDB 3(1), 417–428 (2010)
16. Gusfield, D., Stoye, J.: Relationships between p63 binding, dna sequence, transcription activity, and biological function in human cells. Mol. Cell 24(4), 593–602 (2006)
17. He, B., Zhang, Z., Chang, K.C.C.: MetaQuerier: querying structured web sources on-the-fly. In: SIGMOD Conference, pp. 927–929 (2005)
18. Hosain, S., Jamil, H.: An algebraic foundation for semantic data integration on the hidden web. In: Third IEEE International Conference on Semantic Computing, Berkeley, CA (September 2009)
19. Hossain, S., Jamil, H.: A visual interface for on-the-fly biological database integration and workflow design using VizBuilder. In: 6th International Workshop on Data Integration in the Life Sciences (July 2009)
20. Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock, M.R., Li, P., Oinn, T.: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34 (July 2006); Web Server issue
21. Jamil, H., Islam, A., Hossain, S.: A declarative language and toolkit for scientific workflow implementation and execution. International Journal of Business Process Integration and Management 5(1), 3–17 (2010); IEEE SCC/SWF 2009 Special Issue on Scientific Workflows
22. Jamil, H., Jagadish, H.V.: Accepting external constraints on deep web database query forms and surviving it. Tech. rep., Department of Computer Science, Wayne State University, Michigan (June 2011)
23. Kent, J.W., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., Haussler, D.: The human genome browser at ucsc. Genome Res. 12(6), 996–1006 (2002)
24. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 3(1), 484–493 (2010)
25. Penabad, M.R., Brisaboa, N.R., Hernández, H.J., Paramá, J.R.: A general procedure to check conjunctive query containment. Acta Inf. 38(7), 489–529 (2002)
26. Roichman, A., Gudes, E.: Fine-grained access control to web databases. In: SACMAT, pp. 31–40 (2007)
27. Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: efficient and scalable discovery of composite keys. In: VLDB 2006, pp. 691–702 (2006)
28. Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. 26(8), 607–633 (2001)
29. Wang, K., Tarczy-Hornoch, P., Shaker, R., Mork, P., Brinkley, J.: Biomediator data integration: Beyond genomics to neuroscience data. In: AMIA Annu. Symp. Proc., pp. 779–783 (2005)
30. Yakout, M., Elmagarmid, A.K., Elmeleegy, H., Ouzzani, M., Qi, A.: Behavior based record linkage. PVLDB 3(1), 439–448 (2010)
Flexible-ICA Algorithm for a Reliable Iris Recognition

Imen Bouraoui (1), Salim Chitroub (1), and Ahmed Bouridane (2,3)

(1) Signal and Image Processing Laboratory, Electronics and Computer Science Faculty, USTHB, P.O. Box 32, El – Alia, Bab – Ezzouar, 16111, Algiers, Algeria
(2) School of Computing, Engineering and Information Sciences, Northumbria University, Pandon Building, Newcastle upon Tyne, UK
(3) Department of Computer Science, King Saud University, P.O. Box 2454, Riyadh, 11451, Saudi Arabia
[email protected], [email protected], [email protected], [email protected]
Abstract. In many large scale biometric-based recognition problems, knowledge of the limiting capabilities of underlying recognition systems is constrained by a variety of factors, including the choice of a source encoding technique and the quality, complexity and variability of the collected data. In this paper, we propose a novel iris recognition system based on an Independent Component Analysis (ICA) encoding technique, which captures both the second and higher-order statistics and projects the input data onto basis vectors that are as statistically independent as possible. We apply the Flexible-ICA algorithm in the framework of the natural gradient to extract efficient feature vectors by minimizing the mutual information of the output data. The experimental results carried out on two different subsets of the CASIA V3 iris database show that ICA reduces the processing time and the feature vector length. In addition, ICA has shown an encouraging performance which is comparable to the best iris recognition algorithms found in the literature.

Keywords: Biometrics, Iris recognition, Feature extraction, Flexible-ICA, CASIA-V3 iris database.
1 Introduction

In recent years, with the increasing demands of security in our networked society, technologies for personal authentication are becoming one of the main solutions to safeguard people’s privacy and properties. Traditional user authentication schemes, based on passwords, secret codes and/or identification cards or tokens, can be cracked by intercepting the presentation of such a password, or even by counterfeiting it (via password dictionaries or, in some systems, via brute force attacks). On the other hand, an intruder can attack systems based on identification cards or tokens by robbing, copying or simulating them. In this case, biometric technology becomes an attractive alternative recognition technique [1]. Biometric authentication deals with recognizing the identity of individuals based on their unique physical or behavioural characteristics. Physical characteristics such as fingerprint, palm print, hand
geometry, face, ear, voice, and iris patterns or behavioural attributes such as gait, typing pattern and handwritten signature present information that is specific to the person and can be used in the authentication process [2]. Out of all these biometric features, fingerprint verification has received considerable attention and has been successfully used in law enforcement applications. Face recognition and speaker recognition have also been widely studied over the last years, whereas iris recognition is a newly emerging approach to personal identification [3], [4]. It is reported in [5] that iris recognition is one of the most reliable biometrics. The iris is an internal organ of the eye that is located just behind the cornea and in front of the lens. The functionality of the iris is to control the size of the pupil, which in turn regulates the amount of light entering the pupil and impinging on the retina. The visible structures around the iris are the pupil, sclera and cornea. The pupil lies near the centre of the iris. It appears dark because most of the light entering the pupil is absorbed by the tissues inside it. The sclera is commonly known as “the white of the eye”. It is the eye's protective outer cover. The cornea is the transparent part of the human eye which covers the iris, pupil and the anterior part of the human eye [35]. The iris is an annulus structure consisting of fibro-vascular tissues called stroma. It is divided into two regions; pupillary zone which is the inner region of the iris on the periphery of the pupil and the ciliary zone which constitutes the rest of the iris as shown in Figure 1. These two regions are separated by the collarette which typically represents the thickest part of the human iris. Furrows found in the stroma on the circumference of the iris have a circular structure. They are termed as circular furrows. The majority of the furrows in the stroma is interlaced and radiate towards the pupil (dilator muscles) and are termed as radial furrows. Some circular furrows (sphincter muscles) are present just outside the pupil in a narrow band of about 1 mm. The stroma connects the sphincter muscles, which contract the pupil, with the dilator muscles, which dilate the pupil. The iris also contains sharply demarcated crypts that are a result of iris thinning which exposes the darkly pigmented posterior layer. Thus, the iris in the presence of near-infrared lighting, is observed to have several features including radial and concentric furrows, crypts and the collarette, all of which contribute to its uniqueness and play a significant role in recognising an individual. Ophthalmologists originally proposed that the iris of the eye might be used as a kind of optical fingerprint for personal identification [6]. Their proposal was based on clinical results that every iris is unique and it remains unchanged in clinical photographs. The human iris begins to form during the third month of gestation. The structure is complete by the eighth month of gestation, but pigmentation continues into the first year after birth. It has been discovered that every iris is unique and no two people even two identical twins have uncorrelated iris patterns [7], and is stable throughout the human life. It is suggested in recent years that the human irises might be as distinct as fingerprint for different individuals, leading to the idea that iris patterns may contain unique identification features. 
However, there are various medical conditions that may affect the iris [36], [37], [38], [39], such as cataract, glaucoma, albinism, etc.
Fig. 1. Anatomy of the iris. The two white blobs in the pupil are specular reflections due to the imaging device.
Many researchers have worked on iris recognition including image databases, and human iris authentication process basically consists of four steps as follows: (a) iris segmentation, where the iris is localized and isolated from the noise due to sclera, pupil, eyelids and eyelashes; (b) normalization, where iris is mapped from rectangular representation to domain polar representation; (c) feature extraction, where a feature vector is formed which consists of the ordered sequence of the features extracted from the various representation of the iris images; (d) and matching, where the feature vectors are classified through different techniques such as Hamming Distance, weight vector and winner selection, dissimilarity function, etc. In our work, we first use Canny edge detection and Hough transform for iris localisation. Then, the extracted iris region is normalised into a rectangular block with constant dimensions to account for imaging inconsistencies of Daugman’s model. We apply Flexible-ICA algorithm to extract efficient feature vectors. Then, each iris feature vector is encoded into an iris code. Finally, a Hamming distance is used for the matching process. We demonstrate our experimental results using two different subsets of CASIA-V3 iris image database and some mathematical criteria, in order to compare our technique against some other existing methods in order to assess its usefulness. The remainder of this paper is organized as follows. In Section 2 we briefly review some existing and important iris recognition systems. Section 3 is concerned with description of iris image pre-processing, which involves iris localization, and normalization. Basic principles of iris feature extraction using flexible-ICA algorithm are reviewed in detail in Section 4. Section 5 introduces matching technique. Section 6 reports the experiments and results. And finally, a conclusion and future work are given in Section 7.
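For concreteness, the four steps just enumerated can be arranged as in the following Python skeleton; the function names, signatures, default resolutions and the acceptance threshold are illustrative assumptions of ours, and each placeholder corresponds to a technique detailed in the sections that follow.

# Skeleton of the four-step iris authentication pipeline (names are ours).
import numpy as np

def segment(eye_image):
    """Locate pupil and iris boundaries (Canny edge map + circular Hough)."""
    raise NotImplementedError   # see Section 3.1

def normalize(eye_image, pupil_circle, iris_circle, radial_res=64, angular_res=256):
    """Unwrap the annular iris region to a fixed-size polar block (Section 3.2)."""
    raise NotImplementedError

def encode(polar_block, ica_basis):
    """Project patches onto ICA basis vectors and binarize (Section 4)."""
    raise NotImplementedError

def hamming_distance(code_a, code_b, mask_a=None, mask_b=None):
    """Fraction of disagreeing bits over the usable bits (Section 5)."""
    code_a, code_b = np.asarray(code_a, bool), np.asarray(code_b, bool)
    usable = np.ones_like(code_a, bool)
    if mask_a is not None:
        usable &= np.asarray(mask_a, bool)
    if mask_b is not None:
        usable &= np.asarray(mask_b, bool)
    return np.count_nonzero((code_a ^ code_b) & usable) / np.count_nonzero(usable)

def authenticate(probe_image, enrolled_code, ica_basis, threshold=0.35):
    # threshold is an assumed, illustrative value, not the paper's operating point
    pupil, iris = segment(probe_image)
    polar = normalize(probe_image, pupil, iris)
    code = encode(polar, ica_basis)
    return hamming_distance(code, enrolled_code) <= threshold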
2 Overview of Some Iris Recognition Systems In this section we briefly review a sample of iris recognition works available in the literature with respect to the basic constituting modules, namely: iris preprocessing including location and normalisation, feature extraction and encoding, and matching. The idea of automating iris recognition was proposed by Flom and Safir, in 1987. They obtained a patent for an unimplemented conceptual design of an automated iris biometric recognitions system [8]. Their description suggested highly controlled conditions, including a headrest, a target image to direct the subject’s gaze, and a manual operator. To account for the expansion and contraction of the pupil, they suggested changing the illumination to force the pupil to a predetermined size. While the imaging conditions that they describe may not be practical, some of their other suggestions have clearly influenced later research. For example, they suggested using pattern recognition tools, including difference operators, edge detection algorithms, and the Hough transform, to extract iris descriptors. To detect the pupil, they suggest an algorithm that finds large connected regions of pixels with intensity values below a given threshold. They also suggested that a description of an individual’s iris could be stored on a credit card or identification card to support a verification task. One of the well-known and thoroughly tested algorithms is due to Daugman [9], [19]. The Daugman algorithm first locates the pupillary and limbic boundaries of the iris using an integro-differential operator that finds the circles in the image where the intensity changes most rapidly with respect to changes in the radius. Once located, the iris image is converted from Cartesian form by projecting it to onto a dimensionless pseudo-polar coordinate system. The iris features are encoded and a signature is created using a 2-D complex-valued Gabor filter, where the real and imaginary parts of each outcome are assigned a value of 0 or 1 according to whether they are negative or positive, i.e. only the quadrant of the phase is encoded. Finally, two images are said to be independent if their fractional Hamming distance is above a certain threshold. Otherwise they are match with the Hamming distance being equals to the number of mismatching bits divided by the number of compared bits. The Daugman algorithm has been essentially error-free when applied to a very large database [9]. The Wildes algorithm [10] locates the iris boundaries by creating a binary edge map using gradient-based edge detection, and then finds the centers and radii of these circles using a Hough Transform. The upper and lower eyelids are located similarly using parabolic arcs. Rather than map every iris image to a common system of polar coordinates, the Wildes algorithm compares two images by geometrically warping one image, via shifting and rotations, until it is a best fit with the other image, in the sense of minimizing the mean square distance. A Laplacian pyramid is constructed at four different resolution levels to encode the image data. Matching is achieved via an application of normalized correlation and Fisher’s linear discriminant [10]. Boles and Boashash [11] have given an algorithm that locates the pupil center using an edge detection method, records grey level values on virtual concentric circles, and then constructs the zero-crossing representation on these virtual circles based on a one-dimensional dyadic wavelet transform. 
Corresponding virtual circles in different images are determined by rescaling the images to have a common iris diameter. The authors create two dissimilarity functions for the purposes of matching, one using every point of the representation and the other using only the zero crossing
points. The algorithm has been tested successfully on a small database of iris images, with and without noise [11]. In [12], Lim, et al. propose alternative approaches to both feature extraction and matching, following a standard iris localization and conversion to polar coordinates relative to the center of the pupil. For feature extraction they compare the use of the Gabor Transform and the Haar Wavelet Transform, and their results indicate that the Haar Transform somewhat performs better. Using the Haar transform the iris patterns can be stored using only 87 bits, which compares well to the 2,048 required by Daugman’s algorithm, for example. The matching process uses an LVQ competitive learning neural network, which is optimized by a careful selection of initial weight vectors. Also, a new multidimensional algorithm for winner selection is proposed. Experimental results are given in [12] based on a database of images of irises from 200 people. In [13], Tisse, et al., present a modification of Daugman’s algorithm, with two major differences. The two innovations relate to the iris location and feature extraction stages. The use of a dimensionless polar coordinates and Hamming distance remain the same. To locate the iris, the Tisse algorithm applies a gradient decomposed Hough Transform to find the approximate center of the pupil, and then applies an integro-differential operator similar to Daugman’s algorithm, to find the precise locations of the iris boundaries. This combined approach has the advantage of avoiding errors due to specular reflection in the images. In the feature extraction and encoding step, Hilbert Transform is used to create an analytic image, whose output is then encoded as an emergent frequency vector and an instantaneous phase. This approach has an advantage of being computationally efficient. The Tisse algorithm has reportedly been successful when tested on a database of real iris images [13]. Tan, et al., suggest several innovations of iris recognition in [14], and [15], and then provide a comparison of different methods and algorithms. The iris is localized in several steps which first find a good approximation for the pupil center and radius, and then apply the Canny operator and the Hough transform to locate the iris boundaries more precisely. The iris image is converted to dimensionless polar coordinates, similarly to Daugman, and then is processed using a variant of the Gabor filter. The dimension of the signature is reduced via an application of the Fisher linear discrimant. The L1 distance, L2 distance (i.e. Euclidean distance), and cosine similarity measures are considered for matching. A careful statistical performance evaluation is provided for the authors’ work, and for most of the well-known algorithms mentioned above. In another approach presented by Ma et al., the quality of image is assessed with the help of Support Vector Machines [15]. The detection is done using Hough Transform and the feature vectors are generated using a set of Multichannel Spatial Filters and Even Symmetry Gabor filters to capture local texture information of the iris, which is then used to construct a fixed length feature vector. The dimensionality reduction of the feature vectors is done using Fisher Linear Discriminant. Nearest feature line method is used for iris matching. 
In [16] a set of one-dimensional intensity signals is constructed to effectively characterize the most important information of the original two-dimensional image using a particular class of wavelets; a position sequence of local sharp variation points in such signals is recorded as features. A fast matching scheme based on exclusive OR operation is used to compute the similarity between a pair of position sequences.
In [17], several interesting ideas can also be found in the work of Du et al.. An edge detection is performed using the Canny method, and each iris image is then converted into standardized polar coordinates relative to the center of the pupil. The feature extraction stage is quite different from those mentioned previously, and is simple to implement. The authors use a gray scale invariant called Local Texture Patterns (LTP) that compares the intensity of a single pixel to the average intensity over a small surrounding rectangle. The LTP is averaged in a specific way to produce the elements of a rotation invariant vector. Thus the method performs a lossy projection from 2D to 1D. This vector is then normalized so that its elements sum to 1. The matching algorithm uses the “Du measure”, which is the product of two measures, one based on the tangent of the angle between two vectors p and q, and the other based on the relative entropy of q with respect to p, otherwise known as the Kullback-Liebler distance. Another paper involving Du [18], in the context of hyperspectral imaging, provides evidence that the Du measure is more sensitive than either of the other two measures. This iris recognition algorithm is quite fast and appears to be most appropriate for producing a “watchlist” [17], rather than being used for identification of specific individuals. In [30], Meena used high thresholding technique for pupil detection and major intensity change for iris detection. She has developed five different algorithms for iris recognition based on circular and radial features, Fourier transform along radial direction, circular-mellin filters, corner detection and local binary patterns. She has also reported the use of the fusion of circular-mellin and corner detection algorithm to extract features for iris pattern recognition. Vatsa et al. [31] have used thresholding and Freeman’s chain code algorithm for detecting pupil and also used linear contrast filter for detecting iris. They have used 1-D log polar Gabor wavelet and Euler numbers to extract texture features for iris pattern recognition. Monro et al. [32] used edge based segmentation for pupillary detection. They also have scanned the horizontal line through the pupil center and found jumps in gray level on either side of the pupil for iris boundary detection. They have developed an iris feature extraction method based on discrete cosine transform (DCT). They applied the DCT to overlapping rectangular image patches rotated 45 degrees from the radial axis. The differences between the DCT coefficients of adjacent patch vectors are then calculated and a binary code is generated from their crossings. In order to increase the speed of the matching, three most discriminating binarized DCT coefficients are kept, and the remaining coefficients are discarded. Few approaches based on the emergent method “Independent Component Analysis (ICA)” are also proposed in [1], [33] and [34]. Dorairaj et al. [33] used a combination of Principal Component Analysis with Independent Component Analysis to encode the iris image. They used the integro-differential operators to localize the iris and the pupil. The matching was accomplished using both the Euclidean and the Hamming distance metrics. In [1] Wang et al. proposed a frame of iris recognition, they used multi-scale approach to realize iris localization and a method to represent iris feature with ICA including fixed-point algorithm which is an approximation of Fast-ICA algorithm [40]. 
They obtained a correct recognition rate of 97.25%, and the computational load for calculating feature vectors was very high. However, results obtained by Bae et al. [34], using a database of 990 images including 99 subjects, show that Fast-ICA algorithm enhance the performance of iris feature extraction
which is 0.113 %, and reduced the computational complexity and feature vector size to 22.9 ms and 49 Bytes, respectively. Although significant progress has been achieved in iris recognition, some problems remain unsolved. Most of developed systems and algorithms are claimed to have exclusively high performance and generally used small image sets for performance evaluation. In our work, we describe and analyze the performance of flexible-ICA algorithm, new iris feature extraction technique, by partitioning images into the patches and reducing their dimensionality. Two different, large and real subsets of international certified CASIA iris image databases are used for testing our implemented method. We compare our results with the results obtained in [9], [15] and [16]; they used larger images set involving more than 200 subjects.
3 Image Pre-processing

The iris is an annular part between the pupil (inner boundary) and the sclera (outer boundary). Therefore, a captured image cannot be expected to contain only the iris part; it also contains non-useful parts, e.g. the sclera, eyelids and pupil. The iris region should therefore be located in the captured eye image and normalized to a polar array.

3.1 Iris Localisation

Iris localization by definition means isolating the actual iris region in a digital eye image by detecting the inner and outer boundaries of the iris. The eyelids and eyelashes normally occlude the upper and lower parts of the iris region. To detect the iris and pupil boundaries, the Hough transform is used together with Canny edge detection to generate an edge map. The gradients are biased in the vertical direction for the outer iris/sclera boundary, while the vertical and horizontal ones are weighted equally for the inner iris/pupil boundary, as suggested in [13] and [10]. The Hough transform locates contours in an n-dimensional parameter space by examining whether they lie on curves of a specified shape. For the iris outer or pupillary boundaries and a set of recovered edge points (xi, yi), i = 1, . . . , n, a Hough transform is defined by

H(xc, yc, r) = Σ_{i=1}^{n} h(xi, yi, xc, yc, r)    (1)

where H(xc, yc, r) shows a circle through a point; the coordinates xc, yc, r define a circle by the following equation:

xc² + yc² − r² = 0    (2)

In the case of edge detection for iris boundaries, the above equation becomes

(xi − xc)² + (yi − yc)² − r² = 0    (3)
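For illustration, the following Python sketch implements a brute-force version of this circular Hough accumulation over a binary edge map; the angular sampling, the candidate radius range and the function names are our own assumptions, not the exact implementation used in the experiments.

# Illustrative brute-force circular Hough transform over an edge map (eqs. 1-3).
import numpy as np

def circular_hough(edge_map, radii):
    """edge_map: 2-D boolean array; radii: iterable of candidate radii.
    Returns (xc, yc, r) maximizing H(xc, yc, r) of equation (1)."""
    height, width = edge_map.shape
    ys, xs = np.nonzero(edge_map)                      # edge points (xi, yi)
    best, best_circle = -1, None
    thetas = np.linspace(0.0, 2 * np.pi, 120, endpoint=False)
    for r in radii:
        acc = np.zeros((height, width), dtype=np.int32)
        # Each edge point votes for all centres at distance r from it.
        for x, y in zip(xs, ys):
            cx = np.rint(x - r * np.cos(thetas)).astype(int)
            cy = np.rint(y - r * np.sin(thetas)).astype(int)
            ok = (cx >= 0) & (cx < width) & (cy >= 0) & (cy < height)
            np.add.at(acc, (cy[ok], cx[ok]), 1)
        if acc.max() > best:
            best = int(acc.max())
            yc, xc = np.unravel_index(int(acc.argmax()), acc.shape)
            best_circle = (int(xc), int(yc), int(r))
    return best_circle

# Usage sketch: edges = canny(eye_image); pupil = circular_hough(edges, range(20, 60))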
The eyelids are then isolated by first fitting a line to the upper and lower eyelid parts using a linear Hough transform. A second horizontal line is then drawn, which
intersects with the first line at the iris edge that is closest to the pupil. The second horizontal line allows a maximum isolation of eyelid regions, while a thresholding operation is used to isolate the eyelashes.

3.2 Iris Normalisation

Normalization refers to preparing a localised iris image for the feature extraction process. The process involves unwrapping the iris image and converting it into its polar equivalent. It is carried out by using Daugman’s Rubber sheet model [22], [19] as shown in figure 2. The centre of the pupil is considered as the reference point and a remapping formula is used to convert the points on the Cartesian scale to the polar scale. The remapping of the iris image I(x, y) from raw Cartesian coordinates to polar coordinates (r, θ) can be represented as

I(x(r, θ), y(r, θ)) → I(r, θ)    (4)

where r is on the interval [0, 1] and θ is the angle in [0, 2π], with

x(r, θ) = (1 − r) xP(θ) + r xI(θ)
y(r, θ) = (1 − r) yP(θ) + r yI(θ)    (5)

where

xP(θ) = OxP(θ) + rP cos(θ)
yP(θ) = OyP(θ) + rP sin(θ)    (6)

and

xI(θ) = OxI(θ) + rI cos(θ)
yI(θ) = OyI(θ) + rI sin(θ)    (7)
The centre of the pupil is denoted by (OxP, OyP) and (OxI, OyI) is the center of the iris; rP is the radius of the pupil and rI is the radius of the iris; and (xP, yP) and (xI, yI) are the coordinates of points bordering the pupil’s radius and iris’ radius respectively along the direction θ.
Fig. 2. Daugman’s Rubber sheet model
196
I. Bouraoui, S. Chitroub, and A. Bouridane
4 Feature Extraction by ICA The iris has an interesting structure and presents rich texture information. The distinctive spatial characteristics of the human iris are available at a variety of scales [7]. As such, a well-known subspace analysis technique such as Independent Component Analysis (ICA) is used to capture local distinctive information in an iris and creates a set of compact features for an effective recognition task. 4.1 Independent Component Analysis ICA represents a novel and powerful statistical method for subspace analysis, with applications in computational neuroscience and engineering. It consists of automatically identifying the underlying components in a given data set. It requires that at least as many simultaneously recorded mixtures as there are components and each mixture is a combination of components that are independent and nongaussian. However, like all methods, the success of ICA in a given application depends on the validity of the assumptions on which ICA is based and the results should be treated with caution. So, much theoretical work remains to be done on precisely how ICA fails when its assumptions, i.e. linear mixing and statistical independence, are severely violated [23], [20]. Generally, the most popular noising – free linear model of ICA is expressed as follows X = AS
(8) where X is a vector variable, of dimension N, in which each variable is an observed signal mixture and S is a vector variable, of dimension M, in which each variable is a source signal. We assume that N > M. The mixing matrix A defines a linear transformation on S, which can usually be reversed in order to recover an estimate U of S from X, i.e.
S ≈ y = WX −1
(9)
where the separating matrix W = A is the inverse of A. However, A is an unknown matrix and cannot therefore be used to find W. Instead, many iterative algorithms are used to approximate W in order to optimize independence of S. In this paper, the Flexible-ICA algorithm [24] is deployed. Since mutual information is the natural information-theoretic measure of the independence of random variables, it could be used as the criterion for finding the ICA transform. In this approach, which is an alternative to the model estimation approach, the ICA of a random vector X is defined as an invertible transformation as in (9), where the matrix W is determined so that the mutual information of the transformed components of S is minimized. Mutual information is a natural measure of the dependence between random variables. It can be interpreted by using the concept of differential entropy H of a random vector y with density f (.) as follows [26]
Flexible-ICA Algorithm for a Reliable Iris Recognition
H ( y ) = − f ( y ) log f ( y )dy
197
(10)
Entropy is considered as the coding length of the random variable yi , i = 1…N. In fact, it is defined as H ( yi ) = −
P ( y ) log P ( y ) i
i
i
(11)
However, mutual information I between the N (scalar) random variables yi, i = 1...N [25], [27], is defined as N
I ( y1 , y 2 , , y N ) =
H ( y ) − H ( y) i
(12)
i =1
Using the invertible linear transformation presented in (9). Mutual information [25], [26], is given by
I ( y1 , y2 ,, y N ) =
H ( y ) − H ( x) − log det W i
(13)
i
To search space of separating matrix or Stiefel manifold W, let us consider that yi have been uncorrelated and have unit variance. This means E [ yy T ] = WE [ xx T ]W T = I
(14)
which implies det I
= 1 = det WE [ xx T ]W T =
det W det E [ xx T ] det W T
(15)
This requires that (detW) must be constant. In this case, the minimisation of mutual information leads to the following loss function N
L (W ) = −
log p ( y ) i
i
(16)
i =1
The gradient of loss function (16) is given by ∇ L (W ) =
∂L (W ) = ϕ ( y ) xT ∂W
(17)
where
ϕ ( y ) = [ϕ1 ( y1 ), , ϕ N ( y N )]T and
(18)
198
I. Bouraoui, S. Chitroub, and A. Bouridane
ϕ ( yi ) = −
d log pi ( yi ) dyi
(19)
The natural Reimannian gradient in Stiefel Manifold was calculated by [28] and it can be written as follows ∇L(W ) = ∇L(W ) − W [∇L(W )]T W = ϕ ( y ) xT − yϕ T ( y)W
(20)
With this, the learning algorithm for W takes the form [28], [29]: ΔW
= − η ∇ L(W ) = η[ yϕ T ( y)W − ϕ ( y ) xT ]
(21)
where η is a learning rate (small positive constant) and φ(y) is non-linear function, noted by ϕ ( y) =
1 log(cosh(a1 y )) a1
(22)
where 1 < a1 < 2 is some suitable constant. In the learning process, the increment ΔW should satisfy the constraint
ΔWW T + WΔW T = 0
(23)
4.2 Feature Extraction
Image representations are often based on discrete linear transformations of the observed data. Consider a black-and-white image whose gray-scale value at the pixel indexed by x and y, denoted by I ( x, y ) . Many basic models in image processing express the image I ( x, y ) as a linear superposition of some features or basis functions ai ( x, y ) , that is M
I ( x, y ) =
a ( x, y ) s i
i
(24)
i =1
where s i are feature coefficients. These basis functions, ai ( x, y ) , are able to capture the inherent structure of the iris texture. This, particularity allows us to apply ICA and thus create a set of compact features for an effective recognition task. Alternatively, we can just collect all the pixel values in a single vector X, in which case we can express the representation as in (8) for ICA model. We assume here that the number of transformed components is equal to the number of observed variables. This type of a linear superposition model gives a useful description on a low level support where we can ignore such higher-level nonlinear phenomena such as occlusions. For the sake of simplicity, let us restrict ourselves here to the simple case where the variables
Flexible-ICA Algorithm for a Reliable Iris Recognition
199
ai (x,y) form an invertible linear system, that is, the matrix A is square. Then we can invert the system as s i = wi ( x, y )I (x, y )
(25)
x, y
where the w i denote the inverse filters of ICA. In practice, we cannot model a whole image using the model in (24). Rather, we apply it on image patches or windows [23]. Thus we partition the image into patches of n×n pixels and model the patches with the model in (24). However, care must then be taken to avoid border effects. Before extracting the iris features, we note that the ICA application is greatly simplified if the vector X of all iris images is first whitened or sphered. There are two common pre-processing steps. The first step is to center the images as, X = X − E{X } in order to make their local mean equal 0. The next step is to apply a whitening transform B to the data such that −1
(26)
B = D 2 ET
with E corresponds to the eigenvectors of the covariance matrix of X and the diagonal matrix D contains the related eigenvalues. The whitening process helps to uncorrelate the data so that Principal Component Analysis (PCA) can work with a unit variance. The whitened data are used as the input for the Flexible-ICA algorithm [24], demonstrated above, which computes a set of basis vector, wi from a set of iris images, and the images are projected into the compressed subspace to obtain a set of coefficients, si. New test images are then matched to these known coefficients by projecting them onto the basis vectors and finding the closest coefficients in the subspace.
5 Matching It is very important to present the obtained feature vector in a binary code because it is easier to determine the difference between two binary code-words than between two number vectors. In fact, Boolean vectors are always easier to compare and to manipulate. We have applied a Hamming Distance matching algorithm for the recognition of two samples. It is basically an exclusive OR (XOR) function between two bit patterns. Hamming Distance is a measure, which delineates the differences of iris codes. Every bit of a presented iris code is compared to the every bit of referenced iris code, if the two bits are the same, e.g. two 1’s or two 0’s, the system assigns a value ‘0’ to that comparison and if the two bits are different, the system assigns a value ‘1’ to that comparison. The formula for iris matching is shown as follows HD =
1 N
P ⊕Q i
i
(27)
where N is the dimension of feature vector, Pi is the ith component of the presented feature vector, while Qi is the ith component of the referenced feature vector.
200
I. Bouraoui, S. Chitroub, and A. Bouridane
6 Experimental Results and Analysis This section is concerned with description of the database used in our analysis, the iris recognition process based on ICA, the results obtained and their analysis using some mathematical criteria including a comparative study against some existing methods. 6.1 Iris Database
CASIA-IrisV3 includes three subsets which are labelled as CASIA-IrisV3-Interval, CASIA-IrisV3-Lamp, and CASIA-IrisV3-Twins [21]. CASIA-IrisV3 contains a total of 22,051 iris images taken from more than 700 subjects and 1,500 eyes. All iris images are 8 bit gray-level JPEG files, collected under near infrared illumination. Almost all subjects are Chinese except a few in CASIA-IrisV3-Interval. Because the three data sets were collected at different times, only CASIA-IrisV3-Interval and CASIA-IrisV3-Lamp have a small overlap in subjects, which are used to test our algorithm. CASIA V3-Interval. contains a total of 2,655 iris images from more than 249 subjects and 396 classes. Iris images were captured in two sessions, with one month interval using specialised digital optics developed by the National Laboratory of Pattern Recognition, China. The captured images with a resolution of 320×280 are original unmasked with very good quality and the iris texture details are extremely clear. CASIA V3-Lamp. contains a total of 16,213 non-ideal iris images from more than 411 subjects and 819 classes, the collection was taken in one session by using OKI’s hand-held iris sensor. The captured images with a resolution of 640×480 have nonlinear deformation due to variations of visible illumination. 6.2 Experimental Steps
The experiments were performed in Matlab 7.3 on a T2330 dual-Core 1.60 GHz CPU with 2048 M RAM. The results have been obtained by using 1530 iris images including 306 classes of CASIA-IrisV3-Interval subset and 2052 iris images including 228 classes of CASIA-IrisV3-Lamp subset of CASIA Iris Image Database V3. Each iris image should be localised by detecting its inner and outer boundary and its eyelids and eyelashes, unwrapped and converted into its polar equivalent; where a number of data points are selected along each radial line and this is defined as the radial resolution and the number of radial lines going around the iris region is defined as the angular resolution. Then a histogram stretching method was used to obtain a well distributed iris images. Figure 3 gives an example of an iris sample of each subset with its pre-processing steps. So we have obtained a total of 1530 pre-processed image samples of CASIAIrisV3-Interval and 2052 of CASIA-IrisV3-Lamp of size of 32×240 pixels used for features extraction process, as illustrated in figure 4, which consists of determining the values of si and wi (x,y) for all i and (x,y), of the representation in (25) by giving a sufficient number iris image patches I(x,y).
Flexible-ICA Algorithm for a Reliable Iris Recognition
201
First, we consider one image for each class, (i.e. 306 or 228 images if we use CASIA V3-Interval or CASIA V3-Lamp, respectively). These images are partitioned to 10,000 image patches of size of n×n pixels, which were taken at random locations from the pre-processed images, and gathered by normalising each image patch to column vector of size (n2×1), then held into matrix X of size (n2×10,000). The dimension of X is reduced to R×10,000, (R < n2). The separating matrix W is calculated using the whitening process and Flexible-ICA algorithm described above. In the other hand, all iris images of a given subset are partitioned to n×n image patches and each one is normalised to a column vector and held in Xˆ , this later is projected in stiefel manifold W(R×n2) in order to obtain feature vectors S. Then, we generate the corresponding iris code for storage and comparison purposes. The encoding method of iris code is to assign values of 0 or 1 to each IC coefficient 1 if Q(Si ) = 0 if
Si > 0 Si ≤ 0
(28)
Finally, we use the Hamming distance to compare two iris codes. The hamming distance is the count of bits different in the two iris codes. 6.3 Evaluation Criteria
To evaluate the similarity of the projected iris images, we have used a corresponding matching metric that should give one range of values when comparing projected iris images of the same eye (intra-class comparisons) and another range of values when comparing the projected iris images created from different irises (inter-class comparisons). These two cases should give distinct and separate values, so that a decision can be made with high confidence as to whether two projected iris images are from the same iris, or from two different irises. The experiments were completed in a verification mode, the receiver operating characteristic (ROC) curve and equal error rate (EER) are used to evaluate the performance of the proposed method. The ROC curve is a false acceptance rate (FAR) versus false rejection rate (FRR) curve, which measures the accuracy of matching process and shows the overall performance of an algorithm. The FAR is the probability of accepting an imposter as an authorized subject and the FRR is the probability of an authorized subject being incorrectly rejected. Points on this curve denote all possible system operating states in different tradeoffs. The ideal FAR versus FRR curve is a horizontally straight line with zero false rejection rate. The EER is the point where the false acceptance rate and the false rejection rate are equal in value. The smaller the EER is, the better the algorithm. The accuracy of the system, feature vector size and computational complexity are also used to compare our iris proposed algorithm with the algorithms proposed by Daugman [8], Ma et. al. [16] and Tan et al. [15].
202
I. Bouraoui, S. Chitroub, and A. Bouridane
(a)
(b)
(c)
(d)
(e) (f) Fig. 3. Iris image pre-processing steps of a sample of each subset of CASIA Iris database, CASIA V3-Interval and CASIA V3-Lamp (left and right), (a) original iris, (b) iris localisation, (c) eyelash and eyelids detection, (d) unwrapped iris with a radial resolution of 32 pixels and angular resolution of 240 pixels, (e) normalised and (f) enhanced iris.
Flexible-ICA Algorithm for a Reliable Iris Recognition
203
Fig. 4. Block diagram of feature extraction process
6.4 Results and Discussion
To assess the accuracy of the proposed ICA algorithm, each iris image in the database is compared with all the other irises in the CASIA Iris database. For the CASIA-V3Interval subset, the total number of comparisons is 49,725, where the total number of intra-class comparisons is 3,060 and that of inter-class comparisons are 46,665. For the CASIA-V3-Lamp subset, the total number of comparisons is 34,086, where the total number of intra-class comparisons is 8,208 and that of inter-class comparisons are 25,878. The intra-class and inter-class distance distribution plot of Hamming Distance calculated by the proposed ICA algorithm of CASIA-V3-Interval and CASIAV3-Lamp are showed in figure 5. Figure 5 shows that the distance between the intra-class and the inter-class distribution is large, indicating a good discriminability of the extracted features. Figure 5 reveals also that the intra-class distance distribution of CASIA-V3-Lamp is larger than the intra-class distance distribution of CASIA-V3-Interval, the reason being that CASIA-IrisV3-Lamp images were taken by the variation of visible illumination with nonlinear deformation which gives bad results of the phase of localisation and normalisation. This is verified by the following verification results. Table 1 shows EERs for each subset and each reduced size of ICA coefficients, given by R={56, 40, 32, 24, 20, 16, 12, 10, 8} and image patches sizes given by 16×16 and 8×8 pixels of CASIA V3-Interval and CASIA V3-Lamp. As mentioned in table 1, EERs obtained by CASIA-IrisV3-Interval according to different ICs values, when image patches size is 8×8 pixels, are lower than 0.2%. The reason that CASIA-IrisV3-Interval images are very good quality resulting in extremely clear iris texture details. These results are better than results obtained when image patches size is 16×16 pixels. This means that ICs values are very small compared to image patches size and the whitened data fails to capture enough information on the original data, this requires increasing the number ICA coefficients. However, EERs
204
I. Bouraoui, S. Chitroub, and A. Bouridane
obtained using CASIA-IrisV3-Lamp with 16×16 image patches size, which lie between 12.34 % and 15.54 %, are better than those obtained with 8×8 image patches size, which are between 15.69% and 16.89% when the reduced size is higher than 20. An explanation for this behavior can be that the small eigenvalues correspond to highfrequency components and usually encodes noise. We can see that the error rate increases when ICA coefficients decrease, but when the information is strongly affected by noise according to some coefficients, the performance does not always decrease with the reduction of ICA coefficients, and this leads to an unstable ICA estimation. The performance of ICA is evaluated by taking the best result shown in table 1 which will be compared with algorithms proposed by Daugman [9] [19], Ma et. al. [16] and Tan et al. [15] using the CASIA-V3-Interval iris image database [21]. These previous methods for iris recognition mainly focus on feature representation and matching. Therefore, we only analyze and compare the accuracy, efficiency and computational complexity of feature representation of these methods. The method proposed by Daugman [9] is probably the best-known. Daugman used phase information to represent local shape of the iris details. He projected each small local region onto a bank of Gabor filters, and then quantized the resulting phasor denoted by a complex-valued coefficient to one of the four quadrants in the complex plane. In essence, Daugman analyzed the iris texture by computing and quantizing the similarity between the quadrature wavelets and each local region, which requires that the size of the local region must be small enough to achieve high accuracy, the dimensionality of the feature vector is 2,048 components. Ma et al. method [16] constructs a set of intensity signals to contain the most important details of the iris and makes use of stable and reliable local variations of the intensity signals as features, their method contains about 660 components, this is because that their method only records the position of local sharp variations as features and contains less redundant information. In [15], Tan et al. utilize multichannel spatial filters to extract texture features of the iris within a wider frequency range, this indicates that the extracted features are more discriminating, they extract local characteristics of the iris from the viewpoint of texture analysis, and the dimensionality of the feature vector is 1,600 components. Figure 6 shows the ROC curves of such verification algorithms. From the results shown in Figure 6, we can see that our proposed method has the best performance, followed by both Ma et al. and Daugman methods which are slightly better than the method of Tan et al... Our proposed method is based on flexible-ICA algorithm which extracts global features in pre-processing step that reduces dimensions for obtaining ICA components for iris; ICA explores independent components of fine iris features. These components of ICA are statistically independent, which reflect iris detail information (such as freckles, coronas, strips, furrows, crypts, and so on) change, whose distribution indicates iris individual difference for each class. So, the local basis images obtained with ICA can lead to more precise representations. Since ICA reduces significantly the size of iris code, this leads to decrease of processing time. 
Table 2 shows that our method consumes less time than others, followed by both Tan and Ma methods which are based on 1-D signal analysis. However, Daugman method involves 2-D mathematical operation. These comparisons indicate that our algorithm has an effective and emerging performance in iris recognition.
Flexible-ICA Algorithm for a Reliable Iris Recognition
205
3000
8000 Intra-class Inter-class
7000
Intra-class Interclass
2500
6000 2000 Frequenc y
Frequency
5000 4000
1500
3000 1000
2000 500
1000 0
0
10
20
30
40 50 60 70 Normalised threshold (%)
80
90
100
0
0
10
(a)
20
30
40 50 60 70 Normalised threshold (%)
80
90
100
(b)
Fig. 5. Results of intra-class and inter-class distributions of (a) CASIA V3-Interval and (b) CASIA V3-Lamp Table 1. Performance evaluation according to numbers of independent component and image patches sizes of CASIA V3-Interval and CASIA V3-Lamp Database Win. size
ICs
56 40 32 24 20 16 12 10 8
CASIA-V3-Interval 8x8 16x16 0.10 % 0.03 % 0.03 % 0.10 % 0.16 % 0.03 % 0.10 % 0.34 % 0.14 % 0.86 % 0.04 % 0.45 % 0.16 % 3.95 % 0.13 % 3.72 % 0.13 % 2.25 %
CASIA-V3-Lamp 8x8 16x16 15.69 % 12.48 % 16.89 % 12.34 % 16.88 % 15.54 % 16.22 % 13.62 % 16.19 % 18.09 % 16.82 % 17.08 % 19.93 % 20.83 % 16.76 % 19.28 % 18.34 % 23.96 %
Table 2. Performance comparison of the algorithms Methods Daugman [9]
Feature vector size (bit/image) 2048
Performance (%) 0.08
Computational Complexity (ms) 285
Ma et al. [16]
660
0.07
95
Tan et al. [15]
1600
0.48
80.3
Proposed ICA
960
0.04
31.2
206
I. Bouraoui, S. Chitroub, and A. Bouridane
1.6 Proposed ICA Daugman [9] Ma et al. [16] Tan et al. [15]
1.4
False Rejection Rate (%)
1.2
1
0.8
0.6
0.4
0.2
0 -3 10
-2
10
-1
0
10 10 False Acceptance Rate (%)
1
10
2
10
Fig. 6. Comparison of ROC curves
7 Conclusion In this paper, we have introduced an effective iris recognition system based on Hough transform for iris localisation, Daugman's cartesian to polar transform for normalisation, ICA for feature extraction with partition of iris images into patches, and hamming distance for matching. Two iris image subsets of CASIA iris V3 database have been used to evaluate the performance of our system. FlexibleICA algorithm, which improves the quality of separation introducing a better density matching and allows a faster learning, has been adopted for computing the ICs. The results obtained suggest that ICA could perform well for such application with small size of image patches of iris images with very good quality, but it is insensitive to variant illuminations and noise caused by eyelids and eyelashes, and even for blurred iris images, which is a limitation of ICA. For this, we propose to adopt other existing ICA algorithms, such as Noisy-ICA algorithm, for this application. Through comparing with some existing methods, all experimental results have demonstrated that Flexible-ICA algorithm achieves very good performance in both speed and accuracy. One of the future directions on iris recognition is on the protection of this system from attacks. We would like to propose a multifactor biometric authentication method based on cryptosystem keys containing biometric signatures.
Flexible-ICA Algorithm for a Reliable Iris Recognition
207
References 1. Wang, Y., Han, J.Q.: Iris Recognition Using Independent Component Analysis. In: Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, August 18-21 (2005) 2. Ross, A.A., Nandakumar, K., Jain, A.K.: Handbook of Multibiometrics. Springer Science+Business Media, LLC (2006) 3. Jain, A., Bolle, R., Pankanti, S. (eds.): Biometrics: Personal Identification in a Networked Society. Kluwer, Dordrecht (1999) 4. Zhang, D.: Automated Biometrics: Technologies and Systems. Kluwer, Dordrecht (2000) 5. Mansfield, T., Kelly, G., Chandler, D., Kane, J.: Biometric Product Testing Final Report. issue 1.0, Nat’l Physical Laboratory of UK (2001) 6. Obaidat, M.S., Sadoun, B.: Verification of computer users using keystroke dynamics. IEEE Trans. Syst. Man Cybernet. 27, 261–269 (1997) 7. Wayman, J., Jain, A., Maltoni, D., Maio, D.: Biometric Systems, Technology, Design and Performance Evaluation. Springer, London (2005) 8. Flom, L., Safir, A.: Iris recognition system. U.S. Patent 4,641,349 (1987) 9. Daugman, J.: High confidence visual recognition of persons by a test of statistical independence. IEEE PAMI 15(11), 1148–1161 (1993) 10. Wildes, R.P.: Automated iris recognition: An emerging biometric technology. Proceedings of the IEEE 85, 1348–1363 (1997) 11. Boles, W., Boashash, B.: A human identification technique using images of the iris and wavelet transform. IEEE Trans. Signal Proc. 4, 1185–1188 (1998) 12. Lim, S., Lee, K., Byeon, O., Kim, T.: Efficient iris recognition through improvement of feature vector and classifier. ETRI J. 23(2), 61–70 (2001) 13. Tisse, C., Martin, L., Torres, L., Robert, M.: Person identification technique using human iris recognition. In: Proc. of Vision Interface, pp. 294–299 (2002) 14. Zhu, Y., Tan, T., Wang, Y.: Biometric personal identification based on iris pattern. In: ICPR 2000: the 15th International Conference on Pattern Recognition, Barcelona, Spain, pp. 805–808 (2002) 15. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personel identification based on iris texture analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(12), 1519–1533 (2003) 16. Ma, L., Tan, T., Wang, Y., Zhang, D.: Efficient Iris Recognition by Characterizing Key Local Variations. IEEE Transactions on Image Processing 13(6), 739–750 (2004) 17. Du, Y., Ives, R., Chang, C.-I., Etter, D., Welch, T.: Information divergence-based iris pattern recognition for automatic human identification. In: Proc. SPIE, vol. 5404 (2004) 18. Du, Y., Ives, R., Chang, C.-I., Ren, H., D’Amico, F., Jensen, J.: A new hyperspectral discrimination measure for spectral similarity. Optical Engineering 43(8) (2004) 19. Daugman, J.G.: How iris recognition works. IEEE Trans. Circuits and Syst. for Video Tech. 14(1), 21–30 (2004) 20. Stone, J.V.: Independent Component Analysis. A Tutorial Introduction. A Bradford Book, MIT Press (2004) 21. Download the application form at the website, http://www.cbsr.ia.ac.cn/IrisDatabase.htm 22. Sanderson, S., Erbetta, J.: Authentication for secure environments based on iris scanning technology. IEEE Colloquium on Visual Biometrics (2000) 23. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley, Chichester (2001)
208
I. Bouraoui, S. Chitroub, and A. Bouridane
24. Choï, S., Cichocki, A., Amari, S.: Adaptative Blind Signal and Image Processing: Learning Algorithms Applications. John Wiley & Sons, Chichester (2002) 25. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Chichester (1991) 26. Papoulis, A.: Probability, Random Variables, and Stochastic Processes, 3rd edn. McGrawHill, New York (1991) 27. Comon, P.: Independent component analysis—a new concept? Signal Processing 36, 287– 314 (1994) 28. Amari, S.: Natural Gradient for over under complete bases in ICA. Neural Computation 11(8), 1875–1883 (1999) 29. Vigliano, D., Parisi, R., Uncini, A.: A flexible ICA approach to a novel BSS convolutive nonlinear problem: preliminary results. Biological and Artificial Intelligence Environments Part 3, 217–224 (2005), doi:10.1007/1-4020-3432-6_26 30. Meena, B.R.: Personal Identification based on Iris Patterns. Ph.D Thesis, Department of Computer Science and Engineering, IndianInstitute of Technology, Kanpur (2004) 31. Vatsa, M., Singh, R., Noore, A.: Reducing the False Rejection Rate of Iris Recognition Using Textural and Topological Features. International Journal of Signal Processing 2(2), 66–72 (2005) 32. Monro, D.M., Rakshit, S., Zhang, D.: DCT-Based Iris Recognition. IEEE Transactions on Pattern analysis and Machine Intelligence 29(4), 586–595 (2007) 33. Dorairaj, V., Fahmy, G., Schmid, N.: Performance evaluation of iris based recognition system implementing PCA and ICA techniques. In: Proc. of SPIE Conference on Biometric Technology for Human Identification (2005) 34. Bae, K., Noh, S., Kim, J.: Iris feature extraction using independent component analysis. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 838–844. Springer, Heidelberg (2003) 35. Oyster, C.: The Human Eye Structure and Function. Sinauer Associates (1999) 36. US NLM/NIH Medline Plus. Cataract, http://www.nlm.nih.gov/medlineplus/ency/article/001001.htm (accessed October 2006) 37. US NLM/NIH Medline Plus. Glaucoma, http://www.nlm.nih.gov/medlineplus/ency/article/001620.htm (accessed October 2006) 38. European Commission. Biometrics at the frontiers: Assessing the impact on society. Institute for Prospective Technological Studies, Technical Report EUR 21585 EN (European Commission Director-General Joint Research Centre) (February 2005) 39. US NLM/NIH Medline Plus. Albinism, http://www.nlm.nih.gov/medlineplus/ency/article/001479.htm (accessed January 2007) 40. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
Author Index
Alimazighi, Zaia 95 Amarouche, Idir Amine
95
Barhamgi, Mahmoud 95 Baumgartner, Christian 148 Bechini, Alessio 15 Benslimane, Djamal 95 Bonfante, Paola 124 Bouraoui, Imen 188 Bouridane, Ahmed 188
Lanfranco, Luisa Leonardi, Giorgio Li, Chen 73
Meo, Rosa 124 Montani, Stefania 124 Mrissa, Michael 95 Netzer, Michael
148
Pfeifer, Bernhard Chitroub, Salim 188 Cordero, Francesca 124 Couto, Francisco 40 Dander, Andreas Ghignone, Stefano Giannini, Riccardo
148 124 15
Seger, Michael 148 Silva, Fabr´ıcio A.B. 40 Silva, M´ ario J. 40 Sobolev, Andrey 1 Visconti, Alessia
Ienco, Dino
Wachtler, Thomas
Jamil, Hasan
158
148
Ramampiaro, Heri 73 Rautenberg, Philipp L. Roversi, Luca 124
Handler, Michael 148 Herz, Andreas 1 124
124 124
Zamite, Jo˜ ao
40
124 1
1