Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5740
Abdelkader Hameurlain Josef Küng Roland Wagner (Eds.)
Transactions on Large-Scale Data- and Knowledge-Centered Systems I
Volume Editors

Abdelkader Hameurlain
Paul Sabatier University
Institut de Recherche en Informatique de Toulouse (IRIT)
118, route de Narbonne
31062 Toulouse Cedex, France
E-mail: [email protected]

Josef Küng
Roland Wagner
University of Linz, FAW
Altenbergerstraße 69
4040 Linz, Austria
E-mail: {jkueng,rrwagner}@faw.at
Library of Congress Control Number: 2009932361 CR Subject Classification (1998): H.2, H.2.4, H.2.7, C.2.4, I.2.4, I.2.6
ISSN 0302-9743
ISBN-10 3-642-03721-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03721-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12738045 06/3180 543210
Preface
Data management, knowledge discovery, and knowledge processing are core and hot topics in computer science. They are widely accepted as enabling technologies for modern enterprises, enhancing their performance and their decision-making processes. Since the 1990s the Internet has been the outstanding driving force for application development in all domains. An increase in the demand for resource sharing (e.g., computing resources, services, metadata, data sources) across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability. Current decentralized systems still focus on data and knowledge as their main resource, characterized by:
- heterogeneity of nodes, data, and knowledge
- autonomy of data and knowledge sources and services
- large-scale data volumes, high numbers of data sources, users, computing resources
- dynamicity of nodes
These characteristics highlight:
(i) limitations of methods and techniques developed for centralized systems
(ii) requirements to extend or design new approaches and methods enhancing efficiency, dynamicity, and scalability
(iii) development of large-scale experimental platforms and relevant benchmarks to evaluate and validate scaling
Feasibility of these systems relies basically on P2P (peer-to-peer) techniques and agent systems supporting scaling and decentralized control. Synergy between Grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments. The objective of the international journal on Large-Scale Data- and Knowledge-Centered Systems is to provide an opportunity to disseminate original research contributions and to serve as a high-quality communication platform for researchers and practitioners. The journal contains sound peer-reviewed papers (research, state of the art, and technical) of high quality. Topics of interest include, but are not limited to:
- data storage and management
- data integration and metadata management
- data stream systems
- data/web semantics and ontologies
- knowledge engineering and processing
- sensor data and sensor networks
- dynamic data placement issues
- flexible and adaptive query processing
- query processing and optimization
- data warehousing
- cost models
- resource discovery
- resource management, reservation, and scheduling
- locating data sources/resources and scalability
- workload adaptability in heterogeneous environments
- transaction management
- replicated copy control and caching
- data privacy and security
- data mining and knowledge discovery
- mobile data management
- data grid systems
- P2P systems
- web services
- autonomic data management
- large-scale distributed applications and experiences
- performance evaluation and benchmarking
The first edition of this new journal consists of journal versions of talks invited to the DEXA 2009 conferences and further invited contributions by well-known scientists in the field. Therefore the content covers a wide range of different topics in the field. The second edition of this journal will appear in spring 2010 under the title: Datawarehousing and Knowledge Discovery (Guest editors: Mukesh K. Mohania (IBM, India), Torben Bach Pedersen (Aalborg University, Denmark), A Min Tjoa (Technical University of Vienna, Austria)). We are happy that Springer has given us the opportunity to publish this journal and are looking forward to supporting the community with new findings in the area of large-scale data- and knowledge-centered systems. In particular we would like to thank Alfred Hofmann and Ursula Barth from Springer for their valuable support. Last, but not least, we would like to thank Gabriela Wagner for her organizational work.
June 2009
Abdelkader Hameurlain Josef Küng Roland Wagner
Editorial Board
Hamideh Afsarmanesh, University of Amsterdam, The Netherlands
Francesco Buccafurri, Università Mediterranea di Reggio Calabria, Italy
Qiming Chen, HP-Lab, USA
Tommaso Di Noia, Politecnico di Bari, Italy
Georg Gottlob, Oxford University, UK
Anastasios Gounaris, Aristotle University of Thessaloniki, Greece
Theo Härder, Technical University of Kaiserslautern, Germany
Zoé Lacroix, Arizona State University, USA
Sanjay Kumar Madria, University of Missouri-Rolla, USA
Vladimir Marik, Technical University of Prague, Czech Republic
Dennis McLeod, University of Southern California, USA
Mukesh Mohania, IBM India, India
Tetsuya Murai, Hokkaido University, Japan
Gultekin Ozsoyoglu, Case Western Reserve University, USA
Oscar Pastor, Polytechnic University of Valencia, Spain
Torben Bach Pedersen, Aalborg University, Denmark
Günther Pernul, University of Regensburg, Germany
Colette Rolland, Université Paris1 Panthéon Sorbonne, CRI, France
Makoto Takizawa, Seikei University, Tokyo, Japan
David Taniar, Monash University, Australia
Yannis Vassiliou, National Technical University of Athens, Greece
Yu Zheng, Microsoft Research Asia, China
Table of Contents
Modeling and Management of Information Supporting Functional Dimension of Collaborative Networks
Hamideh Afsarmanesh, Ekaterina Ermilova, Simon S. Msanjila, and Luis M. Camarinha-Matos

A Universal Metamodel and Its Dictionary
Paolo Atzeni, Giorgio Gianforme, and Paolo Cappellari

Data Mining Using Graphics Processing Units
Christian Böhm, Robert Noll, Claudia Plant, Bianca Wackersreuther, and Andrew Zherdin

Context-Aware Data and IT Services Collaboration in E-Business
Khouloud Boukadi, Chirine Ghedira, Zakaria Maamar, Djamal Benslimane, and Lucien Vincent

Facilitating Controlled Tests of Website Design Changes Using Aspect-Oriented Software Development and Software Product Lines
Javier Cámara and Alfred Kobsa

Frontiers of Structured Business Process Modeling
Dirk Draheim

Information Systems for Federated Biobanks
Johann Eder, Claus Dabringer, Michaela Schicho, and Konrad Stark

Exploring Trust, Security and Privacy in Digital Business
Simone Fischer-Huebner, Steven Furnell, and Costas Lambrinoudakis

Evolution of Query Optimization Methods
Abdelkader Hameurlain and Franck Morvan

Holonic Rationale and Bio-inspiration on Design of Complex Emergent and Evolvable Systems
Paulo Leitao

Self-Adaptation for Robustness and Cooperation in Holonic Multi-Agent Systems
Paulo Leitao, Paul Valckenaers, and Emmanuel Adam

Context Oriented Information Integration
Mukesh Mohania, Manish Bhide, Prasan Roy, Venkatesan T. Chakaravarthy, and Himanshu Gupta

Data Sharing in DHT Based P2P Systems
Claudia Roncancio, María del Pilar Villamil, Cyril Labbé, and Patricia Serrano-Alvarado

Reverse k Nearest Neighbor and Reverse Farthest Neighbor Search on Spatial Networks
Quoc Thai Tran, David Taniar, and Maytham Safar

Author Index
Modeling and Management of Information Supporting Functional Dimension of Collaborative Networks

Hamideh Afsarmanesh¹, Ekaterina Ermilova¹, Simon S. Msanjila¹, and Luis M. Camarinha-Matos²

¹ Informatics Institute, University of Amsterdam, Science Park 107, 1098 XG, Amsterdam, The Netherlands
{h.afsarmanesh,e.ermilova,s.s.msanjila}@uva.nl
² Faculty of Sciences and Technology, New University of Lisbon, Quinta da Torre, 2829-516, Monte Caparica, Portugal
[email protected]
Abstract. Fluent creation of opportunity-based short-term Collaborative Networks (CNs) among organizations or individuals requires the availability of a variety of up-to-date information. A pre-established, properly administrated strategic-alliance Collaborative Network (CN) can act as the breeding environment for the creation and operation of opportunity-based CNs, and can effectively address the complexity, dynamism, and scalability of their actors and domains. Administration of these environments, however, requires an effective set of functionalities, founded on strong information management. The paper introduces the main challenges of CNs and their management of information, and focuses on the Virtual organizations Breeding Environment (VBE), which represents a specific form of strategic alliance. It then focuses on the functionalities needed for effective administration/management of VBEs, and exemplifies information management challenges for three of their subsystems, handling the ontology, the profiles and competencies, and the rational trust.

Keywords: Information management for Collaborative Networks (CNs), virtual organizations breeding environments (VBEs), Information management in VBEs, Ontology management, competency information management, rational trust information management.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 1–37, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

The emergence of collaborative networks, as collections of geographically dispersed autonomous actors that collaborate through computer networks, has enabled both organizations and individuals to effectively achieve common goals that go far beyond the ability of each single actor, and to provide cost-effective solutions and value-creating functionalities, services, and products. The paradigm of “Collaborative Networks (CN)”, defined during the last decade, represents a wide variety of networks of organizations as well as communities of individuals, each with distinctive characteristics and features. While the taxonomy of existing CNs, as presented later (in Section 2.1), indicates their categorical differences, some of their main characteristics are briefly introduced below.

Different collaborative networks manifest a wide diversity in structural forms, duration, behavioral patterns, and interaction forms. From the process-oriented chain structures observed in supply chains, to those centralized around dominant entities, and the project-oriented federated networks, there exists a wide range of collaborative network structures [1; 2; 3; 4]. Every structure differently influences the visibility level of each actor in the network, the intensity of its activities, co-working, and involvement in decision making. Other important variant elements for networks are their different life-cycle phases and durations. Goal-oriented networks are shorter-term and typically triggered by collaboration opportunities that arise in the market/society, as represented by the case of VOs (virtual organizations) established for a timely response to singular opportunities. Long-term networks, on the other hand, are strategic alliances/associations with the main purpose of enhancing the chances of their members to get involved in future opportunity-triggered collaboration networks, and of increasing the visibility of their actors, thus serving as breeding environments for goal-oriented networks. Examples of long-term networks are industry clusters or industrial districts, and sector-based alliances.

Regarding the types of interaction among the actors involved in collaborative networks, although there is no consensus among researchers, some working definitions are provided [5] for four main classes of interactions: networking, coordinated networking, cooperation, and collaboration. There is an intuitive notion of what collaboration represents, but this concept is often confused with cooperation. Some researchers even use the two terms interchangeably.
The ambiguities around these terms reach a higher level when other related terms, such as networking, communication, and coordination, are also considered [6; 7]. Therefore, it is relevant and important for CN research that the concepts behind these interaction terms are formalized, especially for the purpose of defining a reference model for collaborative networks, as later addressed in this paper (in Section 3). In an attempt to clarify these various concepts, based on [5], the following working definitions are proposed for these four classes of interactions, where in fact every concept defined below is itself a sub-class of the concept(s) defined above it:

Networking – involves communication and information exchange among involved parties for mutual benefit. It shall be noted that this term has a broad use in multiple contexts and often with different meanings. In the collaborative networks area of research, when referring to “enterprise network” or “enterprise networking”, the intended meaning is probably “collaborative network of enterprises”.

Coordinated Networking – in addition to the above, it involves complementarity of goals of different parties, and aligning/altering activities so that more efficient results can be achieved. Coordination, that is, the act of working together harmoniously, is one of the main components of collaboration.

Cooperation – involves not only information exchange and alignment of activities, but also sharing of some resources towards achieving compatible goals. Cooperation may be achieved by division of some labor (not extensive) among participants.

Collaboration – in addition to the above, it involves joint goals/responsibilities with specific process(es) in which parties share information, resources, and capabilities, and jointly plan, implement, and evaluate activities to achieve their common goals. It in turn implies sharing risks, losses, and rewards. If desired by its involved parties, the collaboration can also give the image of a joint identity to the outside. In practice, collaboration typically involves mutual engagement of participants to solve a problem together, which also implies reaching mutual trust, which takes time, effort, and dedication.

As addressed above, different forms of interaction are suitable for different structural forms of CNs. For example, the long-term strategic alliances are cooperative environments, since they primarily comprise actors with compatible and/or complementary goals towards which they align their activities. The shorter-term goal-oriented networks, however, require intense co-working among their actors to reach the jointly established common goals that represent the reason for their existence, and are therefore collaborative environments. On the other hand, most of the current social networks show just a networking level of interaction. From a different perspective, the above-mentioned forms of interaction can also be seen as different levels of “collaboration maturity”. Namely, the interaction among actors in the network may strengthen in time, from simple networking interaction to intense collaboration. This implies a gradual increase in the level of co-working as well as in the risk taking, commitment, invested resources, etc., by the involved participants. Therefore, operational CNs represent a variety of interactions and inter-relationships among their heterogeneous and autonomous actors, which in turn increases the complexity of this paradigm.
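
The sub-class relation among the four interaction classes, and the idea of increasing “collaboration maturity”, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `Interaction`, `ADDED_BY_LEVEL`, and `characteristics` are hypothetical and not part of the paper, and the characteristic labels are short paraphrases of the working definitions above.

```python
from enum import IntEnum


class Interaction(IntEnum):
    """The four interaction classes, ordered by increasing maturity (hypothetical encoding)."""
    NETWORKING = 1
    COORDINATED_NETWORKING = 2
    COOPERATION = 3
    COLLABORATION = 4


# What each class adds on top of the class defined before it
# (paraphrased from the working definitions; labels are illustrative).
ADDED_BY_LEVEL = {
    Interaction.NETWORKING: {"communication and information exchange"},
    Interaction.COORDINATED_NETWORKING: {"complementary goals", "aligned activities"},
    Interaction.COOPERATION: {"shared resources", "compatible goals"},
    Interaction.COLLABORATION: {
        "joint goals and responsibilities",
        "joint planning, implementation and evaluation",
        "shared risks, losses and rewards",
    },
}


def characteristics(level: Interaction) -> set:
    """Accumulate the characteristics of `level` and of every class below it."""
    result = set()
    for lower in Interaction:
        if lower <= level:
            result |= ADDED_BY_LEVEL[lower]
    return result


def includes(level: Interaction, other: Interaction) -> bool:
    """True if `level` is a sub-class of (i.e. at least as mature as) `other`."""
    return level >= other
```

With this encoding, `characteristics(Interaction.COLLABORATION)` contains everything listed for the three lower classes as well, which mirrors the statement that each concept is a sub-class of the ones defined above it.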
It shall be noted that in this paper, and in most other literature in the field, the term “collaborative networks” or “CNs” is the generic name used when referring to all varieties of such networks.

1.1 Managing the Information in CNs

In managing the information in CNs, even if all information is semantically and syntactically homogeneous, a main generic challenge is related to assuring the availability of the strategic information within the network that is required for proper coordination and decision making. This can be handled through enforcement of a push/pull mechanism and the establishment of proper mapping strategies and components between the information managed at the different sites belonging to actors in the network and all those systems (or sub-systems) that support different functionalities of the CN and its activities during its life cycle. Therefore, it is necessary that various types of distributed information are collected from the autonomous actors involved in the CNs. This information shall then be processed, organized, and made accessible within the network, both for navigation by different CN stakeholders and for processing by different software systems running at the CN. However, although the information about actors evolves in time, which is typical of dynamic systems such as CNs, and therefore needs to be kept up to date, there is no need for a continuous flow of all the information from each legacy system to the CN. Such a flow would generate a major overload on the information management systems at the CN. Rather, for effective operation and management of the CN, partial information needs to be pulled/pushed from/to legacy systems to the CN only at some intervals.

The need for access to information also varies depending on the purpose for which it is requested. These variations in turn pose a second generic information management challenge, which is related to the classification, assessment, and provision of the required information based on intended use cases in CNs. Both of the above generic information challenges in CNs are addressed in the paper, and exemplified for three key CN functionalities, namely common ontology engineering, competency management, and trust management.

A third generic information management challenge is related to modeling the variety and complexity of the information that needs to be processed by the different functionalities that support the management and operation of the CNs. While some of these functionalities deal with information that is known and stored within the sites of the network’s actors (e.g. data required for actors’ competency management), the information required for some other functionalities of CNs may be unknown, incomplete, or imprecise. For these, soft computing approaches, such as causal analysis and reasoning (addressed in Section 7.2) or other techniques introduced in computational intelligence, shall be applied to generate the needed information (e.g. data needed for trust management). The issue of modeling the information that needs to be handled in CNs is addressed in detail in the paper and also exemplified through the three example functionalities mentioned above.

There are, however, a number of other generic challenges related to the management of the CN information, and it can be expected that more challenges will be identified in time as the need for other functional components unfolds in the research on supporting the management and operation of CNs.
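
The interval-based, partial pull of information from actors’ legacy systems described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism used by the authors: `LegacySystem`, `CNRepository`, `shared_keys`, and `maybe_pull` are hypothetical names, and timestamps are plain numbers passed in by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class LegacySystem:
    """Hypothetical stand-in for an actor's local information system."""
    records: dict = field(default_factory=dict)  # key -> (value, last_modified)

    def update(self, key, value, now):
        self.records[key] = (value, now)

    def changed_since(self, keys, since):
        """Return only the requested keys that were modified after `since`."""
        return {k: v for k, (v, t) in self.records.items()
                if k in keys and t > since}


@dataclass
class CNRepository:
    """Hypothetical CN-side store that pulls partial information at intervals."""
    shared_keys: set   # the subset of information the CN needs (assumption)
    interval: float    # minimum time between two pulls from the same actor
    store: dict = field(default_factory=dict)
    last_pull: dict = field(default_factory=dict)

    def maybe_pull(self, actor, source, now):
        """Pull changed, shared records from `source` if the interval has elapsed."""
        previous = self.last_pull.get(actor, float("-inf"))
        if now - previous < self.interval:
            return {}  # too early: no continuous flow of information
        delta = source.changed_since(self.shared_keys, previous)
        self.store.setdefault(actor, {}).update(delta)
        self.last_pull[actor] = now
        return delta
```

In this sketch only the keys listed in `shared_keys` ever reach the CN, and only the records changed since the previous pull are transferred, which reflects the two points made above: the information is partial, and it is exchanged at intervals rather than continuously.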
Among other identified generic challenges, we can mention ensuring the consistency between the semantically and syntactically heterogeneous information managed locally at each organization’s legacy systems and the information managed by the management system of the CN, as well as the availability of both for access by authorized CN stakeholders (e.g. individuals or organizations) when necessary. At the heart of this challenge lies the establishment of the needed interoperation infrastructure, as well as a federated information management system supporting the inter-linking of autonomous information management systems. Furthermore, challenges related to update mechanisms among autonomous nodes are relevant. Nevertheless, these generic challenges are in fact common to many other application environments, are not specific to the CN information management area, and fall outside the scope of this paper.

The remaining sections are organized as follows. Section 2 addresses collaborative networks by presenting a taxonomy for them and describing the main requirements for establishing CNs, while emphasizing their information management and modeling aspects. Section 3 addresses the ARCON reference model for collaborative networks, focusing only on its endogenous elements. Section 4 narrows down on the details of the functional dimension of the endogenous elements and exemplifies it for one specific kind of CN, i.e. the management system of the VBE strategic alliance. Specific examples of modeling and management of information are then provided in Sections 5, 6, and 7 for three subsystems of a VBE management system, addressing VBE ontology engineering, management of profiles and competencies in VBEs, and assessment and management of trust in VBEs. Section 8 concludes the paper.
Information Supporting Functional Dimension of Collaborative Networks
2 Establishing Collaborative Networks

Successful creation and management of inter-organizational and inter-personal collaborative networks is challenging. Cooperation and collaboration in CNs, although having the potential to bring considerable benefits, or even to represent a survival mechanism for the involved participants, are difficult processes (as explained in Section 2.2), which quite often fail [8; 9]. Therefore, there are a number of requirements that need to be satisfied to increase their chances of success. Clearly, the severity of each requirement depends on the type and specificities of the CN. In other words, the nature, goal, and vision of each case determine its critical points and requirements. For instance, for the Virtual Laboratory type of CN, setting up the common collaboration infrastructure a priori and maintaining this infrastructure afterwards pose some of the main challenges. However, for another type of CN, one focused on international decision making on environmental issues such as global warming, setting up and maintaining the collaboration infrastructure is not too critical, while the provision of mediation mechanisms and tools to support the building of trust among the CN actors and the reaching of agreements on the definition of common CN policies poses some of the main challenges. It is therefore important to briefly discuss the various types of CNs before addressing their requirements, which is done below through the taxonomy of the CNs.

2.1 Taxonomy and Working Definitions for Several Types of CN

A first CN taxonomy is defined in [10], addressing the large diversity of manifestations of collaborative networks in different application domains. A set of working definitions for the terms addressed in Fig. 1 is also provided in [5].
A few of these definitions that are necessary for the later sections of this paper are quoted below from [5]; namely the definitions of collaborative networks (CN), virtual organizations breeding environments (VBE), virtual organizations (VO), etc.

“A collaborative network (CN) is a network consisting of a variety of actors (e.g. organizations and people) that are largely autonomous, geographically distributed, and heterogeneous in terms of their operating environment, culture, social capital and goals, but that collaborate to better achieve common or compatible goals, and whose interactions are supported by computer network.”

“Virtual Organization (VO) – represents an alliance comprising a set of (legally) independent organizations that share their resources and skills, to achieve their common mission / goal, but that is not limited to an alliance of profit enterprises. A virtual enterprise is therefore, a particular case of virtual organization.”

“Dynamic Virtual Organization – typically refers to a VO that is established in a short time in order to respond to a competitive market opportunity, and has a short life cycle, dissolving when the short-term purpose of the VO is accomplished.”

“Long-term strategic network or breeding environments – a strategic alliance established with the purpose of being prepared for participation in collaboration opportunities, and where in fact not collaboration but cooperation is practiced among their members. In other words, they are alliances aimed at offering the conditions and environment to support rapid and fluid configuration of collaboration networks, when opportunities arise.”
[Figure 1 (diagram not reproduced): taxonomy tree rooted at Collaborative Network (CN), showing main classes and examples. Node labels include: Collaborative Networked Organization (CNO), Ad-hoc Collaboration, Long-term strategic network, Goal-oriented network, VO Breeding Environment (VBE), Professional Virtual Community (PVC), Community of Active Senior Professionals (CASP), Continuous activity driven net, Grasping opportunity driven net, Virtual Organization (VO), Dynamic VO, Virtual Enterprise (VE), Extended enterprise, Virtual Team (VT), Industry cluster, Industrial district, Business ecosystem, Collaborative virtual lab, Disaster rescue net, Inter-continental enterprise alliance, Supply chain, Dynamic Supply Chain, Virtual government, Collaborative transportation network, Disperse manufacturing.]
Fig. 1. Taxonomy of Collaborative Networks
“VO Breeding Environments (VBE) – represents “strategic” alliance of organizations (VBE members) and related supporting institutions (e.g. firms providing accounting, training, etc.), adhering to a base long-term cooperation agreement and adopting common operating principles and infrastructures, with the main goal of increasing both their chances and preparedness of collaboration in potential VOs”.

“Profession Virtual Communities (PVC) is an alliance of professional individuals, and provide an environment to facilitate the agile and fluid formation of Virtual Teams (VTs), similar to what a VBE aims to provide for the VOs.”

“Virtual Team (VT) is similar to a VO but formed by individuals, not organizations, as such a virtual team is a temporary group of professionals that work together towards a common goal such as realizing a consultancy job, a joint project, etc., and that use computer networks as their main interaction environment.”

2.2 Base Requirements for Establishing CNs

A generic set of requirements, including (1) definition of a common goal and vision, (2) performing a set of initiating actions, and (3) establishing a common collaboration space, represents the base pre-conditions for setting up the CNs. Furthermore, after the CN is initiated, its environment needs to operate properly. Here, the coordination and management of the CN, as well as reaching the needed agreements among its actors for performing the needed tasks, represent another set of challenges, including (1) performing coordination, support, and management of activities, and (2) achieving agreements and contracts. The following five sub-sections briefly address these main basic requirements as identified and addressed within the CN area of research, while emphasizing their information management challenges in italic.
2.2.1 Defining a Common Goal and Vision

Collaboration requires the pre-existence of a motivating common goal and vision to represent the joint/common purpose for establishment of the collaboration. In spite of
all difficulties involved in the process of cooperation / collaboration, the motivating factor for establishing the CNs is the expectation of being able to reach results that could not be reached by the involved actors if working alone. Therefore, the common goal and vision of the CN represents its existential purpose, and represents the motivation for attracting actors to the required cooperation/collaboration processes [5]. At present, the information related to the common goal and vision of the CNs is typically stored in textual format and is made available to the public through proper interfaces. Establishing a well-conceived vision, however, needs the involvement of all actors in the network. To properly participate in formulating the vision, the actors need to be well informed, which in turn requires the availability of up-to-date information regarding many aspects of the network. Both the management of the information required for visioning and the assurance of its effective accessibility to all actors within the network are challenging, as later addressed through the development of an ontology for CNs.

2.2.2 Performing a Set of Initiating Actions

There are a number of initiating actions that need to be taken as a pre-condition to establishing CNs. These actions are typically taken by the founder(s) of the CN and may include [11; 12]:
- identifying interested parties and bringing them together;
- defining the scope of the collaboration and its desired outcomes;
- defining the structure of the collaboration in terms of leadership, roles, and responsibilities;
- setting the plan of actions in terms of access to resources, task scheduling and milestones, and the decision-making plan;
- defining policies, e.g. for handling disagreements / conflicts, accountability, rewards and recognition, ownership of generated assets, and intellectual property rights;
- defining the evaluation / assessment measures, mechanisms, and process; and
- identifying the risks and planning contingency measures.
Typically, most information related to the initiating actions is strategic and considered proprietary, to be accessed only by the CN’s administration. The classification of information in CNs to ensure its confidentiality and privacy, while guaranteeing access at the level required by each CN stakeholder, is a challenging task for the information management system of the network’s administration.

2.2.3 Substantiating a Common Collaboration Space

Establishing CNs requires the pre-establishment of their common collaboration space. In this context, we define the term collaboration space as a generic term to address all needed elements, principles, infrastructure, etc. that together provide the needed environment for CN actors to be able to cooperate/collaborate with each other. Establishment of such spaces is needed to enable and facilitate the collaboration process. Typically it addresses the following challenges:
- Common concepts and terminology (e.g. common meta-data defined for databases or an ontology, etc., specifying the collaboration environment and purpose) [13].
- Common communication infrastructure and protocols for interaction and data/information sharing and exchange (e.g. the internet, GRID, open or commercial tools and protocols for communication and information exchange, document management systems for information sharing, etc.) [14].
- Common working and sharing principles, value system, and policies (e.g. procedures for cooperation/collaboration and sharing different resources, assessment of collaboration preparedness, measurement of the alignment between value systems, etc.) [15; 16; 17]. The CN related principles and policies are typically modeled and stored by its administration and made available to all its stakeholders.
- Common set of base trustworthiness criteria (e.g. identification, modeling, and specification of the periodically required measurements related to some common aspects of each actor, which shall fall above a certain threshold for all stakeholders in order to ensure that all joining actors as well as the existing stakeholders possess a minimum acceptable trust level) [18]. It is necessary to model, specify, store, and manage the entities and concepts related to trust establishment and their measurements for different actors.
- Harmonization/adaptation of heterogeneities among stakeholders due to external factors, such as those related to actors from different regions involved in virtual collaboration networks, e.g. differences in time, language, laws/regulations, and socio-cultural aspects [19]. Some of these heterogeneities affect the sharing and exchange of information among the actors in the network, for which proper mappings and/or adaptors shall be developed and applied.

There are certain other specific characteristics of CNs that need to be supported by their common collaboration space. For example, some CNs may require simultaneous or synchronous collaboration, while others depend on asynchronous collaboration. Although remote/virtual collaboration, which may involve both synchronous and asynchronous interactions, is the most relevant case in collaborative networks, some CNs may require the co-location of their actors [20].
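The base trustworthiness criteria in the list above amount to a threshold test over periodically measured values. The following minimal sketch shows that test only; the criterion names and threshold values are hypothetical placeholders, and treating a missing measurement as a failure is an assumption made here, not a rule stated in [18].

```python
def meets_base_trust(measurements, thresholds):
    """Check one actor's measurements against the CN's base trust criteria.

    Both arguments map a criterion name to a number. A criterion with no
    measurement counts as failed (an assumption of this sketch).
    Returns (ok, failed), where failed lists the criteria not satisfied.
    """
    failed = [
        name
        for name, minimum in thresholds.items()
        if measurements.get(name, float("-inf")) < minimum
    ]
    return (not failed, failed)


# Hypothetical criteria and minimum levels, for illustration only.
base_thresholds = {"on_time_delivery": 0.8, "financial_stability": 0.6}

applicant = {"on_time_delivery": 0.9, "financial_stability": 0.7}
ok, failed = meets_base_trust(applicant, base_thresholds)
```

The same check can be re-run at each measurement period for existing stakeholders, so that joining actors and current members are held to one common minimum.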
2.2.4 Substantiating Coordination, Supporting, and Management of Activities

A well-defined approach is needed for the coordination of CN activities, and consequently the establishment of mechanisms, tools, and systems is required for common coordination, support, and management of activities in the CN. A wide range of approaches can be considered for the coordination of CNs, among which one would be selected for each CN based on its common goal and vision. Furthermore, depending on the coordination approach selected for each CN, the management of its activities requires different supporting mechanisms and tools. For instance, on one side of the spectrum, for the voluntary involvement of biodiversity scientists in addressing a topic of public interest in a community, a self-organized strategic alliance may be established. For this CN, a federated structure/coordination approach can be employed, where all actors have equal rights in decision making as well as in suggesting ideas for the next steps/plans for the management of the CN, which will be voted on in this community. On the other end of the spectrum, however, for car manufacturing, a goal-oriented CN may be established for which a fully centralized coordination approach can be applied, using a star-like management approach where most activities of the CN actors are fully guided and measured by one entity in the network, with almost no involvement from others. In practice, all current goal-oriented CNs typically fall somewhere in between these two extreme cases. For the long term strategic CNs, the current trend is towards
establishing different levels of roles and involvement for actors in leading and decision making at the CN level, with a centralized management approach that primarily aims to support CN actors with their activities and to guide them towards better performance. Short-term goal-oriented CNs, on the other hand, vary in their coordination approach and management. For instance, in the product/services industry these CNs are typically centralized in their coordination to the extent possible, and are managed in the style of a single organization. CNs in research show a different picture. For example, in EC-funded research projects, the coordination of the consortium organized for the project is usually assumed by one or a few actors that represent the CN to the outside, but internally the management of activities is far more decentralized, and decision making is in many cases done in a federated manner and through voting. Nevertheless, no matter which coordination approach is adopted, in order for CNs to operate successfully, their management requires a number of supporting tools and systems, which shall be determined and provided in advance of the establishment of the CNs. This subject is further addressed in Section 3 of this paper, where the functional dimension of the CNs and specifically the main required functionality for management of the long term strategic alliances are enumerated and exemplified. As one example, in almost all CNs, the involved actors need to know about each other’s capabilities, capacities, resources, etc., which are referred to as the competency of the involved actors in [21]. In breeding environments, either VBEs or PVCs, for instance, such competency information constitutes the base for the partner search by the broker/planner, who needs to match partners’ competencies against the characterization of an emerged opportunity in order to select the best-fit partners.
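The partner search just described, matching competencies against the characterization of an opportunity, can be sketched as a simple coverage ranking. This is only one possible reading: the function name `rank_partners`, the representation of competencies as plain sets of labels, and the coverage score are assumptions made here, and real competency models (such as the one referred to in [21]) are considerably richer.

```python
def rank_partners(required, competencies):
    """Rank actors by the fraction of required competencies they cover.

    required: iterable of competency labels characterizing the opportunity.
    competencies: dict mapping an actor id to the labels that actor offers.
    Returns a list of (actor, coverage) pairs, best coverage first.
    """
    required = set(required)
    if not required:
        return []  # nothing to match against
    scores = {
        actor: len(required & set(offered)) / len(required)
        for actor, offered in competencies.items()
    }
    # Highest coverage first; ties broken by actor id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical actors and competency labels, for illustration only.
vbe_members = {
    "org-a": {"design", "casting"},
    "org-b": {"casting", "milling", "assembly"},
    "org-c": {"logistics"},
}
ranking = rank_partners({"casting", "milling"}, vbe_members)
```

A broker would typically combine such a ranking with capacity, availability, and trust information before proposing a consortium.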
Similarly, as an antecedent to any collaboration, some level of trust must pre-exist among the involved actors in the CN, and it needs to be gradually strengthened depending on the purpose of the cooperation/collaboration. Therefore, as addressed in [18], as a part of the CN management system, rational measurement of the performance and achievements of CN actors can be applied to determine the trustworthiness of its members from different perspectives. Considering these and other functionalities needed for effective management of the CNs, the classification, storage, and manipulation of their related information, e.g. for the competencies of actors and their trust-related criteria, need to be effectively supported, which is challenging.

2.2.5 Achieving Agreements and Contracts among Actors

Successful operation of the CN requires reaching common agreements/contracts among its actors [12; 22]. At the point of joining the CN, actors must agree on its common goals and agree to follow its vision during the collaboration process, towards the achievement of the common goal. They must also agree with the common collaboration space established for the CN, including the common terminology, communication infrastructure, and its working and sharing principles. Additionally, through the common collaboration space, a shared understanding of the problem at hand, as well as of the nature/form of sharing and collaboration at the CN level, should be achieved. Further on, clear agreements should be reached among the actors on the distribution of tasks and responsibilities, the extent of commitments, the sharing of resources, and the distribution of both the rewards and the losses and liabilities. Some details in relation
to these challenges are addressed in [23; 24]. The ownership and sharing of resources shall be dealt with, whether it relates to resources brought in by CN actors or resources acquired by the coalition for the purpose of performing the tasks. Successful collaboration depends on the sharing of responsibilities by its actors. It is as important to have a clear assignment of responsibilities during the process of achieving the CN goals as afterwards, in relation to liabilities for the achieved results. The level of commitment of actors shall also be clearly defined, e.g. whether or not all actors are collectively responsible for all results. Similarly, the division of gains and losses shall be agreed on by the CN actors. Here, depending on the type of CN, its value system, and the area in which it operates, a benefit/loss model shall be defined and applied. Such a model shall address the perception of “exchanged value” in the CN and the expectations and commitment of its members. For instance, the creation of intellectual property at the CN is in most cases not linearly related to the proportion of resources invested by each actor. Therefore, a fair way of determining the individual contribution to the results of the CN shall be established and applied to the benefit/loss model for the CN. Due to their relevance and importance for the successful operation of the CNs, detailed information about all agreements and contracts established with its actors is stored and preserved by the CN administration. Furthermore, some CNs with advanced management systems model and store these agreements and contracts within a system so that they can be semi-automatically enforced. For example, such a system can issue automatic warnings when a CN actor has not fulfilled, or has violated, some time-bound terms of its agreement/contract.
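The semi-automatic enforcement mentioned above can be pictured as a periodic scan over stored contract terms. The sketch below is an illustration under stated assumptions, not a description of an existing CN management system: the `Term` record, its fields, and the wording of the warning are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Term:
    """Hypothetical record of one time-bound term of an agreement."""
    actor: str
    obligation: str
    due: date
    fulfilled: bool = False


def overdue_warnings(terms, today):
    """Return one warning per term that is past its due date and unfulfilled."""
    return [
        f"{t.actor}: '{t.obligation}' was due {t.due.isoformat()}"
        for t in terms
        if not t.fulfilled and t.due < today
    ]


# Hypothetical terms, for illustration only.
stored_terms = [
    Term("org-a", "deliver design", date(2009, 3, 1)),
    Term("org-b", "submit report", date(2009, 3, 1), fulfilled=True),
    Term("org-c", "ship parts", date(2009, 6, 1)),
]
warnings = overdue_warnings(stored_terms, date(2009, 4, 1))
```

The enforcement stays semi-automatic in this sketch: the system only raises warnings, and deciding on consequences remains with the CN administration.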
Organizing, processing, and interfacing the variety of information to different stakeholders, required to support both reaching agreements and enforcing them, is quite challenging.

2.3 Relation between the Strategic-Alliance CNs and the Goal-Oriented CNs

Scarcity of resources / capacities owned by actors is at the heart of the motivation for collaboration. For instance, large organizations typically hesitate to collaborate with others when they own sufficient resources and skills to fully respond to emerging opportunities. On the other hand, due to the lack of needed resources and skills, SMEs in different sectors increasingly tend towards collaboration and joining their efforts. Therefore, a main motivation for the establishment of CNs is to create a larger set of resources and skills, in order to compete with others and to survive in turbulent markets. Even in nature, we can easily find natural alliances among many different species (e.g. bees, ants, etc.), which form communities and collaborate to compete for increasing both their resources and their power, as needed for their survival [25]. Therefore, in today’s market/society, we like to call the “scarcity of resources (e.g. capabilities/capacities) the mother of collaborative networks”. Nevertheless, even though a bigger pool of resources and skills is generated through collaboration among individuals or organizations, such pools face variable needs as market conditions evolve and are still limited, and should therefore be dealt with through careful and effective planning. But unlike the case of a single actor that is self-concerned in its decision making, e.g. to approach or not approach an opportunity, in the case of collaborative networks decision making on this issue is quite challenging and is usually addressed by a stakeholder acting as the broker/planner of goal-oriented
CNs. Further to this decision, there are a large number of other challenges involved in the creation phase of goal-oriented CNs, e.g. selecting the best-fit partners for an emerged opportunity, namely finding the best potential actors through effective matching of their limited resources and skills against the required characteristics of the emerged opportunity. Other challenges include the setting up of the common infrastructure, etc., as addressed in the previous section. Additionally, trust, which is a fundamental requirement for any collaboration, is established through a long-term process and cannot be satisfied if potential participants have no prior knowledge of each other. Many of these challenges either become serious inhibitors to the mere establishment of goal-oriented CNs by their broker/planner, or constitute a serious cause of their failure in the later stages of the CN’s life cycle [11]. As one solution approach, both research and practice have shown that the creation/foundation of goal-oriented short term CNs to respond to emerging opportunities can greatly benefit from the pre-existence of a strategic alliance/association of actors, and thereby become both cost and time effective. A line of research in the CN discipline is therefore focused on these long-term alliances. It starts with the investigation of the existing networks that act as such associations, the so called 1st generation strategic alliances, but focuses specifically on expanding their roles and operations in the market/society, and thus on the modeling and development of the next generation of such associations, the so called 2nd generation VBEs [26]. Research on strategic alliances of organizations and individuals on one hand focuses on providing the support environment and functionalities, tools, and systems that are required to improve the qualification and positioning of this type of CN in the market/society in accordance with its own goal, vision, and value system.
Besides defining the common goal/vision, performing the needed initiating actions, and establishing the common collaboration space, the alliance plays the main role in coordinating activities of the association, and achieving agreements among its actors, towards their successful establishment of goal-oriented CNs. Therefore, a part of the research in this area focuses on establishing a strong management system for these types of CN [27], introducing the fundamental functionality and information models needed for their effective operation and evolution. These functionalities, also addressed later in this paper, address the management of information needed for effective day-to-day administration of activities in breeding environments, e.g. the engineering of CN ontology, management of actors’ competencies and profiles, and specification of the criteria and management of information related to measurement of the level of trust in actors in the alliance. Furthermore, a number of specific subsystems are needed in this environment to support the creation of goal-oriented short term CNs, including the search for opportunities in the market/society, matching opportunities against the competencies (resources, capacities, skills, etc.) available in the alliance, and reaching agreement/negotiation among the potential partners. On the other hand, this area of research focuses on measuring and improving the properties and fitness of the involved actors in the strategic alliance as a part of the goals of these breeding environments, aiming to further prepare and enable them for participation in future potential goal-oriented CNs. As addressed later in Section 4.1, the effective management of strategic alliance type of CNs heavily depends on building and maintaining strong information management systems to support their daily activities and variety of functionalities that they provide to their stakeholders.
3 Collaborative Networks Reference Model

Recent advances in the definition of the CN taxonomy as well as the reference modeling of the CNs are addressed in [5; 28], and fall outside the scope of this paper. However, this section aims to provide a brief introduction to the ARCON reference model defined for the CNs. The reference model in turn provides the base for developing the CN ontology, as well as for modeling some of the base information that needs to be handled in the CNs. This section further focuses on the endogenous perspective of the CNs, while Section 4 focuses specifically on the functional dimension of the CNs. Then Sections 5, 6, and 7 narrow down on the management of information for several elements of the functional dimension.

3.1 ARCON Reference Model for Collaborative Networks

Reference modeling of CNs primarily aims at facilitating co-working and co-development among their different stakeholders from multiple disciplines. It supports the reusability and portability of the defined concepts, thus providing a model that can be instantiated to capture all potential CNs. Furthermore, it shall provide insight into the modeling tools/theories appropriate for different CN components, and the base for the design and building of the architectural specifications of CN components. Inspired by the modeling frameworks introduced earlier in the literature related to collaboration and networking [3; 4; 29; 30], and considering the complexity of CNs [11; 10; 31], the ARCON (A Reference model for Collaborative Networks) modeling framework was developed to address their wide variety of aspects, features, and constituting elements. The reference modeling framework of ARCON aims at simplicity, comprehensiveness, and neutrality. With these aims, it first divides the CN’s complexity into a number of perspectives that comprehensively and systematically cover all relevant aspects of the CNs.
At the highest level of abstraction, the three perspectives of environment characteristics, life cycle, and modeling intent are identified and defined for the ARCON framework, respectively constituting the X, Y, and Z axes of the diagrammatic representation of the ARCON reference model. First, the life cycle perspective captures the five main stages of the CNs’ life cycle, namely the creation, operation, evolution, metamorphosis, and dissolution stages. Second, the environment characteristics perspective consists of two subspaces: the “Endogenous Elements subspace”, capturing the characteristics of the internal elements of CNs, and the “Exogenous Interactions subspace”, capturing the characteristics of the external interactions of the CNs with their logical surroundings. Third, the modeling intent perspective captures different intents for the modeling of CN features, specifically addressing three possible modeling stages: general representation, specific modeling, and implementation modeling. All three perspectives and their elements are addressed in detail in [1]. To enhance the understanding of the content of this paper, below we briefly address only the environment characteristics perspective, and then focus on the endogenous subspace. For more details on the life cycle and modeling intent perspectives, as well as
the description of elements of the exogenous subspace, please refer to the above mentioned publication.

3.1.1 Environment Characteristics Perspective – Endogenous Elements Subspace

To comprehensively represent its environment characteristics, the reference model for CNs shall include both its Endogenous Elements and its Exogenous Interactions [1]. Here we focus on the endogenous elements of the CN. For more details on any of these issues, the reader is referred to the above reference on the ARCON reference model. Abstraction and classification of the CN’s endogenous elements is challenging due to the large number of their distinct and varied entities, concepts, functionality, rules and regulations, etc. For instance, every CN participant can play a number of roles and have different relationships with other CN participants. Furthermore, there are certain rules of behavior that either constitute the norms in the society/market or are set internally by the CN, and that shall be obeyed by the CN participants. Needless to say, in every CN there is a set of activities and functionalities needed for its operation and management that also need to be abstracted in its reference model.

The Endogenous Elements subspace of ARCON aims at the abstraction of the internal characteristics of CNs. To better characterize this diverse set of internal aspects of CNs, four orthogonal dimensions are proposed and defined, namely the structural, componential, functional, and behavioral dimensions:
• E1 - Structural dimension. Addressing the composition of the CN’s constituting elements, namely the actors (primary or support), roles (administrator, advisor, broker, planner, etc.), relationships (trusting, cooperation, supervision, collaboration, etc.), and network topology (self and potentially sub-network), etc.
• E2 - Componential dimension.
Addressing the individual tangible/intangible CN elements, namely domain specific devices, ICT resources (hardware, software, networks), human resources, collected information, knowledge (profile and competency data, ontologies, bag of assets, etc.), and its accumulated assets (data, tools, etc.), etc.
• E3 - Functional dimension. Addressing the “base functions / operations” that run to support the network, time-sequenced flows of executable operations (e.g. processes for the management of the CN, processes to support the participation and activities of members in the CN), and methodologies and procedures running at the CN (network set up procedure, applicant’s acceptance, CN dissolution and inheritance handling, etc.), etc.
• E4 - Behavioral dimension. Addressing the principles, policies, and governance rules that either drive or constrain the behavior of the CN and its members over time, namely principles of governance, collaboration and rules of conduct (prescriptive or obligatory), contracts and agreements, and constraints and conditions (confidentiality, conflict resolution policies, etc.), etc.

A diagrammatic representation of the cross between the life-cycle perspective and the Endogenous Elements, exemplifying some elements of each dimension, is illustrated in Fig. 2.
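The crossing of the four endogenous dimensions with the life cycle stages can be held in a small data structure, sketched below. The dimension codes E1 to E4 and the five stage names come from the text; everything else, including the `ModelElement` record, the `elements_for` helper, and the stage assignments given to the sample elements, is an assumption made for illustration and is not part of ARCON as defined in [1].

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The four endogenous dimensions, with the codes used in the text."""
    STRUCTURAL = "E1"
    COMPONENTIAL = "E2"
    FUNCTIONAL = "E3"
    BEHAVIORAL = "E4"


class LifeCycleStage(Enum):
    """The five life cycle stages named in the text."""
    CREATION = 1
    OPERATION = 2
    EVOLUTION = 3
    METAMORPHOSIS = 4
    DISSOLUTION = 5


@dataclass(frozen=True)
class ModelElement:
    """Hypothetical record placing one CN element in the crossing."""
    name: str
    dimension: Dimension
    stages: frozenset  # life cycle stages in which the element is relevant


def elements_for(elements, dimension, stage):
    """Select the elements of one dimension that apply at one stage."""
    return [e for e in elements if e.dimension is dimension and stage in e.stages]


# Sample elements; their stage assignments are illustrative guesses.
catalogue = [
    ModelElement("network set-up procedure", Dimension.FUNCTIONAL,
                 frozenset({LifeCycleStage.CREATION})),
    ModelElement("dissolution and inheritance handling", Dimension.FUNCTIONAL,
                 frozenset({LifeCycleStage.DISSOLUTION})),
    ModelElement("roles", Dimension.STRUCTURAL, frozenset(LifeCycleStage)),
]
creation_functions = elements_for(
    catalogue, Dimension.FUNCTIONAL, LifeCycleStage.CREATION
)
```

Keeping the dimension and the stages as separate fields mirrors the orthogonality claimed for the dimensions: an element is classified once, and then located along the life cycle independently.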
H. Afsarmanesh et al.
[Figure 2: diagram crossing the CN life-cycle stages (Creation, Operation, Evolution, Metamorphosis / Dissolution) with the four dimensions of the Endogenous Elements (Endo-E). Example elements shown per dimension: E1. Structural: participants, roles, relationships, network topology; E2. Componential: hardware/software resources, human resources, information/knowledge resources, ontology resources; E3. Functional: processes, auxiliary processes, procedures, methodologies; E4. Behavioral: prescriptive behavior, obligatory behavior, constraints and conditions, contracts and agreements. © H. Afsarmanesh & L.M. Camarinha-Matos 2007]
Fig. 2. Crossing CN life cycle and the Endogenous Elements perspective [1]
The remainder of this paper focuses only on the functional dimension of the CN, and specifically addresses in more detail the functionality required for effective management of long-term strategic alliances. To exemplify the involved complexity, it then narrows its focus to the management of information required to support three specific functionalities within the functional dimension of this type of CN: ontology engineering, profile and competency management, and trust management. Specifically, it addresses the collection, modeling, and processing of the information needed by these functionalities, which constitute three sub-systems of the management system for this type of CN.
4 Functional Dimension of Collaborative Networks

The detailed elements in the ARCON reference model that comprehensively represent the functional dimension of CNs are addressed in [1], which also presents instantiations of the functional dimension for both the long-term strategic alliances and the shorter-term goal-oriented networks. This section specifically focuses on the long-term strategic alliances of organizations or individuals. It first addresses the set of functionalities that are necessary both for managing the daily operation of strategic alliances and for supporting their members with their participation and activities in this type of CN.

Managing a variety of heterogeneous and distributed information is required within strategic alliances, such as VBEs and PVCs, to support their operation stage, as characterized in the functional dimension of these two types of CNs. For such networks to succeed, their administration needs to collect a wide variety of information, partially from the involved actors and partially from the network environment itself, classify and organize this information to fit the needs of the supporting sub-systems, and continuously keep it updated and complete to the extent possible [32].
Information Supporting Functional Dimension of Collaborative Networks
Current research indicates that while the emergence of CNs promises to improve the chances of success for their actors in the current turbulent market/society, it poses many challenges related to supporting their functional dimension. This in turn results in challenges for capturing, modeling, and managing the information within these networks. Some of the main functions required to support both the management and the daily operation of strategic alliances include: (i) engineering of the network ontology, (ii) classification and management of the profiles and competencies of actors in the network, (iii) establishing and managing rational trust in the network, (iv) matching partners’ capabilities/capacities against the requirements of a collaboration opportunity, (v) reaching agreements (negotiation) for collaboration, and (vi) collection of the assets (data, software tools, lessons learned, best practices, etc.) gathered and generated in the network, and management of the components in such a bag of assets, among others [11; 27]. A brief description of a set of main functionalities is provided in the next sub-section.

4.1 Functionalities and Sub-systems Supporting Strategic Alliances

Research and development on digital networks, particularly the Internet, addresses challenges related to the online search for information and the sharing of expertise and knowledge between organizations and individuals, irrespective of their geographical locations. This in turn paves the way for collaborative problem solving and co-creation of services and products, which go far beyond the traditional inter-organizational or inter-personal co-working boundaries and geographical constraints, and raises challenging questions about how to manage information to support the cooperation of organizations and individuals in CNs. A set of functionalities is required to support the operation stage of strategic alliances.
In particular, supporting the daily management of CN activities and actors, as well as the agile formation of goal-oriented CNs to address emerging opportunities, is challenging. Furthermore, these functionalities handle a variety of information and thus need effective management of the gathered information, considering the geographical distribution and heterogeneous nature of the CN actors, such as their applied technologies, organizational culture, etc.

Due to the specificities of the functionalities required for the management of strategic alliances, one large management system for this purpose is difficult to realize and maintain. A distributed architecture is therefore typically considered for their development. Applying the service orientation approach, a number of interoperable, independent sub-systems can be developed and applied to support the management of collaboration-related information in the strategic alliances. As an example of such development and required functionality, as addressed in [27], a so-called VBE management system (VMS) is designed and implemented, constituting a number of interoperable sub-systems. These sub-systems either directly support the daily management of the VBE operation [33], or are developed to assist the opportunity broker and the VO planner with effective configuration and formation of VOs in the VBE environment [34]. In Fig. 3, eight specific functionalities address the sub-systems supporting the management of the daily operation of VBEs, while four specific functionalities, appearing inside the VO creation box, address the sub-systems supporting different aspects related to the creation of VOs.
Focusing on their information management aspects, each of these sub-systems provides a set of services related to the explicit access, retrieval, and manipulation of information for different specific purposes. These sub-systems interoperate through exchanging their data, and together provide an integrated management system as shown in Fig. 3. For each sub-system illustrated in this figure, a brief description is provided below, while more details can be found in the above two references.
[Figure 3: diagram of the VMS sub-systems and the numbered data transfers among them. Sub-systems shown: MSMS (member registration, rewarding), ODMS, PCMS, TrustMan, DSS (low performance, lack of competency, low trust), VIMS (VO registration, VO inheritance), VOMS, SIMS, BAMS, value system, and, inside the VO creation box, CO-Finder, COC-Plan, PSS, and WizAN.

Main users/editors of data in the systems/tools: M = VBE Member, A = VBE Administrator, B = Broker, S = Support Institution Manager.

Data transfers:
1 Profile/competency classification
2 Profile/competency element classification
3 Member’s competency specification
4 Competency classes
5 Low base trust level of organizations
6 Members’ general data
7 Base trust level of membership applicants
8 Specific trustworthiness of VO partners
9 Organizations’ performance data from the VO
10 Collaborative opportunities’ definitions
11 VO model
12 VO model and candidate partners
13 VO model and VO partners
14 Processed VO inheritance
15 Support institutions’ general data
16 Asset contributors’ general data
17 VO inheritance]
Fig. 3. VMS and its constituent subsystems
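To make the idea of independent sub-systems that interoperate by exchanging data more concrete, a minimal sketch is given below. It is not the VMS implementation: the in-process publish/subscribe hub, the class names, and the wiring of which sub-system consumes which data item are all assumptions made for illustration. Only the sub-system names and the data label (“Members’ general data”) come from Fig. 3.

```python
from collections import defaultdict


class DataExchange:
    """Minimal in-process publish/subscribe hub (a hypothetical stand-in
    for service-oriented data exchange between VMS sub-systems)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, data_item, handler):
        self._subscribers[data_item].append(handler)

    def publish(self, data_item, payload):
        # Deliver the payload to every sub-system that registered interest.
        handlers = self._subscribers[data_item]
        for handler in handlers:
            handler(payload)
        return len(handlers)


class SubSystem:
    """A sub-system that simply records the data it receives."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def receive(self, payload):
        self.received.append(payload)


hub = DataExchange()
pcms = SubSystem("PCMS")
trustman = SubSystem("TrustMan")

# Assumed wiring: both sub-systems consume members' general data.
hub.subscribe("members_general_data", pcms.receive)
hub.subscribe("members_general_data", trustman.receive)

delivered = hub.publish(
    "members_general_data", {"member": "Org-A", "sector": "metalworking"}
)
```

In this sketch the publishing sub-system does not need to know its consumers, which is one way of keeping sub-systems independently maintainable.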
Membership Structure Management System (MSMS): Collecting and analyzing the applicants’ information as a means to ascertain their suitability for the VBE has proved particularly difficult. This sub-system provides services which support the integration, accreditation, disintegration, rewarding, and categorization of members within the VBE.

Ontology Discovery and Management System (ODMS): In order to systematize all VBE-related concepts, a generic/unified VBE ontology needs to be developed and managed. The ODMS provides services for the manipulation of VBE ontologies, which is required for the successful operation of the VBE and its VMS, as further addressed in Section 5.

Profile and Competency Management System (PCMS): In VBEs, several functionalities need to access and process the information related to members’ profiles and competencies. PCMS provides services that support the creation, submission, and
maintenance of profiles and detailed competency-related elements of the involved VBE organizations, as well as categorizing collective VBE competencies and organizing the competencies of VOs registered within the VBE, as further addressed in Section 6.

Trust Management system (TrustMan): Supporting the VBE stakeholders, including the VBE administration and members, in tasks related to the analysis and assessment of the rational trust level of other organizations, such as the selection of best-fit VO partners, is of great importance for the successful management and operation of VBEs, as further addressed in Section 7.

Decision Support Systems (DSS): The decision-making process in a VBE needs to involve a number of actors whose interests may even be contradictory. The DSS has three components that support the following operations related to decision-making within a VBE: warning of an organization’s lack of performance, warning related to the VBE’s competency gap, and warning of an organization’s low level of trust.

VO information management system (VIMS): It supports the VBE administrator and other stakeholders with the management of information related to the creation stage of the VOs within the VBE, storing summary records related to the measurement of performance during the VO’s operation stage, and recording and managing the information and knowledge gathered from dissolved VOs, which constitute the means to handle and access inheritance information.

Bag of assets management system (BAMS): It provides services for the management and provision of fundamental VBE information, such as guidelines, bylaws, value system guidelines, incentives information, rules and regulations, etc. It also supports the VBE members with publishing and sharing some of their “assets” of common interest with other VBE members, e.g. valuable data, software tools, lessons learned, etc.

Support institution management system (SIMS): The support institutions in VBEs are of two kinds.
The first kind refers to those organizations that join the VBE to provide/market their services to VBE members. These services include advanced assisting tools to enhance the VBE members’ readiness to collaborate in VOs. They can also provide services to assist the VBE members with their daily operation, e.g. accounting and tax, training, etc. The second kind refers to organizations that join the VBE to assist it with reaching its goals, e.g. ministries, sector associations, chambers of commerce, environmental organizations, etc. SIMS supports the management of the information related to the activities of support institutions inside the VBEs.

Collaboration Opportunity Identification and Characterization (coFinder): This tool assists the opportunity broker in identifying and characterizing a new Collaboration Opportunity (CO) in the market/society that will trigger the formation of a new VO within the VBE. A collaboration opportunity might be external, initiated by a customer and brokered by a VBE member that is acting as a broker. Some opportunities might also be generated internally, as part of the VBE’s development strategy.

CO characterization and VO’s rough planning (COC-plan): This tool supports the planner of the VO with developing a detailed characterization of the resources and capacities needed for the CO, as well as with the formation of a rough structure for the
potential VO, thereby identifying the types of competencies and capacities required from the organizations that will form the VO.

Partners search and suggestion (PSS): This tool assists the VO planner with the search for and proposal of one or more suitable sets of partners for VO configurations. The tool also supports an analysis of different potential VO configurations in order to select the optimal formation.

Contract negotiation wizard (WizAN): This tool supports the VO coordinator in involving the selected VO partners in the negotiation process, and in their agreeing on and committing to their participation in the VO. The VO is launched once the needed agreements have been reached and the contracts have been established and electronically signed.

A summary of the main challenges related to managing information in CNs is presented in section 1.1, where several requirements are addressed in relation to different aspects and components of CNs. The next three sections provide details on the information management aspects of three of the above functionalities and address their sub-systems, namely ontology engineering, the management of profiles and competencies, and the management of trust in this type of CN.
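The partner search step described for PSS can be illustrated with a small sketch. This is not the PSS algorithm, which is not specified here; it assumes, for illustration, that each organization is described by a plain set of competency labels and that an exhaustive search over small partner sets is acceptable. All names and data are hypothetical.

```python
from itertools import combinations


def suggest_partner_sets(required, candidates, max_partners=3):
    """Return the smallest partner sets whose pooled competencies cover `required`.

    `candidates` maps an organization name to its set of competencies.
    Exhaustive search is an assumption that only suits small VBEs.
    """
    required = set(required)
    names = sorted(candidates)
    for size in range(1, max_partners + 1):
        matches = [
            combo
            for combo in combinations(names, size)
            if required <= set().union(*(candidates[n] for n in combo))
        ]
        if matches:
            return matches  # stop at the smallest size that covers everything
    return []


# Hypothetical member data for illustration.
candidates = {
    "Org-A": {"milling", "welding"},
    "Org-B": {"design"},
    "Org-C": {"welding", "design"},
}
suggestions = suggest_partner_sets({"milling", "design"}, candidates)
```

Here no single organization covers both required competencies, so the sketch returns the two-member sets ("Org-A", "Org-B") and ("Org-A", "Org-C"). A real tool would additionally rank such configurations, for example by trust level or past performance.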
5 VBE-Ontology Specification and Management

Ontologies are increasingly applied in different areas of research and development; for example, they are effectively used in artificial intelligence, the semantic web, software engineering, biomedical informatics, and library science, among many others, as a means for representing knowledge about their environments. Therefore, a wide variety of tasks related to processing information/knowledge is supported through the specification of ontologies. Examples of these tasks include natural language processing, knowledge management, geographic information retrieval, etc. [35]. This section introduces an ontology developed for VBEs that aims to address a number of challenging requirements for the modeling and management of VBE information. Section 5.1 first presents the challenges being addressed, and sections 5.2 and 5.3 then present two specific ontology-based solutions.

5.1 Challenges for VBE Information Modeling and Management

The second generation VBEs must handle a wide variety of types of information related to both their constituents and their required daily operations and activities. Therefore, these networks must handle and maintain a broad set of concepts and entities to support the processing of a large set of functionalities. Among others, complexity, dynamism, and scalability requirements can be identified as characteristics of VBEs with respect to the following aspects: (i) autonomous, geographically distributed stakeholders, (ii) a wide range of running management functionalities and support systems, and (iii) diverse domains of activities and application environments. The analysis of several 1st generation VBEs in different domains has shown that the development of an ontology for VBEs can address the following main requirements:
Establishing common understanding in VBEs. Common understanding of the general as well as the domain-related VBE concepts is the base requirement for modelling and management of information/knowledge in different VBE functionalities. To facilitate interoperability and smooth collaboration, all VBE stakeholders must use the same definitions and have the same understanding of the different aspects and concepts applied in the VBE, including VBE policies, membership regulations, working/sharing principles, VBE competencies, performance measurement criteria, etc. There is still a lack of consensus on common and coherent definitions and terminology addressing the generic VBE structure and operations [36]. Therefore, identification and specification of a common generic VBE terminology, as well as development of a common semantic subspace for VBE information/knowledge, is challenging.

VBE instantiation in different domains. New VBEs are being created and operated in a wide range of domains and application environments, from the provision of healthcare services and product design and manufacturing to the management of natural disasters, biodiversity, and scientific virtual laboratory experimentation in physics or biomedicine, among others. Clearly, each domain/application environment has its own features, culture, terminology, etc. that shall be considered and supported by the VBEs’ management systems. During the VBE’s creation stage, parameterization of its management system with both the generic VBE characteristics and the specific domain-related and application-related characteristics is required. Furthermore, at the creation stage of the VBE, several databases need to be created to support the storage and manipulation of the information/knowledge handled by the different sub-systems. Design and development of these databases shall be achieved together with experts from the domain, as it requires knowledge about complex application domains.
Therefore, development of approaches for speeding up and facilitating the instantiation and adaptation of VBEs to different domains / areas of activity is challenging.

Supporting dynamism and scalability in VBEs. Frequent changes in the market and society, such as the emergence of new types of customer demands or new technological trends, drive VBEs to work in a very dynamic manner. Supporting the dynamic aspects of VBEs requires that the VBE management system is equipped with functionalities that support human actors in handling the necessary changes in the environment. The VBE’s information is therefore required to be processed dynamically by semi-automated, reusable software tools. As such, the variety of VBE information and knowledge must be categorized and formally specified. However, there is still a lack of such formal representations and categorizations. Therefore, formal modelling and specification of VBE information, as well as development of semi-automated approaches for speeding up VBE information processing, is challenging.

Responding to the above challenges through the provision of innovative approaches, models, mechanisms, and tools represents the main motivation for the research addressed in this section. The following conceptual and developmental approaches together address the above three challenges:

• Conceptual approach - unified ontology: The unified ontology for VBEs, further referred to as the VBE-ontology, is described as follows [13]:
VBE-ontology is a form of unified and formal conceptual specification of the heterogeneous knowledge in VBE environments to be easily accessed by and communicated between human and application systems, for the purpose of VBE knowledge modelling, collection, processing, analysis, and evolution.

Specifically, the development of the unified VBE-ontology responds to the challenge of common understanding as follows: (i) it represents the definitions of all VBE concepts and the relationships among them within a unified ontology that establishes the common semantic subspace for VBE knowledge; (ii) it introduces linguistic annotations such as synonyms and abbreviations to address the problem of varied names for concepts; (iii) through sharing the VBE-ontology within and among VBEs, it supports reusing common concepts and terminology.

The VBE-ontology addresses the challenge of VBE instantiation as follows: (1) the ontological representation of VBE knowledge is semi-automatically convertible/transferable to database schemas [37], supporting the semi-automated development of the needed VBE databases during the VBE creation stage; (2) pre-defined domain concepts within the VBE-ontology support the semi-automated parameterization of generic VBE management tools, e.g. PCMS, TrustMan, etc.

The developed VBE-ontology responds to the challenge of VBE dynamism and scalability in the following way: (a) formal representation of the knowledge in the VBE-ontology facilitates semi-automated processing of this knowledge by software tools; (b) the ontology itself can be used to support semi-automated knowledge discovery from text-corpora [38].

• Developmental approach - ontology discovery and management system: In order to benefit from the VBE-ontology specification, a number of ontology engineering and management functionalities are developed on top of the VBE-ontology [39].
Namely, the ontology engineering functionalities support the discovery and evolution of the VBE-ontology itself, while the ontology management functionalities support VBE stakeholders in learning about VBE concepts, preserve the consistency of the VBE databases and domain parameters with the VBE-ontology, and perform semi-automated information discovery. These functionalities are specified and developed within one system, called the Ontology Discovery and Management System (ODMS) [39].

The ODMS plays a special role in the functional dimension of the CN reference model (as addressed in section 4). Unlike the other information management sub-systems addressed in section 4.1, e.g. profile and competency management, trust management, etc., the ODMS aims at managing not only the real information/data of the VBE, but also the ontological representation of its conceptual aspects, namely the meta-data. Precisely, this sub-system aims to support the mapping of the information handled in other VMS sub-systems to its generic meta-models. This mapping supports consistency between the portions of information accumulated by the different VMS sub-systems and their models. It also supports preserving the semantics of the information, which is the first step towards the development of semi-automated and intelligent approaches for information management. The remainder of this section describes the above two approaches in more detail.
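The role of linguistic annotations (synonyms and abbreviations) in establishing common terminology and in supporting discovery from text can be illustrated by the following minimal sketch. It assumes a very simple in-memory representation of concepts and a plain label-matching strategy; the class, the function names, the placeholder definitions, and the sample sentence are hypothetical and are not taken from the VBE-ontology or the ODMS.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept with the meta-properties named in the text."""
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    abbreviations: list = field(default_factory=list)

    def labels(self):
        return [self.name, *self.synonyms, *self.abbreviations]


def build_index(concepts):
    """Map every lower-cased label to its canonical concept name."""
    index = {}
    for concept in concepts:
        for label in concept.labels():
            index[label.lower()] = concept.name
    return index


def find_concepts(text, index):
    """Return canonical concept names mentioned in `text`, by first mention."""
    found = []
    # Try longer labels first so multi-word labels win over shorter overlaps.
    labels = sorted(index, key=len, reverse=True)
    pattern = "|".join(re.escape(label) for label in labels)
    for match in re.finditer(rf"\b(?:{pattern})\b", text.lower()):
        name = index[match.group(0)]
        if name not in found:
            found.append(name)
    return found


# Placeholder definitions; only the concept names come from the text.
concepts = [
    Concept("Virtual Organization", "(placeholder definition)",
            synonyms=["virtual organisation"], abbreviations=["VO"]),
    Concept("VBE member", "(placeholder definition)",
            synonyms=["member organization"]),
]
index = build_index(concepts)
mentions = find_concepts(
    "Each member organization may join a VO created inside the VBE.", index
)
```

Both the synonym "member organization" and the abbreviation "VO" resolve to their canonical concepts, so differently worded sources map onto the same terminology.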
5.2 VBE-Ontology

To define the scope of the VBE-ontology, the VBE information and knowledge are first characterised and categorised. The following two main characteristics of the VBE information/knowledge are used to categorise them:

• Reusable VBE information at different levels of abstraction: In order to respond to the challenge of common understanding, the VBE-ontology is primarily addressed at three levels of abstraction, called here “concept-reusability levels”, which refer to the reusability of the VBE information at the core, domain, and application levels (see Fig. 4). The core level constitutes the concepts that are generic for all VBEs, for example “VBE member”, “Virtual Organization”, “VBE competency”, etc. The domain level has a variety of “exemplars” – one for each specific domain or business area. Each domain level constitutes the concepts that are common only to those VBEs that operate in that domain or sector. Domain level concepts constitute the population of the core concepts into a specific VBE domain environment. For example, the core “VBE competency” concept can be populated with “Metalworking competency” or “Tourism competency”, depending on the domain. The application level includes a larger number of exemplars – one for each specific VBE application within each domain. Each application level constitutes the concepts that are common to that specific VBE and cannot be reused by other VBEs. Application level concepts mainly constitute the population of the domain level concepts into one specific VBE application environment. The levels of reusability also include one very high level called the meta level. This level represents a set of high-level meta-properties, such as “definition”, “synonym”, and “abbreviation”, used for the specification of all concepts from the other three levels.
• Reusable VBE information in different work areas: In order to respond to the challenge of VBE creation in different domains and the challenge of VBE dynamism and scalability, the concepts used by the different VBE management functionalities, as addressed in the functional dimension of the CN reference model, should be addressed in the VBE-ontology. The VBE-ontology therefore supports both the development of the databases for VBE functionality-related data and the semi-automated processing of the information related to these functionalities. Additionally, addressing these concepts in the VBE-ontology responds to the challenge of common understanding for these functionalities. Following the approach addressed in [40] for the AIAI enterprise ontology, ten different “work areas” are identified for VBEs and their management (see Fig. 4). Each work area focuses on a set of interrelated concepts that are typically associated with a specific VBE document repository and/or a specific VBE management functionality, such as the Membership Management functionality, the management of the Bag of Assets repository, Profile and Competency management, and Trust management, as addressed in section 4.1. These work areas are complementary, and each of them has some concepts that it shares with some other work areas. In addition, while extensive attention has been spent on the design of these ten work areas, it is clear that more work areas can be defined and added to the VBE-ontology in future. Additionally, each of the ten work areas can be further split into smaller work areas depending on the details they need to capture. For example, within the Profile and Competency work area, the Competency work area can be separated from the Profile work area.

The introduced structure of the VBE-ontology represents (a) the embedding of the “horizontal” reusability levels and (b) their intersection with the “vertical” work areas. The horizontal reusability levels are embedded in each other hierarchically. Namely, the core level includes the meta level, as illustrated in Fig. 4. Furthermore, every domain level includes the core level. Finally, every application level may include a set of domain levels (i.e. those related to this VBE’s domains of activity). The work areas are presented vertically, and thus intersect with the core level, domain level, and application levels, but not with the meta level, which consists of the meta-data applicable to all other levels. The cells resulting from the intersection of the reusability levels and the work areas are further called sub-ontologies. The structure of the VBE-ontology is also illustrated in Fig. 4. In particular, this figure shows how the intersection of the horizontal core level and the vertical VBE profile and competency work area results in the core level profile and competency sub-ontology.

The idea behind the sub-ontologies is to apply the divide-and-conquer principle to the VBE-ontology in order to simplify coping with its large size and wide variety of aspects. Furthermore, sub-ontologies represent the minimal physical units of the VBE-ontology, i.e. physical ontology files on a computer, while the VBE-ontology itself shall be compiled out of its physical sub-ontologies according to its logical structure. Sub-ontologies also help to cope with the evolution of different VBE information. Whenever a new piece of information needs to be introduced in the VBE-ontology, only the relevant new sub-ontology for that information needs to be specified within the VBE-ontology. Typically, when a new VBE is established, it does not need to adopt the entire VBE-ontology. Rather, it should build its own “application VBE-ontology” out of the related sub-ontologies, using them as “construction bricks”. Thus, the design of the VBE-ontology also provides a solution to the technical question about the differences in the information accumulated by different VBE applications.

[Figure 4: grid in which the horizontal levels of abstraction (Meta, Core, Domain, Application) intersect the ten vertical work areas: VBE-self, VBE actor / participant, Virtual organization, VBE profile and competency, VBE history, VBE bag of assets, VBE trust, VBE governance, VBE value systems, and VBE management system. Highlighted examples: the “Trust management” work area and the core level “profile and competency” sub-ontology.]
Fig. 4. Structure of the VBE-ontology consisting of sub-ontologies
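The way an “application VBE-ontology” is compiled out of sub-ontologies can be sketched as follows. This is a minimal illustration under stated assumptions: sub-ontologies are modelled as in-memory records rather than physical ontology files, the `scope` field and the selection rule are hypothetical, and the concept “Trust criterion” as well as the application name are invented for the example. The level names, work-area names, and competency concepts are taken from the text and Fig. 4.

```python
from dataclasses import dataclass

# The meta level is omitted: it applies to all other levels.
LEVELS = ("core", "domain", "application")


@dataclass(frozen=True)
class SubOntology:
    level: str           # one of LEVELS
    work_area: str       # e.g. "VBE profile and competency"
    scope: str           # "generic", a domain name, or an application name
    concepts: frozenset


def compile_application_ontology(registry, work_areas, domains, application):
    """Select the sub-ontologies an application VBE-ontology is built from."""
    selected = []
    for sub in registry:
        if sub.work_area not in work_areas:
            continue
        if (
            sub.level == "core"
            or (sub.level == "domain" and sub.scope in domains)
            or (sub.level == "application" and sub.scope == application)
        ):
            selected.append(sub)
    # Order from the most generic level to the most specific one.
    return sorted(selected, key=lambda s: LEVELS.index(s.level))


core_pc = SubOntology("core", "VBE profile and competency", "generic",
                      frozenset({"VBE competency"}))
metal_pc = SubOntology("domain", "VBE profile and competency", "metalworking",
                       frozenset({"Metalworking competency"}))
tourism_pc = SubOntology("domain", "VBE profile and competency", "tourism",
                         frozenset({"Tourism competency"}))
core_trust = SubOntology("core", "VBE trust", "generic",
                         frozenset({"Trust criterion"}))  # hypothetical concept

bricks = compile_application_ontology(
    [tourism_pc, core_trust, metal_pc, core_pc],
    work_areas={"VBE profile and competency"},
    domains={"metalworking"},
    application="hypothetical-vbe",
)
```

For a metalworking VBE that only needs the profile and competency work area, the sketch selects the core sub-ontology followed by the metalworking one, and leaves the tourism and trust sub-ontologies aside.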
A partial screenshot of the sub-ontology developed for the core level of the profile and competency information is shown in Fig. 5, and is further addressed in section 6.
Fig. 5. Partial screenshot of the VBE profile and competency sub-ontology (at the core level)
5.3 ODMS Subsystem Functionalities

The ODMS (Ontology Discovery and Management System) functionalities aim to assist the main information management processes and operations that take place throughout the entire life-cycle of a VBE. They include both ontology engineering functionalities, which are needed for maintaining the VBE-ontology itself, and ontology management functionalities, which are needed to support VBE information management. The five functionalities specified for the ODMS are:

• Sub-ontology registry: In order to maintain the sub-ontologies of the VBE-ontology, this functionality, rooted in [41; 42], aims at uploading, registering, organizing, and monitoring the collection of sub-ontologies within an application VBE-ontology. In particular, it aims at grouping and re-organizing sub-ontologies for further management, partitioning, integration, mapping, and versioning.

• Sub-ontology modification: This functionality aims at the manual construction and modification of sub-ontologies. In particular, it has an interface through which users can introduce new concepts and add definitions, synonyms, abbreviations, properties, associations, and inter-relationships for the existing concepts. The concepts in sub-ontologies are both represented in a textual format and visualized through graphs or diagrams.

• Sub-ontology navigation: This functionality aims at familiarising VBE members with the VBE terminology and concepts, thus addressing the challenge of common understanding. In order to view the terminology, the VBE members first select a
specific sub-ontology from the registry. The concepts in sub-ontologies are again both represented in a textual format and visualized through graphs or diagrams.

• Repository evolution: This functionality supports the establishment and monitoring of consistency between the VBE database schemas (as well as their content in some cases) and the related sub-ontologies, and thus addresses the challenge of VBE instantiation in different domains. In response to this challenge, the VBE databases shall be developed semi-automatically, guided by the VBE-ontology. Several approaches for the conversion of sub-ontologies into database schemas suggest the creation of a map between an ontology and a database schema [37]. This map later supports monitoring the consistency between the ontology and the database schema. Specifically, this functionality aims to indicate whether the database schemas need to be updated after changes to the VBE-ontology.

• Information discovery: This functionality, rooted in [38], aims at the semi-automated discovery of information from text-corpora, based on the VBE-ontology, which addresses the challenge of VBE dynamism and scalability. In particular, the information discovery functionality supports the discovery of relevant information about the VBE member organizations in order to augment the current VBE repositories. The text-corpora used by this functionality can include semi-structured sources (e.g. HTML pages) or unstructured sources (e.g. brochures). These are typically provided by the VBE member organizations.
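The consistency check behind the repository evolution functionality can be illustrated with a small sketch. It is not the ODMS mechanism or the mapping approach of [37]: it assumes, for illustration, that a sub-ontology is reduced to a mapping from concepts to property names, that each concept maps to one table and each property to one column, and that identifiers are derived by a simple naming rule. The property names and function names are hypothetical.

```python
def expected_schema(sub_ontology):
    """Derive {table: set(columns)} from {concept: [properties]}.

    The naming rule (lower-case, spaces to underscores) is an assumption.
    """
    def to_identifier(label):
        return label.strip().lower().replace(" ", "_")

    return {
        to_identifier(concept): {to_identifier(p) for p in properties}
        for concept, properties in sub_ontology.items()
    }


def schema_drift(sub_ontology, db_schema):
    """List the tables and columns the database lacks relative to the ontology."""
    missing = {}
    for table, columns in expected_schema(sub_ontology).items():
        absent = columns - db_schema.get(table, set())
        if table not in db_schema or absent:
            missing[table] = sorted(absent)
    return missing


# Hypothetical sub-ontology content and current database schema.
sub_ontology = {
    "VBE member": ["name", "legal status"],
    "Competency": ["name", "capacity"],
}
db_schema = {"vbe_member": {"name"}}

drift = schema_drift(sub_ontology, db_schema)
```

In this example the check reports that the `vbe_member` table lacks a `legal_status` column and that the `competency` table is missing altogether, which is the kind of indication that the schema needs to be updated after an ontology change.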
6 Profile and Competency Modeling and Management
To support both the proper cooperation among the VBE members and the fluid configuration and creation of VOs in the 2nd generation VBEs, it is necessary that all VBE members are characterised by uniformly formatted “profiles”. This requirement is especially pressing in the case of medium- to large-size VBEs, where the VBE administration and coaches have less of a chance to get to know each VBE member organization directly. As such, profiles shall contain the most important characteristics of VBE members (e.g. their legal status, size, area of activity, annual revenue, etc.) that are necessary for performing fundamental VBE activities, such as search for and suggestion of best-fit VO partners, VBE performance measurement, VBE trust management, etc. The VBE “competencies” represent a specific part of the VBE member organizations’ profiles that is intended to be used directly for VO creation activities. Competency information about organizations is exactly what the VO broker and/or planner needs to retrieve in order to determine what an organization can offer for a new VO. This section first addresses the specific tasks in VBEs that require handling of profiles and competencies, and then presents the two complementary solution approaches developed for these tasks.
6.1 Tasks Requiring Profile and Competency Modeling and Management
In the 2nd generation VBEs, characteristic information about all VBE members should be collected and managed in order to support the following four tasks [43].
• Creation of awareness about potentials inside the VBE. In order to successfully cooperate in the VBE and further successfully collaborate in VOs, the VBE members need to become familiar with each other. In small-size VBEs, e.g. with fewer than 30 members, VBE members may typically have the chance to get to know each other directly. However, this becomes increasingly difficult and even impossible in geographically dispersed medium-size and large-size VBEs (e.g. with 100-200 members). Thus uniformly organizing the VBE members’ information, e.g. to represent the members’ contact data, industry sector, vision, role in the VBE, etc., is a critical instrument supporting awareness of the VBE members about each other.
• Configuration of new VOs. The information about the VBE member organizations needs to be accessed by both the human individuals and the software tools assisting the VO broker / VO planner in order to suggest configuration of the VOs with best-fit partners. Therefore the information about members’ qualifications, resources, etc. that can be offered to a new VO needs to be structured and represented in a uniform format.
• Evaluation of members by the VBE administration. At the stage of evaluating the member applicants and also during the VBE members’ participation in the VBE, the VBE administration needs to evaluate the members’ suitability for the VBE. The members’ information is also needed for automated assessment of members’ collaboration readiness, trustworthiness, and performance, supported by software tools.
• Introduction / advertising of the VBE in the market / society. Another reason for collection and management of VBE members’ information is to introduce / advertise the VBE to the outside market / society. Summarized information about the registered VBE members can be used to promote the VBE towards potential new customers and thus towards new collaborative opportunities.
Collection of the members’ information in a unified format especially supports harmonising/adapting heterogeneities among VBE members, which represents one requirement for substantiation of a common collaboration space for CNs, as addressed in section 2.2.3. The profile of a VBE member organization represents a separate uniformly formatted information unit, and is defined as follows:
The VBE member organization’s profile consists of the set of determining characteristics (e.g. name, address, capabilities, etc.) about each organization, collected in order to facilitate the semi-automated involvement of each organization in some specific line of activities / operations in the VBE that are directly or indirectly aimed at VO creation.
An important part of the profile information represents the organization’s competency, which is defined as follows:
Organizations’ competencies in VBEs represent up-to-date information about their capabilities, capacities, costs, as well as conspicuities, illustrating the accuracy of their provided information, all aimed at qualifying organizations for VBE participation, and mostly oriented towards their VO involvement.
The remainder of this section addresses two solution approaches, a conceptual one and a developmental one, that together address the above tasks.
6.2 Profile and Competency Models
The main principle used for definition of the unified profile structure is identification of the major groups of the organization’s information. The identified categories of profile information are as follows:
1. VBE-independent information includes those characteristics of the organization that are independent of its involvement in any collaborative and cooperative consortia.
2. VBE-dependent information includes those characteristics of the organization that depend on its involvement in collaborative and cooperative consortia within the VBEs, VOs, or other types of CNs.
3. Evidence documents are required to represent the indication / proof of validity of the profile information provided by the organizations in the two previous categories. Evidence can be either an on-line document or some web-accessible information, e.g. the organization’s brochures, web-site, etc.
The above-mentioned four tasks can then be addressed by the profile model, as follows:
• Creation of awareness about potentials inside the VBE: addressed through basic information about name, foundation date, location, size, area of activity, and a general textual description of the organization.
• Configuration of new VOs: handled through name, size, contact information, competency information (addressed below), and financial information.
• Evaluation of members by the VBE administration: addressed through records about past activities of organizations, including past collaboration/cooperation activities, as well as produced products/services and applied practices.
• Introduction / advertising of the VBE in the market / society: addressed through aggregation of characteristics such as locations, competencies, and the past history of its achievements.
The resulting profile model is presented in Fig. 6.
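The three categories of profile information can be sketched as a simple data structure. This is only an illustration under stated assumptions: the class names MemberProfile and Evidence, the field names, and the helper unsupported_attributes are hypothetical, and the actual profile model is the one given in Fig. 6.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    description: str
    location: str  # e.g. URL of an on-line document (hypothetical field)


@dataclass
class MemberProfile:
    # Category 1: VBE-independent information (name, size, legal status, ...)
    independent: dict[str, str] = field(default_factory=dict)
    # Category 2: VBE-dependent information (role in the VBE, past VOs, ...)
    dependent: dict[str, str] = field(default_factory=dict)
    # Category 3: evidence documents, keyed by the attribute they support
    evidence: dict[str, list[Evidence]] = field(default_factory=dict)

    def unsupported_attributes(self):
        """Attributes from categories 1 and 2 that have no evidence attached."""
        claimed = set(self.independent) | set(self.dependent)
        return sorted(a for a in claimed if not self.evidence.get(a))


if __name__ == "__main__":
    profile = MemberProfile(
        independent={"name": "Org A", "size": "50"},
        dependent={"vbe_role": "member"},
        evidence={"name": [Evidence("brochure", "https://example.org/brochure.pdf")]},
    )
    print(profile.unsupported_attributes())
```

Keeping evidence keyed by attribute is one possible design choice; it makes it straightforward to list which claimed characteristics still lack an indication of validity.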
The main objective of the competency model for VBE member organizations, which is called the “4C-model of competency”, is the “promotion of the VBE member organizations towards their participation in future VOs”. The main technical challenge for competency modelling is the unification of existing organizations’ competency models, e.g. as addressed by [44; 45]. Although these competency models were developed for purposes other than the 2nd generation VBE, some of their aspects can be applied to VBE members’ competencies [21]. However, the main principle for specification of the competency model is to organize the different competency-related aspects. These are further needed to search for VBE members that best fit the requirements of an emerged collaboration opportunity. The resulting 4C competency model is unified and has a compound structure. The primary emphasis of this model is on the following four components, which were identified through our experimental study as necessary and sufficient:
1. Capabilities represent the capabilities of organizations, e.g. their processes and activities. When collective business processes are modelled for a new VO, the VO planner has to search for specific processes or activities that can be performed by different potential organizations, in order to instantiate the model.
Fig. 6. Model of the VBE member’s profile
2. Capacities represent free availability of resources needed to perform each capability. Specific capacities of organizations are needed to fulfil the quantitative values of capabilities, e.g. the number of production units per day. If the capacity of members for a specific capability in the VBE is not sufficient to fulfil market opportunities, another member (or a group of members) with the same capability may be invited to the VBE.
3. Costs represent the costs of provision of products/services in relation to each capability. They are needed to estimate whether inviting a specific group of members to a VO would exceed the planned VO budget.
4. Conspicuities represent means for validating the information provided by the VBE members about their capabilities, capacities and costs. The conspicuities in VBEs mainly include certified or witnessed documents, such as certifications, licenses, recommendation letters, etc.
An illustration of the generic 4C-model of competency, applicable to all varieties of VBEs, is given in Fig. 7.
6.3 PCMS Subsystem Functionalities
Based on the objective and identified requirements, PCMS (Profile and Competency Management System) supports the following four main functionalities.
• Model customization: This functionality aims at management of profile and competency models within a specific VBE application. The idea for this functionality is to support the customization of the VBE for a specific domain of activity or a specific application environment. Prior to performing the profile and competency management at the VBE creation stage, the profile and competency models need to be specified and customized.
Fig. 7. Generic 4C-model of competency
• Data submission: This functionality supports uploading of profile and competency knowledge from each member organization. An approach for incremental submission of data is developed for the PCMS. This approach especially supports uploading of large amounts of data. To support the dynamism and scalability of PCMS, the advanced ODMS mechanism for ontology-based information discovery is applied.
• Data navigation: This functionality needs to be extensive in the PCMS. It supports different ways of retrieving and viewing the profile and competency knowledge accumulated in the VBE. The navigation scope addresses both single profile information and the collective profile information of the entire VBE. Structuring of the knowledge in the PCMS’s user interface mimics the VBE profile and competency sub-ontology.
• Data analysis: PCMS shall collect the competency data and analyze it in order to evolve the VBE’s collection of competencies for addressing more opportunities in the market and society. A number of analysis functions are specified for the PCMS including: data validation, retrieval and search, gap analysis, and development of new competencies.
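As an illustration of how the 4C components could feed the gap analysis function mentioned above, the following minimal sketch pools the free capacity offered per capability and reports shortfalls against a demanded amount. The class Competency, its field names, and the function capacity_gaps are assumptions made for this example; the source does not specify how PCMS computes gaps.

```python
from dataclasses import dataclass, field


@dataclass
class Competency:
    capability: str          # a process or activity the member can perform
    capacity_per_day: float  # free capacity, in units per day
    cost_per_unit: float     # cost of providing one unit
    conspicuities: list[str] = field(default_factory=list)  # e.g. certificates


def capacity_gaps(competencies, demand):
    """Return {capability: shortfall} where pooled free capacity is too low.

    competencies: iterable of Competency records from all VBE members
    demand:       capability -> required units per day
    """
    pooled = {}
    for c in competencies:
        pooled[c.capability] = pooled.get(c.capability, 0.0) + c.capacity_per_day
    return {
        cap: needed - pooled.get(cap, 0.0)
        for cap, needed in demand.items()
        if pooled.get(cap, 0.0) < needed
    }


if __name__ == "__main__":
    members = [
        Competency("welding", 40.0, 12.5, ["ISO certificate"]),
        Competency("welding", 30.0, 11.0),
        Competency("painting", 10.0, 4.0),
    ]
    print(capacity_gaps(members, {"welding": 100.0, "painting": 10.0, "assembly": 5.0}))
```

A non-empty result would correspond to the situation described for capacities, where another member with the same capability may need to be invited, or a new competency developed.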
7 Modeling and Management of Trust in VBEs
Traditionally, trust among organizations involved in collaborative networks was established both “bi-laterally” and “subjectively” based on reputation. In large networks, however, particularly those with geographical dispersion such as many VBEs, trust issues are sensitive at the network level and need to be reasoned/justified, for example when applied to the selection of best-fit organizations among several competitors [18]. Thus in VBEs, analysis of inter-organizational trust is a functionality supported
by the VBE administration, which needs to apply fact-based data, such as organizations’ current standing and performance data, for its assessment. Thus, a variety of strategic information related to trust aspects must be collected from VBE actors (applying pull/push mechanisms), then modeled, classified, and stored prior to assessing the level of trust in VBE organizations. Furthermore, in order to identify the common set of base trust-related criteria for organizations in the VBE, as briefly addressed in section 2.2.3, relevant elements for each specific VBE must be determined. These trust criteria together with their respective weights constitute the threshold for assessment of an organization’s level of trust in VBEs. In the past, manual ad-hoc approaches were applied to the manipulation and processing of organizations’ trust-related information. This section addresses the development of the Trust Management (TrustMan) subsystem of the VBE and describes its services supporting the rational assessment of the level of trust in organizations.
7.1 Requirements for Managing Trust-Related Information
Objectives for establishing trust may change with time, which means that the information required to support the analysis of the trust level of organizations will also vary with time. As addressed in [18], a main aim for management of trust in VBEs is to support the creation of trust among VBE member organizations. The introduced approach to support inter-organizational trust management applies the information related to an organization’s standing as well as its past performance data, in order to determine its rational trust level in the VBE. Thus organizations’ activities within the VBE and their participation in configured VOs are relevant to be assessed.
Four main information management requirements are identified which need to be addressed for supporting management of trust-related information in VBEs, as follows:
Requirement 1 – Characterization of a wide variety of dynamic trust-related information: The information required to support the establishment of trust among organizations is dynamic, since depending on the specific objective(s) for which the trust must be established, the needed information may change with time, and these changes cannot be predicted. Therefore, characterization of relevant trust-related information needed to support the creation of trust among organizations, for every trust objective, is challenging.
Requirement 2 – Classification of information related to different trust perspectives: As stated earlier, analysis of trust in organizations within VBEs shall rely on fact-based data and needs to be performed rationally. For this purpose some measurable criteria need to be identified and classified. The identification and classification of a comprehensive set of trust criteria for organizations is challenging, especially when considering different perspectives of trust.
Requirement 3 – Processing of trust-related information to support trust measurement: In the introduced approach, the trust in organizations is measured rationally using fact-based data. For this purpose, formal mechanisms must be developed using a set of relevant trust criteria. In addition to measuring trustworthiness of organizations, the applied mechanisms should support fact-based reasoning about the results
based on the standing and performance of the organizations. The development of such trust-related information processing mechanisms is challenging.
Requirement 4 – Provision of services for analysis and measurement of trust: Measurement of trust in organizations involves the computation of fact-based data using complex mechanisms that may need to be performed in distributed and heterogeneous environments. Development of services to manage and process trust-related information for facilitating the analysis of trust is challenging.
7.2 Approaches for Managing Trust-Related Information of Organizations
Below we propose some approaches to address the four requirements presented above. We address the establishment of a generic pool of trust criteria, the identification and modeling of trust elements, the formulation of mechanisms for analyzing inter-organizational trust, and the design of the TrustMan system.
7.2.1 Approach for Establishment of a Pool of Generic Concepts and Elements
Solutions such as specialized models, tools or mechanisms developed to support the management of trust within “application specific VBEs” or within “domain specific VBEs” are difficult to replicate, adapt and reuse in different environments. Therefore, there is a need to develop a generic pool of concepts and elements that can be customized for every specific VBE.
Generic set of trust criteria for organizations: In the introduced approach a large set of trust criteria for VBE organizations is identified and characterized through applying the HICI methodology and mechanisms (Hierarchical analysis, Impact analysis and Causal Influence analysis) [46]. The identified trust elements for organizations are classified in a generalization hierarchy as shown in Fig. 8. Trust objectives and the five identified trust perspectives are generic, cover all possible VBEs, and do not change with time.
A set of trust requirements and trust criteria can be identified at the VBE dynamically, and this set changes with time. Nevertheless, a base generic set of trust requirements and trust criteria has so far been established that can be expanded/customized depending on the VBE. Fig. 8 presents an example set of trust criteria for the economical perspective.
7.2.2 Approach for Identification, Analysis and Modeling of Trust Elements
To properly organize and inter-relate all trust elements, an innovative approach was required. The HICI approach proposed in [46] comprises three stages, each one focusing on a specific task related to the identification, classification and interrelation of trust criteria related to organizations. The first stage, called the Hierarchical analysis stage, focuses on the identification of types of trust elements and on classifying them through a generalization hierarchy based on their level of measurability. This classification makes it possible to understand what values can be measured for the trust-related elements, which in turn supports the decision on what attributes need to be included in the database schema. A general set of trust criteria is presented in [18] and exemplified in Fig. 8.
[Figure 8 content: a generalization hierarchy with the levels trust perspective, trust requirements and trust criteria. Top-level node: Creating trust among organizations. Trust perspectives: structural, managerial, economical, social, technological. Trust requirements: capital, financial stability, VO financial stability, financial standards. Trust criteria: cash capital, physical capital, material capital, cash in, cash out, net gains, operational cost, VO cash in, VO cash out, VO net gains, auditing standards, auditing frequency.]
Fig. 8. An example set of trust criteria for organizations
The second stage, called the Impact analysis stage, focuses on the analysis of the impacts of changes in values of trust criteria on the trust level of organizations. This makes it possible to understand the nature and frequency of change of values of trust criteria, in order to support the decision regarding the frequency of updates for the trust-related information. The third stage, called the Causal Influence analysis stage, focuses on the analysis of causal relations between different trust criteria as well as between the trust criteria and other VBE environment factors, such as the known factors within the VBE and intermediate factors, which are defined to link the causal relations among all trust criteria and known factors. The results of causal influence analysis are applied to the formulation of mechanisms for assessing the level of trust in each organization, as addressed below.
7.2.3 Approach for Intensive Modeling of Trust Elements’ Interrelationships to Formulate Mechanisms for Assessing Trust Level of Organizations
Considering the need for assessing the trust level of every organization in the VBE, a wide range of trust criteria may be considered for evaluating organizations’ trustworthiness. In the introduced approach, trust is characterized as a multi-objective, multi-perspective and multi-criteria subject. As such, trust is not a single concept that can be applied to all cases for trust-based decision-making [47], and its measurement for each case depends on the purpose of establishing a trust relationship, the preferences of the VBE actor who constitutes the trustor in the case, and the availability of trust-related information from the VBE actor who constitutes the trustee in the case [18]. In this
respect, the trust level of an organization can be measured rationally in terms of the quantitative values available for related trust criteria, e.g. rooted in past performance. Therefore, from analytic modeling, formal mechanisms can be deduced for rational measurement of organizations’ trust level [18, 48]. These mechanisms are then formalized into mathematical equations resulting from the causal influence analysis of the interrelationships between the trust criteria, the known factors within the VBE, and the intermediate factors that are defined to link those causal relations. A causal model, as inspired by the discipline of systems engineering, supports the analysis of causal influence inter-relationships among measurable factors (trust criteria, known factors and intermediate factors), while it also supports modeling the nature of influences qualitatively [49]. For example, as shown in Fig. 9, while the factors “cash capital” and “capital” are measured quantitatively, the influence of the cash capital on the capital is qualitatively modeled as positive.
Fig. 9. A causal model of trust criteria associated with the economical perspective
Furthermore, the formulation of mathematical equations from causal models, applying techniques from systems engineering, is thoroughly addressed in [50]. To exemplify the formulation of equations based on the results of analysis and modeling of causal influences, below we present the equations for the two intermediate factors capital (CA) and financial acceptance (FA) (see Fig. 9):
\[ CA = CC + PC + MC \]
and
\[ FA = \frac{SC}{RS} \]
where CC represents cash capital, PC represents physical capital, MC represents material capital, SC represents standards complied, and RS represents required standards.
7.2.4 Approach for Development of the TrustMan Subsystem – Focused on Database Aspects
The TrustMan system is developed to support a number of different users in the VBE with dissimilar roles as well as rights, which means that different services and user interfaces are required for each user. As a part of the system analysis, all potential users of the TrustMan system were identified and classified into groups depending on their roles and rights in the VBE. Then, for each user group a set of functional requirements
was identified, to be supported by the TrustMan system. The classified user groups of the TrustMan system include: the VBE administrator, the VO planner, the VBE member, the VBE membership applicant, the trust expert and the VBE guest. The identified user requirements and their specified services for the TrustMan system are addressed in [51]. Moreover, to enhance interoperability with other sub-systems in the VBE, the design of the TrustMan system adopts the service-oriented architecture and, specifically, the web service standards. In particular, the design of the TrustMan system adopts the layering approach for classifying services. An architecture of the TrustMan system based on the concepts of service-oriented architecture is addressed in [36]. Focusing here only on the information management aspects of the TrustMan system, one important issue is related to the development of the schemas for the implementation of its required database. In order to enhance the interoperability and sharing of data that is managed by the TrustMan system with both the existing/legacy databases at different organizations and other sub-systems of the VBE management system, the relational approach is adopted for the TrustMan database. More specifically, three schemas are developed to support the following: (1) general information related to trust elements, (2) general information about organizations, and (3) specific trust-related data of organizations. These are further defined below.
1. General information related to trust elements - This information constitutes a list and a set of descriptions of trust elements, namely of different trust perspectives, trust requirements, and trust criteria.
2. General information about organizations - This refers to the information that is necessary to accurately describe each physical or virtual organization. For physical organizations, this information may constitute the name, legal registration details, address, and so on. For virtual organizations, this information may constitute, among others, the VO coordinator details, launching and dissolving dates, involved partners, and the customers.
3. Specific trust-related data for organizations - This information constitutes the values of trust criteria for each organization. This information represents primarily the organization’s performance data, expressed in terms of different trust criteria, and is used as the main input data for the services that assess the level of trust in each organization.
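The three schemas, together with the example equations for CA and FA given earlier in this section, can be sketched with a small relational example. This is a hedged illustration only: the table and column names (trust_element, organization, criterion_value) and the function intermediate_factors are hypothetical and do not reproduce the actual TrustMan database design; SQLite is used merely because it ships with Python.

```python
import sqlite3

SCHEMA = """
CREATE TABLE trust_element (          -- (1) general trust-element information
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL CHECK (kind IN ('perspective', 'requirement', 'criterion')),
    name TEXT NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES trust_element(id)
);
CREATE TABLE organization (           -- (2) general organization information
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    is_virtual INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE criterion_value (        -- (3) organization-specific trust data
    organization_id INTEGER NOT NULL REFERENCES organization(id),
    criterion_id INTEGER NOT NULL REFERENCES trust_element(id),
    value REAL NOT NULL,
    PRIMARY KEY (organization_id, criterion_id)
);
"""


def intermediate_factors(conn, organization_id):
    """Compute CA = CC + PC + MC and FA = SC / RS for one organization."""
    rows = conn.execute(
        "SELECT e.name, v.value FROM criterion_value v "
        "JOIN trust_element e ON e.id = v.criterion_id "
        "WHERE v.organization_id = ?",
        (organization_id,),
    ).fetchall()
    values = dict(rows)
    capital = (
        values["cash capital"] + values["physical capital"] + values["material capital"]
    )
    financial_acceptance = values["standards complied"] / values["required standards"]
    return capital, financial_acceptance


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organization (id, name) VALUES (1, 'Org A')")
    sample = {
        "cash capital": 100.0, "physical capital": 250.0, "material capital": 50.0,
        "standards complied": 3.0, "required standards": 4.0,
    }
    for i, (name, value) in enumerate(sample.items(), start=1):
        conn.execute(
            "INSERT INTO trust_element (id, kind, name) VALUES (?, 'criterion', ?)",
            (i, name),
        )
        conn.execute("INSERT INTO criterion_value VALUES (1, ?, ?)", (i, value))
    print(intermediate_factors(conn, 1))
```

Storing criterion values in a separate table keyed by organization and criterion keeps the generic pool of trust elements customizable per VBE without schema changes, which is one way to reflect the generic-pool idea of section 7.2.1.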
8 Conclusion
A main challenging criterion for the success of collaborative networks is the effective management of the wide variety of information that needs to be handled inside the CNs to support their functional dimension. The paper argues that for efficient creation of dynamic opportunity-based collaborative networks, such as virtual organizations and virtual teams, complete and up-to-date information on a wide variety of aspects is necessary. Research and practice have indicated that pre-establishment of supporting long-term strategic alliances can provide the needed environment for creation of cost- and time-effective VOs and VTs. While some manifestations of such strategic alliances already exist, their 2nd generation needs
a much stronger management system, providing functionalities on top of enabling information management systems. This management system is shown to model, organize, and store partly the information gathered from the CN actors, and partly the information generated within the CN itself. The paper first addressed the main challenges of the CNs, together with their requirements for management of information. The paper then focused on the strategic alliances, and specifically on the management of the VBEs, in order to introduce the complexity of their needed functionality. Specific examples of information management challenges were then addressed through the specification of three subsystems of the VBE management system, namely the subsystems handling the engineering of the VBE-ontology, the profile and competency management in VBEs, and the assessment and management of rational trust in VBEs. As illustrated by these examples, collaborative networks raise quite complex challenges, requiring modeling and management of large amounts of heterogeneous and incomplete information, which in turn require a combination of approaches such as distributed/federated databases, ontology engineering, computational intelligence and qualitative modeling.
References
1. Afsarmanesh, H., Camarinha-Matos, L.M.: The ARCON modeling framework. In: Collaborative networks reference modeling, pp. 67–82. Springer, New York (2008)
2. Afsarmanesh, H., Camarinha-Matos, L.M.: Towards a semi-typology for virtual organization breeding environments. In: COA 2007 – 8th IFAC Symposium on Cost-Oriented Automation, Habana, Cuba, vol. 8, part 1, pp. 22(1–12) (2007)
3. Camarinha-Matos, L.M., Afsarmanesh, H.: A comprehensive modeling framework for collaborative networked organizations. The Journal of Intelligent Manufacturing 18(5), 527–615 (2007)
4. Katzy, B., Zang, C., Loh, H.: Reference models for virtual organizations. In: Virtual organizations – Systems and practices, pp. 45–58. Springer, Heidelberg (2005)
5. Camarinha-Matos, L.M., Afsarmanesh, H.: Collaboration forms. In: Collaborative networks reference modeling, pp. 51–66. Springer, New York (2008)
6. Himmelman, A.T.: On coalitions and the transformation of power relations: collaborative betterment and collaborative empowerment. American journal of community psychology 29(2), 277–284 (2001)
7. Pollard, D.: Will that be coordination, cooperation or collaboration? Blog (March 25, 2005), http://blogs.salon.com/0002007/2005/03/25.html#a1090
8. Bamford, J., Ernst, D., Fubini, D.G.: Launching a World-Class Joint Venture. Harvard Business Review 82(2), 90–100 (2004)
9. Blomqvist, K., Hurmelinna, P., Seppänen, R.: Playing the collaboration game right - balancing trust and contracting. Technovation 25(5), 497–504 (2005)
10. Camarinha-Matos, L.M., Afsarmanesh, H.: Collaborative networks: A new scientific discipline. J. Intelligent Manufacturing 16(4-5), 439–452 (2005)
11. Afsarmanesh, H., Camarinha-Matos, L.M.: On the classification and management of virtual organization breeding environments. The International Journal of Information Technology and Management – IJITM 8(3), 234–259 (2009)
12. Giesen, G.: Creating collaboration: A process that works! Greg Giesen & Associates (2002)
13. Ermilova, E., Afsarmanesh, H.: A unified ontology for VO Breeding Environments. In: Proceedings of DHMS 2008 - IEEE International Conference on Distributed Human-Machine Systems, Athens, Greece, pp. 176–181. Czech Technical University Publishing House (2008) ISBN: 978-80-01-04027-0
14. Rabelo, R.: Advanced collaborative business ICT infrastructure. In: Methods and Tools for collaborative networked organizations, pp. 337–370. Springer, New York (2008)
15. Abreu, A., Macedo, P., Camarinha-Matos, L.M.: Towards a methodology to measure the alignment of value systems in collaborative Networks. In: Azevedo, A. (ed.) Innovation in Manufacturing Networks, pp. 37–46. Springer, New York (2008)
16. Romero, D., Galeano, N., Molina, A.: VO breeding Environments Value Systems, Business Models and Governance Rules. In: Methods and Tools for collaborative networked organizations, pp. 69–90. Springer, New York (2008)
17. Rosas, J., Camarinha-Matos, L.M.: Modeling collaboration preparedness assessment. In: Collaborative networks reference modeling, pp. 227–252. Springer, New York (2008)
18. Msanjila, S.S., Afsarmanesh, H.: Trust Analysis and Assessment in Virtual Organizations Breeding Environments. The International Journal of Production Research 46(5), 1253–1295 (2008)
19. Romero, D., Galeano, N., Molina, A.: A conceptual model for Virtual Breeding Environments Value Systems. In: Accepted for publication in Proceedings of PRO-VE 2007 - 8th IFIP Working Conference on Virtual Enterprises. Springer, Heidelberg (2007)
20. Winkler, R.: Keywords and Definitions Around “Collaboration”. SAP Design Guild, 5th edn. (2002)
21. Ermilova, E., Afsarmanesh, H.: Competency modeling targeted on promotion of organizations towards VO involvement. In: The proceedings of PRO-VE 2008 – 9th IFIP Working Conference on Virtual Enterprises, Poznan, Poland, pp. 3–14. Springer, Boston (2008)
22. Brna, P.: Models of collaboration. In: Proceedings of BCS 1998 - XVIII Congresso Nacional da Sociedade Brasileira de Computação, Belo Horizonte, Brazil (1998)
23. Oliveira, A.I., Camarinha-Matos, L.M.: Agreement negotiation wizard. In: Methods and Tools for collaborative networked organizations, pp. 191–218. Springer, New York (2008)
24. Wolff, T.: Collaborative Solutions – True Collaboration as the Most Productive Form of Exchange. In: Collaborative Solutions Newsletter. Tom Wolff & Associates (2005)
25. Kangas, S.: Spectrum Five: Competition vs. Cooperation. The long FAQ on Liberalism (2005), http://www.huppi.com/kangaroo/LiberalFAQ.htm#Backspectrumfive
26. Afsarmanesh, H., Camarinha-Matos, L.M., Ermilova, E.: VBE reference framework. In: Methods and Tools for collaborative networked organizations, pp. 35–68. Springer, New York (2008)
27. Afsarmanesh, H., Msanjila, S.S., Ermilova, E., Wiesner, S., Woelfel, W., Seifert, M.: VBE management system. In: Methods and Tools for collaborative networked organizations, pp. 119–154. Springer, New York (2008)
28. Afsarmanesh, H., Camarinha-Matos, L.M.: Related work on reference modeling for collaborative networks. In: Collaborative networks reference modeling, pp. 15–28. Springer, New York (2008)
29. Tolle, M., Bernus, P., Vesterager, J.: Reference models for virtual enterprises. In: Camarinha-Matos, L.M. (ed.) Collaborative business ecosystems and virtual enterprises, Kluwer Academic Publishers, Boston (2002)
36
H. Afsarmanesh et al.
30. Zachman, J.A.: A Framework for Information Systems Architecture. IBM Systems Journal 26(3) (1987) 31. Camarinha-Matos, L.M., Afsarmanesh, H.: Emerging behavior in complex collaborative networks. In: Collaborative Networked Organizations - A research agenda for emerging business models, ch. 6.2. Kluwer Academic Publishers, Dordrecht (2004) 32. Shuman, J., Twombly, J.: Collaborative Network Management: An Emerging Role for Alliance Management. In: White Paper Series - Collaborative Business, vol. 6. The Rhythm of Business, Inc. (2008) 33. Afsarmanesh, H., Camarinha-Matos, L.M., Msanjila, S.S.: On Management of 2nd Generation Virtual Organizations Breeding Environments. The Journal of Annual Reviews in Control (in press, 2009) 34. Camarinha-Matos, L.M., Afsarmanesh, H.: A framework for Virtual Organization creation in a breeding environment. Int. Journal Annual Reviews in Control 31, 119–135 (2007) 35. Nieto, M.A.M.: An Overview of Ontologies, Technical report, Conacyt Projects No. 35804−A and G33009−A (2003) 36. Ollus, M.: Towards structuring the research on virtual organizations. In: Virtual Organizations: Systems and Practices. Springer Science, Berlin (2005) 37. Guevara-Masis, V., Afsarmanesh, H., Hetzberger, L.O.: Ontology-based automatic data structure generation for collaborative networks. In: Proceedings of 5th PRO-VE 2004 – Virtual Enterprises and Collaborative Networks, pp. 163–174. Kluwer Academic Publishers, Dordrecht (2004) 38. Anjewierden, A., Wielinga, B.J., Hoog, R., Kabel, S.: Task and domain ontologies for knowledge mapping in operational processes. Metis deliverable 2003/4.2. University of Amsterdam (2003) 39. Afsarmanesh, H., Ermilova, E.: Management of Ontology in VO Breeding Environments Domain. To appear in International Journal of Services and Operations Management – IJSOM, special issue on Modelling and Management of Knowledge in Collaborative Networks (2009) 40. Uschold, M., King, M., Moralee, S., Zorgios, Y.: The Enterprise Ontology. 
The Knowledge Engineering Review 13(1), 31–89 (1998) 41. Ding, Y., Fensel, D.: Ontology Library Systems: The key to successful Ontology Re-use. In: Proceedings of the First Semantic Web Working Symposium (2001) 42. Simoes, D., Ferreira, H., Soares, A.L.: Ontology Engineering in Virtual Breeding Environments. In: Proceedings of PRO-VE 2007 conference, pp. 137–146 (2007) 43. Ermilova, E., Afsarmanesh, H.: Modeling and management of Profiles and Competencies in VBEs. J. of Intelligent Manufacturing (2007) 44. Javidan, M.: Core Competence: What does it mean in practice? Long Range planning 31(1), 60–71 (1998) 45. Molina, A., Flores, M.: A Virtual Enterprise in Mexico: From Concepts to Practice. Journal of Intelligent and Robotics Systems 26, 289–302 (1999) 46. Msanjila, S.S., Afsarmanesh, H.: On Architectural Design of TrustMan System Applying HICI Analysis Results. The case of technological perspective in VBEs. The International Journal of Software 3(4), 17–30 (2008) 47. Castelfranchi, C., Falcone, R.: Trust Is Much More than Subjective Probability: Mental Components and Sources of Trust. In: Proceedings of the 33rd Hawaii International Conference on System Sciences (2000)
Information Supporting Functional Dimension of Collaborative Networks
37
48. Msanjila, S.S., Afsarmanesh, H.: Modeling Trust Relationships in Collaborative Networked Organizations. The International Journal of Technology Transfer and Commercialisation; Special issue: Data protection, Trust and Technology 6(1), 40–55 (2007) 49. Pearl, J.: Graphs, causality, and structural equation models. The Journal of Sociological Methods and Research 27(2), 226–264 (1998) 50. Byne, B.M.: Structural equation modeling with EQS: Basic concepts, Applications, and Programming, 2nd edn. Routlege/Academic (2006) 51. Msanjila, S.S., Afsarmanesh, H.: On development of TrustMan system assisting configuration of temporary consortiums. The International Journal of Production Research; Special issue: Virtual Enterprises – Methods and Approaches for Coalition Formation 47(17) (2009)
A Universal Metamodel and Its Dictionary

Paolo Atzeni¹, Giorgio Gianforme², and Paolo Cappellari³

¹ Università Roma Tre, Italy
[email protected]
² Università Roma Tre, Italy
[email protected]
³ University of Alberta, Canada
[email protected]
Abstract. We discuss a universal metamodel aimed at the representation of schemas in a way that is at the same time model-independent (in the sense that it allows for a uniform representation of different data models) and model-aware (in the sense that it is possible to say whether a schema is allowed for a data model). This metamodel can be the basis for the definition of a complete model-management system. Here we illustrate the details of the metamodel and the structure of a dictionary for its representation. We exemplify a concrete use of the dictionary by representing the main data models, such as relational, object-relational or XSD-based. Moreover, we demonstrate how set operators can be redefined with respect to our dictionary and easily applied to it. Finally, we show how such a dictionary can be exploited to automatically produce detailed descriptions of schemas and data models, in a textual (i.e. XML) or visual (i.e. UML class diagram) way.
1 Introduction
Metadata is descriptive information about data and applications. Metadata is used to specify how data is represented, stored, and transformed, or may describe interfaces and behavior of software components. The use of metadata for data processing was reported as early as fifty years ago [22]. Since then, metadata-related tasks and applications have become truly pervasive and metadata management plays a major role in today’s information systems. In fact, the majority of information system problems involve the design, integration, and maintenance of complex application artifacts, such as application programs, databases, web sites, workflow scripts, object diagrams, and user interfaces. These artifacts are represented by means of formal descriptions, called schemas or models, and, consequently, metadata. Indeed, to solve these problems we have to deal with metadata, but it is well known that applications that manipulate metadata are complex and hard to build, because of heterogeneity and impedance mismatch. Heterogeneity arises because data sources are independently developed by different people and for different purposes and subsequently need to be integrated. The data sources may use different data models, different schemas, and different value encodings. Impedance mismatch arises because the logical schemas required by applications are different from the physical ones exposed by data sources.

A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 38–62, 2009. © Springer-Verlag Berlin Heidelberg 2009

The manipulation includes designing mappings (which describe how two schemas are related to each other) between the schemas, generating a schema from another schema along with a mapping between them, modifying a schema or mapping, interpreting a mapping, and generating code from a mapping. In the past, these difficulties have always been tackled in practical settings by means of ad-hoc solutions, for example by writing a program for each specific application. This is clearly very expensive, as it is laborious and hard to maintain.

In order to simplify such manipulation, Bernstein et al. [11,12,23] proposed the idea of a model management system. Its goal is to factor out the similarities of the metadata problems studied in the literature and develop a set of high-level operators that can be utilized in various scenarios. Within such a system, we can treat schemas and mappings as abstractions that can be manipulated by operators that are meant to be generic in the sense that a single implementation of them is applicable to all of the data models. Incidentally, let us remark that in this paper we use the terms “schema” and “data model” as common in the database literature, though some model-management literature follows a different terminology (and uses “model” instead of “schema” and “metamodel” instead of “data model”).

The availability of a uniform and generic description of data models is a prerequisite for designing a model management system. In this paper we discuss a “universal metamodel” (called the supermodel), defined by means of metadata and designed to properly represent “any” possible data model, together with the structure of a dictionary for storing such metadata.

There are many proposals for dictionary structure in the literature.
The use of dictionaries to handle metadata has been popular since the early database systems of the 1970s, initially in systems that were external to those handling the database (see Allen et al. [1] for an early survey). With the advent of relational systems in the 1980s, it became possible to have dictionaries be part of the database itself, within the same model. Today, all DBMSs have such a component. Extensive discussion was also carried out in even more general frameworks, with proposals for various kinds of dictionaries, describing various features of systems (see for example [9,19,21]) within the context of industrial CASE tools and research proposals. More recently, a number of metadata repositories have been developed [26]. They generally use relational databases for handling the information of interest. There are other significant recent efforts towards the description of multiple models, including the Model Driven Architecture (MDA) and, within it, the Common Warehouse Metamodel (CWM) [27], and Microsoft Repository [10]; in contrast to our approach, these do not distinguish metalevels, as the various models of interest are all specializations of a most general, UML-based one. The description of models in terms of the (meta-)constructs of a metamodel was proposed by Atzeni and Torlone [8]. That proposal, however, used a sophisticated graph language, which was hard to implement. The other papers that followed the same or similar approaches [14,15,16,28] also used specific structures.
We know of no literature that describes a dictionary that exposes schemas in both model-specific and model-independent ways, together with a description of models. Only portions of similar dictionaries have been proposed. None of them offers the rich interrelated structure we have here.

The contributions of this paper and its organization are the following. In Section 2 we briefly recall the metamodel approach we follow (based on the initial idea by Atzeni and Torlone [8]). In Section 3 we illustrate the organization of the dictionary we use to store our schemas and models, refining the presentation given in a previous conference paper (Atzeni et al. [3]). In Section 4 we illustrate a specific supermodel used to generalize a large set of models, some of which are also commented upon. Then, in Section 5 we discuss how some interesting operations on schemas can be specified and implemented on the basis of our approach. Section 6 is devoted to the illustration of generic reporting and visualization tools built out of the principles and structure of our dictionary. Finally, in Section 7 we summarize our results.
2 Towards a Universal Metamodel
In this section we summarize the overall approach towards a model-independent and model-aware representation of data models, based on an initial idea by Atzeni and Torlone [8]. The first step toward a uniform solution is the adoption of a general model to properly represent many different data models (e.g. entity-relationship, object-oriented, relational, object-relational, XML). The proposed general model is based on the idea of construct: a construct represents a “structural” concept of a data model. We identify a construct for each “structural” concept of every data model considered and, hence, a data model is completely represented by the set of its constructs. Let us consider two popular data models, entity-relationship (ER) and object-oriented (OO). Indeed, each of them is not “a model,” but “a family of models,” as there are many different proposals for each of them: OO with or without keys, binary and n-ary ER models, OO and ER with or without inheritance, and so on. “Structural” concepts for these data models are, for example, entity, attribute of entity, and binary relationship for the ER and class, field, and reference for the OO. Moreover, constructs have a name, may have properties and are related to one another. A UML class diagram of this construct-based representation of a simple ER model with entities, attributes of entities and binary relationships is depicted in Figure 1. Construct Entity has no attribute and no reference; construct AttributeOfEntity has a boolean property to specify whether an attribute is part of the identifier of the entity it belongs to and a property type to specify the data type of the attribute itself; construct BinaryRelationship has two references toward the entities involved in the relationship and several properties to specify role, minimum and maximum cardinalities of the involved entities, and whether the first entity is externally identified by the relationship itself.
Fig. 1. A simple entity-relationship model
Fig. 2. A simple object-oriented model
With similar considerations about a simple OO model with classes, simple fields (i.e. with standard type) and reference fields (i.e. a reference from a class to another) we obtain the UML class diagram of Figure 2. Construct Class has no attribute and no reference; construct Field is similar to AttributeOfEntity but it does not have boolean attributes, assuming that we do not want to manage explicit identifiers of objects; construct ReferenceField has two references, toward the class that owns the reference and toward the class pointed to by the reference itself. In this way, we have uniform representations of models (in terms of constructs) but these representations are not general. This becomes unfeasible as the number of (variants of) models grows, because it implies a corresponding rise in the number of constructs. To overcome this limit, we exploit an observation of Hull and King [20], drawn on later by Atzeni and Torlone [7]: most known models have constructs that can be classified according to a rather small set of generic (i.e. model-independent) metaconstructs: lexical, abstract, aggregation, generalization, and function. Recalling our example, entities and classes play the same role (or, in other terms, “they have the same meaning”), and so we can define a generic metaconstruct, called Abstract, to represent both these concepts; the same happens for attributes of entities and of relationships and fields of classes, representable by means of a metaconstruct called Lexical. Conversely, relationships and references do not have the same meaning and hence one metaconstruct is not enough to properly represent both concepts (hence BinaryAggregationOfAbstracts and AbstractAttribute are both included). Hence, each model is defined by its constructs and the metaconstructs they refer to. This representation is clearly at the same time model-independent (in
the sense that it allows for a uniform representation of different data models) and model-aware (in the sense that it is possible to say whether a schema is allowed for a data model). An even more important notion is that of supermodel (also called universal metamodel in the literature [13,24]): it is a model that has a construct for each metaconstruct, in the most general version. Therefore, each model can be seen as a specialization of the supermodel, except for renaming of constructs. A conceptual view of the essentials of this idea is shown in Figure 3: the supermodel portion is predefined, but can be extended (and we will present our recent extension later in this paper), whereas models are defined by specifying their respective constructs, each of which refers to a construct of the supermodel (SM-Construct) and so to a metaconstruct. It is important to observe that our approach is independent of the specific supermodel that is adopted, as new metaconstructs and so SM-Constructs can be added. This allows us to show simplified examples for the set of constructs, without losing the generality of the approach. In this scenario, a schema for a certain model is a set of instances of constructs allowed in that model. Let us consider the simple ER schema depicted in Figure 4. Its construct-based representation would include two instances of Entity (i.e. Employee and Project), one instance of BinaryRelationship (i.e. Membership) and four instances of AttributeOfEntity (i.e. EN, Name, Code, and Name). The model-independent representation (i.e. based on metaconstructs) would include two instances of Abstract, one instance of BinaryAggregationOfAbstracts and four instances of Lexical. For each of these instances we have to specify values for its attributes and references, meaningful for the model.
So, for example, the instance of Lexical corresponding to EN would refer to the instance of Abstract for Employee through its abstractOID reference and would have a ‘true’ value for its isIdentifier attribute. This example is illustrated in Figure 5, where we omit irrelevant properties, represent references only by means of arrows, and represent links between constructs and their instances by means of dashed arrows. In the same way, we can state that a database for a certain schema is a set of instances of constructs of that schema.
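The step from the construct-based to the model-independent representation of the ER schema of Figure 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Instance` class, the `ER_TO_SUPERMODEL` table, the model-specific names `isKey`, `entity`, `entity1`, `entity2`, the generic names `abstract1OID` and `abstract2OID`, and the key flags of the attributes other than EN are assumptions made for the example; only `abstractOID` and `isIdentifier` are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One instance of a construct (hypothetical helper class)."""
    oid: str
    construct: str
    name: str
    properties: dict = field(default_factory=dict)
    references: dict = field(default_factory=dict)


# ER construct -> (metaconstruct, property renaming, reference renaming).
# The renamings are assumptions made for this sketch.
ER_TO_SUPERMODEL = {
    "Entity": ("Abstract", {}, {}),
    "AttributeOfEntity": ("Lexical",
                          {"isKey": "isIdentifier"},
                          {"entity": "abstractOID"}),
    "BinaryRelationship": ("BinaryAggregationOfAbstracts",
                           {},
                           {"entity1": "abstract1OID",
                            "entity2": "abstract2OID"}),
}

# Construct-based (model-specific) representation of the schema of Figure 4.
er_schema = [
    Instance("e1", "Entity", "Employee"),
    Instance("e2", "Entity", "Project"),
    Instance("a1", "AttributeOfEntity", "EN", {"isKey": True}, {"entity": "e1"}),
    Instance("a2", "AttributeOfEntity", "Name", {"isKey": False}, {"entity": "e1"}),
    Instance("a3", "AttributeOfEntity", "Code", {"isKey": True}, {"entity": "e2"}),
    Instance("a4", "AttributeOfEntity", "Name", {"isKey": False}, {"entity": "e2"}),
    Instance("r1", "BinaryRelationship", "Membership", {},
             {"entity1": "e1", "entity2": "e2"}),
]


def to_supermodel(instance):
    """Re-express a model-specific instance in terms of its metaconstruct."""
    meta, prop_names, ref_names = ER_TO_SUPERMODEL[instance.construct]
    return Instance(
        instance.oid,
        meta,
        instance.name,
        {prop_names.get(k, k): v for k, v in instance.properties.items()},
        {ref_names.get(k, k): v for k, v in instance.references.items()},
    )


# Model-independent representation: same instances, generic names.
sm_schema = [to_supermodel(i) for i in er_schema]
```

The translation keeps the object identifiers and only renames constructs, properties and references, which mirrors the remark that a model is a specialization of the supermodel except for renaming of constructs.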
Fig. 3. A simplified conceptual view of models and constructs
Fig. 4. A simple entity-relationship schema
Fig. 5. A construct based representation of the schema of Figure 4
As a second example, let us consider the simple OO schema depicted in Figure 6. Its construct-based representation would include two instances of Class (i.e. Employee and Department), one instance of ReferenceField (i.e. Membership) and five instances of Field (i.e. EmpNo, Name, Salary, DeptNo, and DeptName). Alternatively, the model-independent representation (i.e. based on metaconstructs) would include two instances of Abstract, one instance of AbstractAttribute and five instances of Lexical. On the other hand, it is possible to use the same approach based on “concepts of interest” in order to obtain a high-level description of the supermodel (i.e. of the whole set of metaconstructs). From this point of view the concepts of interest are three: construct, construct property and construct reference. In this way we have a full description of the supermodel, with constructs, properties and references, as follows. Each construct has a name and a boolean attribute (isLexical) that
Fig. 6. A simple object-oriented schema
Fig. 7. A description of the supermodel
specifies whether its instances have actual, elementary values associated with them (for example, this property would be true for AttributeOfAbstract and false for Abstract). Each property belongs to a construct and has a name and a type. Each reference relates two constructs and has a name. A UML class diagram of this representation is presented in Figure 7.
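The three concepts of interest (construct, construct property, construct reference) translate directly into three small record types. The following Python sketch is an assumption-laden illustration rather than the authors' code: the class and field names and the helper `describe` are hypothetical, and the sample content covers only the Abstract and Lexical metaconstructs discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construct:
    name: str
    is_lexical: bool  # True when instances carry elementary values


@dataclass(frozen=True)
class Property:
    name: str
    construct: Construct  # the construct the property belongs to
    type: str


@dataclass(frozen=True)
class Reference:
    name: str
    construct: Construct     # the construct the reference starts from
    construct_to: Construct  # the construct the reference points to


# A fragment of the supermodel description (illustrative content).
abstract = Construct("Abstract", is_lexical=False)
lexical = Construct("Lexical", is_lexical=True)

properties = [
    Property("Name", abstract, "string"),
    Property("Name", lexical, "string"),
    Property("IsIdentifier", lexical, "bool"),
    Property("Type", lexical, "string"),
]
references = [Reference("Abstract", lexical, abstract)]


def describe(construct):
    """Collect the properties and references declared for a construct."""
    return {
        "name": construct.name,
        "isLexical": construct.is_lexical,
        "properties": [(p.name, p.type) for p in properties
                       if p.construct == construct],
        "references": [(r.name, r.construct_to.name) for r in references
                       if r.construct == construct],
    }
```

Calling `describe(lexical)` gathers everything that is needed to know how an instance of Lexical is shaped, which is the information a dictionary has to store for each metaconstruct.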
3 A Multilevel Dictionary
The conceptual approach to the description of models and schemas presented in Section 2, although very useful for introducing the approach, is not suited to actually storing data and metadata. Therefore, we have developed a relational implementation of the idea, leading to a multilevel dictionary organized in four parts, which can be characterized along two coordinates: the first corresponding to whether they describe models or schemas and the second depending on whether they refer to specific models or to the supermodel. This is represented in Figure 8.
                                     | model specific  | model generic
model descriptions (the “metalevel”) | metamodels (mM) | meta-supermodel (mSM)
schema descriptions                  | models (M)      | supermodel (SM)

Fig. 8. The four parts of the dictionary
The various portions of the dictionary correspond to various UML class diagrams of Section 2. In the rest of this section, we comment on them in detail. The meta-supermodel part of the dictionary describes the supermodel, that is, the set of constructs used for building schemas of various models. It is composed of three relations (whose names begin with MSM to recall that we are in the meta-supermodel portion), one for each “class” of the diagram of Figure 7. Every relation has one OID column and one column for each attribute and reference of the corresponding “class” of such a diagram. The relations of this part of the dictionary, with some of the data, are depicted in Figure 9. It is worth noting that these relations are rather small, because of the limited number of constructs in our supermodel. The metamodels part of the dictionary describes the individual models, that is, the set of specific constructs allowed in the various models, each one corresponding to a construct of the supermodel. It has the same structure as the meta-supermodel part with two differences: first, each relation has an extra column containing a reference towards the corresponding element of the supermodel (i.e. of the meta-supermodel part of the dictionary); second, there is an extra relation to store the names of the specific models and an extra column in the Construct relation referring to this extra relation. The relations of this part of the dictionary, with some of the data, are depicted in Figure 10. We refer to these first two parts as the “metalevel” of the dictionary, as it contains the description of the structure of the lower level, whose content describes schemas. The lower level is also composed of two parts, one referring to the supermodel constructs (therefore called the SM part) and the other to model-specific constructs (the M part). The structure of the schema level is, in
MSM Construct
OID | Name | IsLexical
mc1 | Abstract | false
mc2 | Lexical | true
mc3 | Aggregation | false
mc4 | BinaryAggregationOfAbstracts | false
mc5 | AbstractAttribute | false
... | ... | ...

MSM Property
OID | Name | Construct | Type
mp1 | Name | mc1 | string
mp2 | Name | mc2 | string
mp3 | IsIdentifier | mc2 | bool
mp4 | IsOptional | mc2 | bool
mp5 | Type | mc2 | string
... | ... | ... | ...

MSM Reference
OID | Name | Construct | ConstructTo
mr1 | Abstract | mc2 | mc1
mr2 | Aggregation | mc2 | mc3
mr3 | Abstract1 | mc4 | mc1
mr4 | Abstract2 | mc4 | mc1
mr5 | Abstract | mc5 | mc1
mr6 | AbstractTo | mc5 | mc1
... | ... | ... | ...

Fig. 9. The mSM part of the dictionary
MM Model
OID | Name
m1 | ER
m2 | OODB

MM Construct
OID | Name | Model | MSM-Constr. | IsLexical
co1 | Entity | m1 | mc1 | false
co2 | AttributeOfEntity | m1 | mc2 | true
co3 | BinaryRelationship | m1 | mc4 | false
co4 | Class | m2 | mc1 | false
co5 | Field | m2 | mc2 | true
co6 | ReferenceField | m2 | mc5 | false

MM Property
OID | Name | Constr. | Type | MSM-Prop.
pr1 | Name | co1 | string | mp1
pr2 | Name | co2 | string | mp2
pr3 | IsKey | co2 | bool | mp3
pr4 | Name | co3 | string | ...
pr5 | IsOpt.1 | co3 | bool | ...
... | ... | ... | ... | ...
pr6 | Name | co4 | string | mp1
pr7 | Name | co5 | string | mp2
... | ... | ... | ... | ...

MM Reference
OID | Name | Constr. | Constr.To | MSM-Ref.
ref1 | Entity | co2 | co1 | mr1
ref2 | Entity | co3 | co1 | mr3
ref3 | Entity | co3 | co1 | mr4
ref4 | Class | co5 | co4 | mr1
ref5 | Class | co6 | co4 | mr5
ref6 | ClassTo | co6 | co4 | mr6

Fig. 10. The mM part of the dictionary
our system, automatically generated out of the content of the metalevel: so, we can say that the dictionary is self-generating out of a small core. In detail, in the model part there is one relation for each row of the MM Construct relation. Hence each of these relations corresponds to a construct and has, besides an OID column, one column for each property and reference specified for that construct in relations MM Property and MM Reference, respectively. Moreover, there is a relation Schema to store the names of the schemas stored in the dictionary and each relation has an extra column referring to it. Hence, in practice, there is a set of relations for each specific model, with one relation for each construct allowed in the model. This portion of the dictionary is depicted in Figure 11, where we show the data for the schemas of Figures 4 and 6. Analogously, in the supermodel part there is one relation for each row of the MSM Construct relation; hence each one of these relations corresponds to a metaconstruct (or a construct of the supermodel) and has, besides an OID column, one column for each property and reference specified for that metaconstruct in relations MSM Property and MSM Reference, respectively. Again, there is a relation Schema to store the names of the schemas stored in the dictionary and each relation has an extra column referring to it. Moreover, the Schema relation has an extra column referring to the specific model each schema belongs to. This portion of the dictionary is depicted in Figure 12, where we show the data for the schemas of Figures 4 and 6, and hence we show the same data presented in Figure 11. It is worth noting that Abstract contains the same data as ER-Entity and OO-Class taken together. Similarly, AttributeOfAbstract contains data in ER-AttributeOfEntity and OO-Field.
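The self-generating behaviour described above can be illustrated with a minimal sketch based on Python's standard `sqlite3` module. It is not the authors' implementation. The underscore-separated table names (`MM_Construct`, `ER_Entity`, and so on), the `SQL_TYPES` mapping and the function `generate_model_tables` are assumptions for the example, the columns pointing to the meta-supermodel are omitted for brevity, and only a few metalevel rows of Figure 10 are loaded.

```python
import sqlite3

# Assumed mapping from dictionary types to SQLite column types.
SQL_TYPES = {"string": "TEXT", "bool": "INTEGER", "int": "INTEGER"}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MM_Model     (OID TEXT PRIMARY KEY, Name TEXT);
CREATE TABLE MM_Construct (OID TEXT PRIMARY KEY, Name TEXT, Model TEXT);
CREATE TABLE MM_Property  (OID TEXT PRIMARY KEY, Name TEXT, Constr TEXT, Type TEXT);
CREATE TABLE MM_Reference (OID TEXT PRIMARY KEY, Name TEXT, Constr TEXT, ConstrTo TEXT);
CREATE TABLE "Schema"     (OID TEXT PRIMARY KEY, Name TEXT);
""")

# A subset of the metalevel content shown in Figure 10.
conn.executemany("INSERT INTO MM_Model VALUES (?, ?)",
                 [("m1", "ER"), ("m2", "OODB")])
conn.executemany("INSERT INTO MM_Construct VALUES (?, ?, ?)", [
    ("co1", "Entity", "m1"), ("co2", "AttributeOfEntity", "m1"),
    ("co4", "Class", "m2"), ("co5", "Field", "m2")])
conn.executemany("INSERT INTO MM_Property VALUES (?, ?, ?, ?)", [
    ("pr1", "Name", "co1", "string"), ("pr2", "Name", "co2", "string"),
    ("pr3", "IsKey", "co2", "bool"), ("pr6", "Name", "co4", "string"),
    ("pr7", "Name", "co5", "string")])
conn.executemany("INSERT INTO MM_Reference VALUES (?, ?, ?, ?)", [
    ("ref1", "Entity", "co2", "co1"), ("ref4", "Class", "co5", "co4")])


def generate_model_tables(conn):
    """Create one schema-level relation per row of MM_Construct."""
    created = []
    constructs = conn.execute(
        "SELECT c.OID, c.Name, m.Name FROM MM_Construct c "
        "JOIN MM_Model m ON c.Model = m.OID ORDER BY c.OID").fetchall()
    for oid, construct_name, model_name in constructs:
        columns = ["OID TEXT PRIMARY KEY"]
        # One column per property declared for the construct.
        for name, type_ in conn.execute(
                "SELECT Name, Type FROM MM_Property WHERE Constr = ? "
                "ORDER BY OID", (oid,)).fetchall():
            columns.append(f'"{name}" {SQL_TYPES[type_]}')
        # One column per reference declared for the construct.
        for (name,) in conn.execute(
                "SELECT Name FROM MM_Reference WHERE Constr = ? "
                "ORDER BY OID", (oid,)).fetchall():
            columns.append(f'"{name}" TEXT')
        # Extra column referring to the Schema relation.
        columns.append('"Schema" TEXT REFERENCES "Schema"(OID)')
        table = f"{model_name}_{construct_name}"
        conn.execute(f'CREATE TABLE "{table}" ({", ".join(columns)})')
        created.append(table)
    return created


tables = generate_model_tables(conn)
```

Adding a row to `MM_Construct` (with its properties and references) and running the generator again on a fresh database is enough to obtain the corresponding schema-level relation, which is the sense in which the lower level is derived from a small core.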
Schema
OID | Name
s1 | ER Schema
s2 | OO Schema

ER-Entity
OID | Name | Schema
e1 | Employee | s1
e2 | Project | s1

ER-AttributeOfEntity
OID | Entity | Name | Type | isKey | Schema
a1 | e1 | EN | int | true | s1
a2 | e1 | Name | string | false | s1
a3 | e2 | Code | int | true | s1
a4 | e2 | Name | string | false | s1

ER-BinaryRelationship
OID | Name | IsOptional1 | IsFunctional1 | ... | Entity1 | Entity2 | Schema
r1 | Membership | false | false | ... | e1 | e2 | s1

OO-Class
OID | Name | Schema
cl1 | Employee | s2
cl2 | Department | s2

OO-ReferenceField
OID | Name | Class | ClassTo | Schema
ref1 | Membership | cl1 | cl2 | s2

OO-Field
OID | Class | Name | Type | Schema
f1 | cl1 | EmpNo | int | s2
f2 | cl1 | Name | string | s2
f3 | cl1 | Salary | int | s2
f4 | cl2 | DeptNo | int | s2
f5 | cl2 | DeptName | string | s2

Fig. 11. The dictionary for schemas of specific models

Schema
OID | Name | Model
s1 | ER Schema | m1
s2 | OO Schema | m2

Abstract
OID | Name | Schema
e1 | Employee | s1
e2 | Project | s1
cl1 | Employee | s2
cl2 | Department | s2

Lexical
OID | Abstract | Name | Type | IsIdentifier | Schema
a1 | e1 | EN | int | true | s1
a2 | e1 | Name | string | false | s1
a3 | e2 | Code | int | true | s1
a4 | e2 | Name | string | false | s1
f1 | cl1 | EmpNo | int | ? | s2
f2 | cl1 | Name | string | ? | s2
f3 | cl1 | Salary | int | ? | s2
f4 | cl2 | DeptNo | int | ? | s2
f5 | cl2 | DeptName | string | ? | s2

AbstractAttribute
OID | Name | Abstract | AbstractTo | Schema
ref1 | Membership | cl1 | cl2 | s2

BinaryRelationship
OID | Name | IsOptional1 | IsFunctional1 | ... | Entity1 | Entity2 | Schema
r1 | Membership | false | false | ... | e1 | e2 | s1

Fig. 12. A portion of the SM part of the dictionary
4 A Significant Supermodel with Models of Interest
As we said, our approach is fully extensible: it is possible to add new metaconstructs to represent new data models, as well as to refine and increase the precision of the current representations of models. The supermodel we have mainly experimented with so far is a supermodel for database models and covers a reasonable family of them. If models were more detailed (as is the case for a fully-fledged XSD model) then the supermodel would be more complex. Moreover, other supermodels can be used in different contexts: we have had preliminary experiences with Semantic Web models [5,6,18], with the management of annotations [25], and with adaptive systems [17]. In this section we discuss in detail our current supermodel. We describe all the metaconstructs of the supermodel, explaining which concepts they represent and how they can be used to properly represent several well-known data models. A complete description of all the metaconstructs follows:

Abstract - Any autonomous concept of the scenario.
Aggregation - A collection of elements with heterogeneous components. It makes no sense without its components.
StructOfAttributes - A structured element of an Aggregation, an Abstract, or another StructOfAttributes. It may not always be present (isOptional) and/or may admit null values (isNullable). It may be multivalued or not (isSet).
AbstractAttribute - A reference towards an Abstract that could admit null values (isNullable). The reference may originate from an Abstract, an Aggregation, or a StructOfAttributes.
Generalization - A “structural” construct stating that an Abstract is the root of a hierarchy, possibly total (isTotal).
ChildOfGeneralization - Another “structural” construct, related to the previous one (it cannot be used without Generalization). It is used to specify that an Abstract is a leaf of a hierarchy.
Nest - A “structural” construct used to specify a nesting relationship between StructOfAttributes.
BinaryAggregationOfAbstracts - Any binary correspondence between (two) Abstracts. It is possible to specify optionality (isOptional1/2) and functionality (isFunctional1/2) of the involved Abstracts as well as their role (role1/2) or whether one of the Abstracts is identified in some way by such a correspondence (isIdentified).
AggregationOfAbstracts - Any n-ary correspondence between two or more Abstracts.
ComponentOfAggregationOfAbstracts - It states that an Abstract is one of those involved in an AggregationOfAbstracts (and hence cannot be used without AggregationOfAbstracts). It is possible to specify optionality (isOptional1/2) and functionality (isFunctional1/2) of the involved Abstract as well as whether the Abstract is identified in some way by such a correspondence (isIdentified).
Lexical - Any lexical value useful to specify features of Abstract, Aggregation, StructOfAttributes, AggregationOfAbstracts, or BinaryAggregationOfAbstracts. It is a typed attribute (type) that could admit null values, be optional, and be an identifier of the object it refers to (the latter is not applicable to Lexical of StructOfAttributes, BinaryAggregationOfAbstracts, and AggregationOfAbstracts).
ForeignKey - A “structural” construct stating the existence of some kind of referential integrity constraint between Abstract, Aggregation and/or StructOfAttributes, in every possible combination.
ComponentOfForeignKey - Another “structural” construct, related to the previous one (it cannot be used without ForeignKey). It is used to specify which Lexical attributes are involved (i.e. referring and referred) in a referential integrity constraint.

A UML class diagram of these (meta)constructs is presented in Figure 13.
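Since every model is a selection of supermodel constructs, the model-aware side of the approach can be sketched as a simple membership check. The Python fragment below is a hedged illustration only. The dictionary `MODELS`, the table `REQUIRES` and the function `violations` are hypothetical names; the construct sets follow the relational and binary ER correspondences given in this section, and the dependency rules encode the "cannot be used without" remarks in the list above. Constraints on property values are not covered.

```python
# Supermodel constructs allowed in each model (illustrative selection).
MODELS = {
    "Relational": {"Aggregation", "Lexical",
                   "ForeignKey", "ComponentOfForeignKey"},
    "BinaryER": {"Abstract", "Lexical", "BinaryAggregationOfAbstracts",
                 "Generalization", "ChildOfGeneralization"},
}

# Constructs that cannot be used without another construct.
REQUIRES = {
    "ChildOfGeneralization": "Generalization",
    "ComponentOfAggregationOfAbstracts": "AggregationOfAbstracts",
    "ComponentOfForeignKey": "ForeignKey",
}


def violations(schema_constructs, model):
    """List the reasons why a schema is not allowed for a model.

    schema_constructs holds the metaconstruct name of every instance
    in the schema; an empty result means the schema is allowed.
    """
    allowed = MODELS[model]
    used = set(schema_constructs)
    problems = [f"{c} is not allowed in {model}"
                for c in sorted(used - allowed)]
    problems += [f"{c} requires {r}"
                 for c, r in sorted(REQUIRES.items())
                 if c in used and r not in used]
    return problems


# Metaconstructs used by the ER schema of Figure 4.
figure4 = ["Abstract", "Abstract", "BinaryAggregationOfAbstracts",
           "Lexical", "Lexical", "Lexical", "Lexical"]
```

With these definitions, `violations(figure4, "BinaryER")` returns an empty list, whereas checking the same schema against `"Relational"` reports the two constructs that the relational model does not include.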
Fig. 13. The Supermodel
We summarize constructs and (families of) models in Figure 14, where we show a matrix, whose rows correspond to the constructs and columns to the families we have experimented with. In the cells, we use the specific name used for the construct in the family (for example, Abstract is called Entity in the ER model). The various models within
Fig. 14. Constructs and models
a family differ from one another (i) on the basis of the presence or absence of specific constructs and (ii) on the basis of details of (constraints on) them. To give an example for (i) let us recall that versions of the ER model could have generalizations, or not have them, and the OR model could have structured columns or just simple ones. For (ii) we can just mention again the various restrictions on relationships in the binary ER model (general vs. one-to-many), which can be specified by means of constraints on the properties. It is also worth mentioning that a given construct can be used in different ways (again, on the basis of conditions on the properties) in different families: for example, a structured attribute could be multivalued, or not, on the basis of the value of a property isSet. The remainder of this section is devoted to a detailed description of the various models.

4.1 Relational
We consider a relational model with tables composed of columns of a specified type; each column may allow null values or be part of the primary key of the table. Moreover, we can specify foreign keys between tables involving one or more columns. Figure 15 shows a UML class diagram of the constructs allowed in the relational model with the following correspondences:
Table - Aggregation.
Column - Lexical. We can specify the data type of the column (type) and whether it is part of the primary key (isIdentifier) or allows null values (isNullable). It has a reference toward an Aggregation.
A Universal Metamodel and Its Dictionary
Fig. 15. The Relational model
Foreign Key - ForeignKey and ComponentOfForeignKey. With the first construct (referencing two Aggregations) we specify the existence of a foreign key between two tables; with the second construct (referencing one ForeignKey and two Lexicals) we specify the columns involved in a foreign key.
4.2 Binary ER
We consider a binary ER model with entities and relationships together with their attributes and generalizations (total or not). Each attribute could be optional or part of the identifier of an entity. For each relationship we specify minimum and maximum cardinality and whether an entity is externally identified by it. Figure 16 shows a UML class diagram of the constructs allowed in the model with the following correspondences:
Entity - Abstract.
Attribute of Entity - Lexical. We can specify the data type of the attribute (type) and whether it is part of the identifier (isIdentifier) or is optional (isOptional). It refers to an Abstract.
Relationship - BinaryAggregationOfAbstracts. We can specify the minimum (0 or 1, with the property isOptional) and maximum (1 or N, with the property isFunctional) cardinality of the involved entities (referenced by the construct). Moreover, we can specify the role (role) of the involved entities and whether the first entity is externally identified by the relationship (IsIdentified).
Attribute of Relationship - Lexical. We can specify the data type of the attribute (type) and whether it is optional (isOptional). It refers to a BinaryAggregationOfAbstracts.
Generalization - Generalization and ChildOfGeneralization. With the first construct (referencing an Abstract) we specify the existence of a generalization
Fig. 16. The binary ER model
rooted in the referenced Entity; with the second construct (referencing one Generalization and one Abstract) we specify the children of the generalization. We can specify whether the generalization is total or not (isTotal).
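As an illustration of the correspondences above, the following Python sketch encodes a small binary ER schema (two entities, Employee and Project, linked by a Membership relationship) as occurrences of supermodel constructs. The `Construct` class, the reference names (`abstract`, `abstract1`, `abstract2`), the OIDs, and the helper `lexicals_of` are assumptions made for illustration; they are not the actual storage format of the dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Construct:
    """One occurrence of a supermodel construct (hypothetical layout)."""
    oid: str
    metaconstruct: str                               # e.g. "Abstract", "Lexical"
    name: str
    properties: dict = field(default_factory=dict)   # property name -> value
    references: dict = field(default_factory=dict)   # reference name -> target OID


# A tiny binary ER schema; only a few properties are filled in.
schema = [
    Construct("e1", "Abstract", "Employee"),
    Construct("e2", "Abstract", "Project"),
    Construct("a1", "Lexical", "EN",
              {"type": "int", "isIdentifier": True, "isOptional": False},
              {"abstract": "e1"}),
    Construct("a2", "Lexical", "Name",
              {"type": "string", "isIdentifier": False, "isOptional": False},
              {"abstract": "e1"}),
    Construct("r1", "BinaryAggregationOfAbstracts", "Membership",
              {"isOptional1": False, "isFunctional1": False},
              {"abstract1": "e1", "abstract2": "e2"}),
]


def lexicals_of(schema, owner_oid):
    """Names of the Lexicals that refer to the construct with the given OID."""
    return [c.name for c in schema
            if c.metaconstruct == "Lexical"
            and owner_oid in c.references.values()]


print(lexicals_of(schema, "e1"))  # ['EN', 'Name']
```

The same list-of-occurrences layout would work for any other model of Section 4, since only the metaconstruct names and the allowed properties change.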
4.3 N-Ary ER
We consider an n-ary ER model with the same features as the aforementioned binary ER model. Figure 17 shows a UML class diagram of the constructs allowed in the model with the following correspondences (we omit details already explained):
Entity - Abstract.
Attribute of Entity - Lexical.
Relationship - AggregationOfAbstracts and ComponentOfAggregationOfAbstracts. With the first construct we specify the existence of a relationship; with the second construct (referencing an AggregationOfAbstracts and an Abstract) we specify the entities involved in such a relationship. We can specify the minimum (0 or 1, with the property isOptional) and maximum (1 or N, with the property isFunctional) cardinality of the involved entities. Moreover, we can specify whether an entity is externally identified by the relationship (IsIdentified).
Fig. 17. The n-ary ER model
Attribute of Relationship - Lexical. It refers to an AggregationOfAbstracts.
Generalization - Generalization and ChildOfGeneralization.
4.4 Object-Oriented
We consider an Object-Oriented model with classes and with simple and reference fields. We can also specify generalizations of classes. Figure 18 shows a UML class diagram of the constructs allowed in the model with the following correspondences (we omit details already explained):
Class - Abstract.
Field - Lexical.
Reference Field - AbstractAttribute. It has two references, toward the referencing Abstract and the referenced one.
Generalization - Generalization and ChildOfGeneralization.
4.5 Object-Relational
We consider a simplified version of the Object-Relational model. We merge the constructs of our Relational and OO models, where we have typed tables rather
Fig. 18. The OO model
than classes. Moreover, we consider structured columns of tables (typed or not) that can be nested. Reference columns must point toward a typed table but can be part of a table (typed or not) or of a structured column. Foreign keys can also involve typed tables and structured columns. Finally, we can specify generalizations that can involve only typed tables. Figure 19 shows a UML class diagram of the constructs allowed in the model with the following correspondences (we omit details already explained):
Table - Aggregation.
Typed Table - Abstract.
Structured Column - StructOfAttributes and Nest. The structured column, represented by a StructOfAttributes, can allow null values or not (isNullable) and can be part of a simple table or of a typed table (this is specified by its references toward Abstract and Aggregation). We can specify nesting relationships between structured columns by means of Nest, which has two references, toward the top StructOfAttributes and the nested one.
Column - Lexical. It can be part of (i.e. refer to) a simple table, a typed table, or a structured column.
Reference Column - AbstractAttribute. It may be part of a table (typed or not) or of a structured column (specified by a reference) and must refer to a typed table (i.e. it has a reference toward an Abstract).
Fig. 19. The Object-Relational model
Foreign Key - ForeignKey and ComponentOfForeignKey. With the first construct (referencing two tables, typed or not, and a structured column) we specify the existence of a foreign key between tables (typed or not) and structured columns; with the second construct (referencing one ForeignKey and two Lexicals) we specify the columns involved in a foreign key.
Generalization - Generalization and ChildOfGeneralization.
4.6 XSD as a Data Model
XSD is a very powerful technique for organizing documents and data, described by a very long specification. We consider a simplified version of the XSD language. We are only interested in documents that can be used to store large amounts of data. Hence, we consider documents with at least one unbounded top-level element. We then deal with elements that can be simple or complex (i.e. structured). For these elements we can specify whether they are optional and whether they can be null (nillable, according to the syntax and terminology of XSD). Simple elements have an associated type and could be part of the key of the element they belong to. Moreover, we allow the definition of foreign keys (key and keyref, according to XSD terminology). Clearly, this representation is highly simplified but, as we said, it could be extended with other features if there were interest in them.
Figure 20 shows a UML class diagram of the constructs allowed in the model with the following correspondences (we omit details already explained):
Fig. 20. The XSD language
Root Element - Abstract.
Complex Element - StructOfAttributes and Nest. The first construct represents structured elements that can be unbounded or not (isSet), can allow null values or not (isNullable), and can be optional (isOptional). We can specify nesting relationships between complex elements by means of Nest, which has two references, toward the top StructOfAttributes and the nested one.
Simple Element - Lexical. It can be part of (i.e. refer to) a root element or a complex one.
Foreign Key - ForeignKey and ComponentOfForeignKey.
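The correspondences listed in Sections 4.1 to 4.6 can be collected in a single lookup structure, in the spirit of the matrix of Figure 14. The Python sketch below is only an assumption about how such a table could be held in memory; the dictionary name `CORRESPONDENCES` and the two helper functions are hypothetical, while the construct names are those given in the text.

```python
# Model-specific construct name -> supermodel (meta)constructs, per family.
GENERALIZATION = ["Generalization", "ChildOfGeneralization"]
FOREIGN_KEY = ["ForeignKey", "ComponentOfForeignKey"]

CORRESPONDENCES = {
    "Relational": {
        "Table": ["Aggregation"],
        "Column": ["Lexical"],
        "Foreign Key": FOREIGN_KEY,
    },
    "Binary ER": {
        "Entity": ["Abstract"],
        "Attribute of Entity": ["Lexical"],
        "Relationship": ["BinaryAggregationOfAbstracts"],
        "Attribute of Relationship": ["Lexical"],
        "Generalization": GENERALIZATION,
    },
    "N-ary ER": {
        "Entity": ["Abstract"],
        "Attribute of Entity": ["Lexical"],
        "Relationship": ["AggregationOfAbstracts",
                         "ComponentOfAggregationOfAbstracts"],
        "Attribute of Relationship": ["Lexical"],
        "Generalization": GENERALIZATION,
    },
    "Object-Oriented": {
        "Class": ["Abstract"],
        "Field": ["Lexical"],
        "Reference Field": ["AbstractAttribute"],
        "Generalization": GENERALIZATION,
    },
    "Object-Relational": {
        "Table": ["Aggregation"],
        "Typed Table": ["Abstract"],
        "Structured Column": ["StructOfAttributes", "Nest"],
        "Column": ["Lexical"],
        "Reference Column": ["AbstractAttribute"],
        "Foreign Key": FOREIGN_KEY,
        "Generalization": GENERALIZATION,
    },
    "XSD": {
        "Root Element": ["Abstract"],
        "Complex Element": ["StructOfAttributes", "Nest"],
        "Simple Element": ["Lexical"],
        "Foreign Key": FOREIGN_KEY,
    },
}


def metaconstructs(family):
    """Set of supermodel constructs used by one family of models."""
    return {m for metas in CORRESPONDENCES[family].values() for m in metas}


def names_for(metaconstruct):
    """Names under which a supermodel construct appears in each family."""
    result = {}
    for family, table in CORRESPONDENCES.items():
        names = [n for n, metas in table.items() if metaconstruct in metas]
        if names:
            result[family] = names
    return result


print(names_for("Abstract"))
```

Reading the table by row (`names_for`) gives the renaming of one metaconstruct across families; reading it by column (`metaconstructs`) gives the set of constructs that a family may use.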
5 Operators over Schema and Models
The model-independent and model-aware representation of data models and schemas can be the basis for many fruitful applications. Our first major application has been the development of a model-independent approach for schema and data translation [4] (a generic implementation of the modelgen operator, according to Bernstein’s model management [11]). We are currently working on additional applications, towards a more general model management system [2], the most interesting of which is related to set operators (i.e. union, difference, intersection). In this section we discuss the redefinition of these operators against our construct-based representation. Let us concentrate on models first. The starting point is clearly the definition of an equality function between constructs. Two
constructs belonging to two models are equal if and only if they correspond to the same metaconstruct, have the same properties with the same values, and, if they have references, they have the same references with the same values (i.e. the same number of references, towards constructs proved to be equal). Two observations are needed. First, we can refer to supermodel constructs without loss of generality, as every construct of every specific model corresponds to a (meta)construct of the supermodel, as we said in Section 2. Second, the definition is recursive but well defined, since the graph of the supermodel (i.e. with constructs as nodes and references between constructs as edges) is acyclic; this implies that a partial order on the constructs can be found, and all the equality checks between constructs can be performed by traversing the graph according to such a partial order. The union of two models is trivial, as we simply have to include in the result the constructs of both the involved models. For difference and intersection, we need the aforementioned definition of equality between constructs. When one of these operators is applied, for each construct of the first model, we look for an equal construct in the second model. If the operator is the difference, the result is composed of all the constructs of the first model that have no equal construct in the second model; if the operator is the intersection, the result is composed only of the constructs of the first model that have an equal construct in the second model. A very similar approach can be followed for set operators on schemas, which are usually called merge and diff [11], and which we can implement in terms of union and difference, provided they are supported by a suitable notion of equivalence.
Some care is needed with the details, but the basic idea is that the operators can be implemented by executing the set operations on the constructs of the various types, where the metalevel is used to determine which types are involved, namely those used in the model at hand.
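The recursive equality test and the set operators described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the tool: the `Construct` class, the reference name `owner`, and the placeholder property values are hypothetical, and removing duplicates in `union` is a choice made here (the text only says that the constructs of both models are included).

```python
from dataclasses import dataclass, field


@dataclass
class Construct:
    """A construct of a model (hypothetical in-memory layout)."""
    metaconstruct: str
    properties: dict = field(default_factory=dict)
    references: dict = field(default_factory=dict)  # reference name -> Construct


def equal(c1, c2):
    """Same metaconstruct, same property values, and pairwise equal referenced
    constructs. The recursion terminates because the reference graph of the
    supermodel is acyclic."""
    if c1.metaconstruct != c2.metaconstruct:
        return False
    if c1.properties != c2.properties:
        return False
    if c1.references.keys() != c2.references.keys():
        return False
    return all(equal(c1.references[k], c2.references[k]) for k in c1.references)


def difference(m1, m2):
    """Constructs of m1 with no equal construct in m2."""
    return [c for c in m1 if not any(equal(c, d) for d in m2)]


def intersection(m1, m2):
    """Constructs of m1 that have an equal construct in m2."""
    return [c for c in m1 if any(equal(c, d) for d in m2)]


def union(m1, m2):
    """Constructs of both models; equal constructs are kept once (assumption)."""
    return list(m1) + difference(m2, m1)


# Two toy models; the property values are placeholders.
relational = [
    Construct("Aggregation"),
    Construct("Lexical", {"isNullable": True}, {"owner": Construct("Aggregation")}),
]
object_relational = [
    Construct("Aggregation"),
    Construct("Abstract"),
    Construct("Lexical", {"isNullable": True}, {"owner": Construct("Aggregation")}),
    Construct("Lexical", {"isNullable": True}, {"owner": Construct("Abstract")}),
]

extra = difference(object_relational, relational)
print([c.metaconstruct for c in extra])  # ['Abstract', 'Lexical']
```

Here the two Lexical constructs of the second model differ only in the construct they refer to, which is why the recursive comparison of references matters.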
6 Reporting
In this section we focus on another interesting application of our approach, namely the possibility of producing reports for models and schemas, again in a manner that is both model-independent and model-aware. Reports can be rendered as detailed textual documentation of data organization, in a readable and machine-processable way, or as diagrams in a graphical user interface. Again, this is possible because of the supermodel: we visualize supermodel constructs together with their properties, and relate them to each other by means of their references. In this way, we could obtain a “flat” report of a model, which does not distinguish between types of references; so, for example, the references between a ForeignKey and the two Aggregations involved in it would be represented in the same way as a reference from a Lexical towards an Abstract. This is clearly not satisfactory. The core idea is to classify the references in two classes: strong and weak. Instances of constructs related by means of a strong reference (e.g. an Abstract with its Lexicals) are presented together, while those having a weak relationship (e.g.
a ForeignKey with the Aggregations involved in it) are presented in different elements. In rendering reports as text, we adopt the XML format. The main advantage of XML reports is that they are both self-documenting and machine-processable if needed. Constructs and their instances can be presented according to a partial order on the constructs, which can be found since, as we already said in the previous section, the graph of the supermodel (i.e. with constructs as nodes and references between constructs as edges) is acyclic. As we said in Section 2, a schema (as well as a model) is completely represented by the set of its constructs. Hence, a report for a schema would include a set of construct elements. In order to produce a report for a schema S, we can consider its constructs following a total order, C1, C2, ..., Cn, for supermodel constructs (obtained by serializing a partial order of them). For each construct Ci, we consider its occurrences in S, and for each of them not yet inserted in the report, we add a construct element named Ci with all its properties as XML attributes. Let us consider an occurrence oij of Ci. If oij is pointed to by any strong reference, we add a set of component elements nested in the corresponding construct element: the set has a component element for each occurrence of a construct with a strong reference toward oij. If oij has any weak reference towards another occurrence of a construct, we add a set of reference elements: each element of this set corresponds to a weak reference and has the OID and name properties of the pointed occurrence as XML attributes. As an example, the textual report of the ER schema of Figure 4 would be as follows:
<schema name="ERsimple" model="binaryER">
  <ER-Entity OID="e1" name="Employee">
    <ER-AttributeOfEntity OID="a1" name="EN" isKey="true" type="int"/>
    <ER-AttributeOfEntity OID="a2" name="Name" isKey="false" type="string"/>
  </ER-Entity>
  <ER-Entity OID="e2" name="Project">
    ...
  </ER-Entity>
  <ER-BinaryRelationship OID="r1" name="Membership" isOptional1="false" isFunctional1="false" ...>
    <entity1 OID="e1" name="Employee"/>
    <entity2 OID="e2" name="Project"/>
  </ER-BinaryRelationship>
</schema>
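The report-generation procedure described above (a total order on constructs, nesting along strong references, separate reference elements for weak ones) can be sketched with Python's standard `xml.etree.ElementTree` module. The record layout (`strong`, `weak`, `props`), the function name `build_report`, and the sample data are assumptions made for illustration; the attributes of Project, elided in the example above, are simply left out.

```python
import xml.etree.ElementTree as ET

# Occurrences of constructs in the schema. "strong" is the OID of the
# occurrence this one belongs to (or None); "weak" maps the name of a
# reference element to the OID of the occurrence it points to.
occurrences = [
    {"construct": "ER-Entity", "oid": "e1", "name": "Employee",
     "props": {}, "strong": None, "weak": {}},
    {"construct": "ER-Entity", "oid": "e2", "name": "Project",
     "props": {}, "strong": None, "weak": {}},
    {"construct": "ER-AttributeOfEntity", "oid": "a1", "name": "EN",
     "props": {"isKey": "true", "type": "int"}, "strong": "e1", "weak": {}},
    {"construct": "ER-AttributeOfEntity", "oid": "a2", "name": "Name",
     "props": {"isKey": "false", "type": "string"}, "strong": "e1", "weak": {}},
    {"construct": "ER-BinaryRelationship", "oid": "r1", "name": "Membership",
     "props": {"isOptional1": "false", "isFunctional1": "false"},
     "strong": None, "weak": {"entity1": "e1", "entity2": "e2"}},
]

# A total order on the constructs (a serialization of the partial order).
ORDER = ["ER-Entity", "ER-AttributeOfEntity", "ER-BinaryRelationship"]


def build_report(schema_name, model, occurrences, order):
    by_oid = {o["oid"]: o for o in occurrences}
    root = ET.Element("schema", {"name": schema_name, "model": model})
    inserted = set()

    def emit(parent, occ):
        attrs = {"OID": occ["oid"], "name": occ["name"], **occ["props"]}
        element = ET.SubElement(parent, occ["construct"], attrs)
        inserted.add(occ["oid"])
        # Strong references: nest the components inside this element.
        for other in occurrences:
            if other["strong"] == occ["oid"]:
                emit(element, other)
        # Weak references: one reference element with OID and name each.
        for ref_name, target in occ["weak"].items():
            ET.SubElement(element, ref_name,
                          {"OID": target, "name": by_oid[target]["name"]})

    for construct in order:
        for occ in occurrences:
            if occ["construct"] == construct and occ["oid"] not in inserted:
                emit(root, occ)
    return root


report = build_report("ERsimple", "binaryER", occurrences, ORDER)
print(ET.tostring(report, encoding="unicode"))
```

Because attributes are emitted while their owning entity is processed, they are already marked as inserted when their own construct comes up in the total order, so each occurrence appears in the report exactly once.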
As we already said, a second option for report rendering is through a visual graph. A few examples, for different models, are shown in Figures 21, 22, and 23.
Fig. 21. An ER schema
Fig. 22. An OO schema
Fig. 23. An XML-Schema
The rationale is the same as for textual reports:
– visualization is model-independent, as it is defined for all schemas of all models in the same way: strong references lead to embedding the “component” construct within the “containing” one, whereas weak references lead to separate graphical objects, connected by means of arrows;
– visualization is model-aware, in two senses: first of all, as usual, the specific features of each model are taken into account; second, and more important, for each family of models it is possible to associate a specific shape with each construct, thus following the usual representation for the model (see for example the usual notation for relationships in the ER model in Figure 21).
An extra feature of the graphical visualization is the possibility to represent instances of schemas also by means of a “relational” representation that follows straightforwardly from our construct-based modeling.
7 Conclusions
We have shown how a metamodel approach can be the basis for a number of model-generic and model-aware techniques for the solution of interesting problems. We have shown a dictionary we use to store our schemas and models, and a specific supermodel (a data model that generalizes all models of interest modulo construct renaming). This is the basis for the specification and implementation of interesting high-level operations, such as schema translation as well as
set-theoretic union and difference. Another interesting application is the development of generic visualization and reporting features.
Acknowledgement We would like to thank Phil Bernstein for many useful discussions during the preliminary development of this work.
References
1. Allen, F.W., Loomis, M.E.S., Mannino, M.V.: The integrated dictionary/directory system. ACM Comput. Surv. 14(2), 245–286 (1982)
2. Atzeni, P., Bellomarini, L., Bugiotti, F., Gianforme, G.: From schema and model translation to a model management system. In: Gray, A., Jeffery, K., Shao, J. (eds.) BNCOD 2008. LNCS, vol. 5071, pp. 227–240. Springer, Heidelberg (2008)
3. Atzeni, P., Cappellari, P., Bernstein, P.A.: A multilevel dictionary for model management. In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, Ó. (eds.) ER 2005. LNCS, vol. 3716, pp. 160–175. Springer, Heidelberg (2005)
4. Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P.A., Gianforme, G.: Model-independent schema translation. VLDB J. 17(6), 1347–1370 (2008)
5. Atzeni, P., Del Nostro, P.: Management of heterogeneity in the semantic web. In: ICDE Workshops, p. 60. IEEE Computer Society, Los Alamitos (2006)
6. Atzeni, P., Paolozzi, S., Nostro, P.D.: Ontologies and databases: Going back and forth. In: ODBIS (VLDB Workshop), pp. 9–16 (2008)
7. Atzeni, P., Torlone, R.: A metamodel approach for the management of multiple models and translation of schemes. Information Systems 18(6), 349–362 (1993)
8. Atzeni, P., Torlone, R.: Management of multiple models in an extensible database design tool. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 79–95. Springer, Heidelberg (1996)
9. Batini, C., Battista, G.D., Santucci, G.: Structuring primitives for a dictionary of entity relationship data schemas. IEEE Trans. Software Eng. 19(4), 344–365 (1993)
10. Bernstein, P., Bergstraesser, T., Carlson, J., Pal, S., Sanders, P., Shutt, D.: Microsoft repository version 2 and the open information model. Information Systems 22(4), 71–98 (1999)
11. Bernstein, P.A.: Applying model management to classical meta data problems. In: CIDR Conference, pp. 209–220 (2003)
12. Bernstein, P.A., Halevy, A.Y., Pottinger, R.: A vision of management of complex models. SIGMOD Record 29(4), 55–63 (2000)
13. Bernstein, P.A., Melnik, S.: Model management 2.0: manipulating richer mappings. In: SIGMOD Conference, pp. 1–12 (2007)
14. Bézivin, J., Breton, E., Dupé, G., Valduriez, P.: The ATL transformation-based model management framework. Research Report 03.08, IRIN, Université de Nantes (2003)
15. Claypool, K.T., Rundensteiner, E.A.: Sangam: A framework for modeling heterogeneous database transformations. In: ICEIS (1), pp. 219–224 (2003)
16. Claypool, K.T., Rundensteiner, E.A., Zhang, X., Su, H., Kuno, H.A., Lee, W.-C., Mitchell, G.: Sangam - a solution to support multiple data models, their mappings and maintenance. In: SIGMOD Conference, p. 606 (2001)
17. De Virgilio, R., Torlone, R.: Modeling heterogeneous context information in adaptive web based applications. In: ICWE Conference, pp. 56–63. ACM, New York (2006)
18. Gianforme, G., Virgilio, R.D., Paolozzi, S., Nostro, P.D., Avola, D.: A novel approach for practical semantic web data management. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part II. LNCS, vol. 5178, pp. 650–655. Springer, Heidelberg (2008)
19. Hsu, C., Bouziane, M., Rattner, L., Yee, L.: Information resources management in heterogeneous, distributed environments: A metadatabase approach. IEEE Trans. Software Eng. 17(6), 604–625 (1991)
20. Hull, R., King, R.: Semantic database modelling: Survey, applications and research issues. ACM Computing Surveys 19(3), 201–260 (1987)
21. Kahn, B.K., Lumsden, E.W.: A user-oriented framework for data dictionary systems. DATA BASE 15(1), 28–36 (1983)
22. McGee, W.C.: Generalization: Key to successful electronic data processing. J. ACM 6(1), 1–23 (1959)
23. Melnik, S.: Generic Model Management: Concepts and Algorithms. Springer, Heidelberg (2004)
24. Mork, P., Bernstein, P.A., Melnik, S.: Teaching a schema translator to produce O/R views. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 102–119. Springer, Heidelberg (2007)
25. Paolozzi, S., Atzeni, P.: Interoperability for semantic annotations. In: DEXA Workshops, pp. 445–449. IEEE Computer Society, Los Alamitos (2007)
26. Rahm, E., Do, H.: On metadata interoperability in data warehouses. Technical report, University of Leipzig (2000)
27. Soley, R., The OMG Staff Strategy Group: Model driven architecture. White paper, draft 3.2, Object Management Group (November 2000)
28. Song, G., Zhang, K., Wong, R.: Model management though graph transformations. In: IEEE Symposium on Visual Languages and Human Centric Computing, pp. 75–82 (2004)
Data Mining Using Graphics Processing Units
Christian Böhm1, Robert Noll1, Claudia Plant2, Bianca Wackersreuther1, and Andrew Zherdin2
1 University of Munich, Germany {boehm,noll,wackersreuther}@dbs.ifi.lmu.de
2 Technische Universität München, Germany {plant,zherdin}@lrz.tum.de
Abstract. During the last few years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors that not only support typical computer graphics tasks such as rendering of 3D scenarios but can also be used for general numeric and symbolic computation tasks such as simulation and optimization. As a major advantage, GPUs provide extremely high parallelism (with several hundred simple programmable processors) combined with a high bandwidth in memory transfer at low cost. In this paper, we propose several algorithms for computationally expensive data mining tasks like similarity search and clustering which are designed for the highly parallel environment of a GPU. We define a multidimensional index structure which is particularly suited to support similarity queries under the restricted programming model of a GPU, and define a similarity join method. Moreover, we define highly parallel algorithms for density-based and partitioning clustering. In an extensive experimental evaluation, we demonstrate the superiority of our algorithms running on the GPU over their conventional counterparts on the CPU.
1 Introduction
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 63–90, 2009. © Springer-Verlag Berlin Heidelberg 2009
In recent years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors supporting the CPU in various ways. Graphics applications such as realistic 3D games are computationally demanding and require a large number of complex algebraic operations for each update of the display image. Therefore, today’s graphics hardware contains a large number of programmable processors which are optimized to cope with this high workload of vector, matrix, and symbolic computations in a highly parallel way. In terms of peak performance, the graphics hardware has outperformed state-of-the-art multi-core CPUs by a large margin. The amount of scientific data is approximately doubling every year [26]. To keep pace with the exponential data explosion, there is a great effort in many research communities such as life sciences [20,22], mechanical simulation [27], cryptographic computing [2], or machine learning [7] to use the computational capabilities of GPUs even for purposes which are not at all related to computer
graphics. The corresponding research area is called General Processing-Graphics Processing Units (GP-GPU). In this paper, we focus on exploiting the computational power of GPUs for data mining. Data mining consists of ‘applying data analysis algorithms, that, under acceptable efficiency limitations, produce a particular enumeration of patterns over the data’ [9]. The exponential increase in data does not necessarily come along with a correspondingly large gain in knowledge. The evolving research area of data mining proposes techniques to support transforming the raw data into useful knowledge. Data mining has a wide range of scientific and commercial applications, for example in neuroscience, astronomy, biology, marketing, and fraud detection. The basic data mining tasks include classification, regression, clustering, outlier identification, as well as frequent itemset and association rule mining. Classification and regression are called supervised data mining tasks, because the aim is to learn a model for predicting a predefined variable. The other techniques are called unsupervised, because the user does not previously identify any of the variables to be learned. Instead, the algorithms have to automatically identify any interesting regularities and patterns in the data. Clustering is probably the most common unsupervised data mining task. The goal of clustering is to find a natural grouping of a data set such that data objects assigned to a common group, called a cluster, are as similar as possible and objects assigned to different clusters differ as much as possible. Consider for example the set of objects visualized in Figure 1. A natural grouping would assign the objects to two different clusters. Two outliers that do not fit well into any of the clusters should be left unassigned. Like most data mining algorithms, the definition of clustering requires specifying some notion of similarity among objects.
In most cases, the similarity is expressed in a vector space, called the feature space. In Figure 1, we indicate the similarity among objects by representing each object by a vector in two-dimensional space. Characterizing numerical properties (from a continuous space) are extracted from the objects and combined into a vector x ∈ ℝ^d, where d is the dimensionality of the space, i.e. the number of properties which have been extracted. For instance, Figure 2 shows a feature transformation where the object is a certain kind of orchid. The phenotype of orchids can be characterized using the lengths and widths of the two petal and the three sepal leaves, the form (curvature) of the labellum, and the colors of the different compartments. In this example, 5
Fig. 1. Example for Clustering
Fig. 2. The Feature Transformation
features are measured, and each object is thus transformed into a 5-dimensional vector space. To measure the similarity between two feature vectors, usually a distance function like the Euclidean metric is used. To search in a large database for objects which are similar to a given query object (for instance, to search for a number k of nearest neighbors, or for those objects having a distance that does not exceed a threshold ε), multidimensional index structures are usually applied. By a hierarchical organization of the data set, the search is made efficient. The well-known indexing methods (e.g. the R-tree [13]) are designed and optimized for secondary storage (hard disks) or for main memory. For use on the GPU, specialized indexing methods are required because of the highly parallel but restricted programming environment. In this paper, we propose such an indexing method. It has been shown that many data mining algorithms, including clustering, can be supported by a powerful database primitive: the similarity join [3]. This operator yields as its result all pairs of objects in the database having a distance of less than some predefined threshold ε. To show that also the more complex basic operations of similarity search and data mining can be supported by novel parallel algorithms specially designed for the GPU, we propose two algorithms for the similarity join, one being a nested block loop join, and one being an indexed loop join utilizing the aforementioned indexing structure. Finally, to demonstrate that highly complex data mining tasks can be efficiently implemented using novel parallel algorithms, we propose parallel versions of two widespread clustering algorithms. We demonstrate how the density-based clustering algorithm DBSCAN [8] can be effectively supported by the parallel similarity join.
In addition, we introduce a parallel version of K-means clustering [21], which follows an algorithmic paradigm very different from density-based clustering. We demonstrate the superiority of our approaches over the corresponding sequential algorithms on the CPU. All algorithms for the GPU have been implemented using NVIDIA’s Compute Unified Device Architecture (CUDA) technology [1]. Vendors of graphics hardware have recently anticipated the trend towards general purpose computing on GPUs and developed libraries, pre-compilers, and application programming interfaces to support GP-GPU applications. CUDA offers a programming interface for the
C programming language in which both the host program and the kernel functions are assembled in a single program [1]. The host program is the main program, executed on the CPU. In contrast, the so-called kernel functions are executed in a massively parallel fashion on the (hundreds of) processors in the GPU. An analogous technique is also offered by ATI using the brand names Close-to-Metal, Stream SDK, and Brook-GP. The remainder of this paper is organized as follows: Section 2 reviews the related work in GPU processing in general with particular focus on database management and data mining. Section 3 explains the graphics hardware and the CUDA programming model. Section 4 develops a multidimensional index structure for similarity queries on the GPU. Section 5 presents the non-indexed and indexed join on graphics hardware. Section 6 and Section 7 are dedicated to GPU-capable algorithms for density-based and partitioning clustering. Section 8 contains an extensive experimental evaluation of our techniques, and Section 9 summarizes the paper and provides directions for future research.
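To make the similarity join introduced in this section concrete, the following Python sketch computes it sequentially with two nested loops and the Euclidean metric. It is only a CPU baseline written for illustration, not one of the parallel GPU algorithms proposed in this paper; the function names and the convention of reporting each pair once as (i, j) with i < j are assumptions.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def similarity_join(points, eps):
    """All index pairs (i, j), i < j, whose distance is less than eps."""
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if euclidean(points[i], points[j]) < eps]


points = [(0.0, 0.0), (0.3, 0.4), (5.0, 5.0), (5.1, 5.0)]
print(similarity_join(points, 1.0))  # [(0, 1), (2, 3)]
```

The number of distance computations grows quadratically with the number of objects, which is the cost that index structures and parallel execution aim to reduce.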
2 Related Work
In this section, we survey the related research in general purpose computations using GPUs with particular focus on database management and data mining.
General Processing-Graphics Processing Units. Theoretically, GPUs are capable of performing any computation that can be transformed to the model of parallelism and that suits the specific architecture of the GPU. This model has been exploited in multiple research areas. Liu et al. [20] present a new approach to high performance molecular dynamics simulations on graphics processing units, using CUDA to design and implement a new parallel algorithm. Their results indicate a significant performance improvement on an NVIDIA GeForce 8800 GTX graphics card over sequential processing on the CPU. Another paper on computations from the field of life sciences has been published by Manavski and Valle [22]. The authors propose an extremely fast solution of the Smith-Waterman algorithm, a procedure for searching for similarities in protein and DNA databases, running on the GPU and implemented in the CUDA programming environment. Significant speedups are achieved on a workstation running two GeForce 8800 GTX cards. Another widespread application area that uses the processing power of the GPU is mechanical simulation. One example is the work by Tascora et al. [27], which presents a novel method for solving large cone complementarity problems by means of a fixed-point iteration algorithm, in the context of simulating the frictional contact dynamics of large systems of rigid bodies. Like the previously reviewed approaches in the field of life sciences, the algorithm is also implemented in CUDA for a GeForce 8800 GTX to simulate the dynamics of complex systems. To demonstrate the nearly boundless possibilities of performing computations on the GPU, we introduce one more example, namely cryptographic computing [2]. In this paper, the authors present a record-breaking performance for
the elliptic curve method (ECM) of integer factorization. The speedup takes advantage of two NVIDIA GTX 295 graphics cards, using a new ECM implementation relying on new parallel addition formulas and functions that are made available by CUDA.
Database Management Using GPUs. Some papers propose techniques to speed up relational database operations on the GPU. In [14], some algorithms for the relational join on an NVIDIA G80 GPU using CUDA are presented. Two recent papers [19,4] address the topic of the similarity join in feature space, which determines all pairs of objects from two different sets R and S fulfilling a certain join predicate. The most common join predicate is the ε-join, which determines all pairs of objects having a distance of less than a predefined threshold ε. The authors of [19] propose an algorithm based on the concept of space filling curves, e.g. the z-order, for pruning of the search space, running on an NVIDIA GeForce 8800 GTX using the CUDA toolkit. The z-order of a set of objects can be determined very efficiently on the GPU by highly parallelized sorting. Their algorithm operates on a set of z-lists of different granularity for efficient pruning. However, since all dimensions are treated equally, performance degrades in higher dimensions. In addition, due to uniform space partitioning in all areas of the data space, space filling curves are not suitable for clustered data. An approach that overcomes this kind of problem is presented in [4]. Here the authors parallelize the baseline technique underlying any join operation with an arbitrary join predicate, namely the nested loop join (NLJ), a powerful database primitive that can be used to support many applications including data mining. All experiments are performed on NVIDIA 8500GT graphics processors using a CUDA-supported implementation. Govindaraju et al. [10,11] demonstrate that important building blocks for query processing in databases, e.g.
sorting, conjunctive selections, aggregations, and semi-linear queries, can be significantly sped up by the use of GPUs.

Data Mining Using GPUs. Recent approaches to data mining on the GPU include two papers on clustering that do not use CUDA. In [6], a clustering approach on an NVIDIA GeForce 6800 GT graphics card is presented that extends the basic idea of K-means by calculating the distances from a single input centroid to all objects at one time, which can be done simultaneously on the GPU. Thus the authors are able to exploit the high computational power and pipeline of GPUs, especially for core operations like distance computations and comparisons. An additional efficient method designed to execute clustering on data streams confirms the wide practical applicability of clustering on the GPU. The paper [25] parallelizes the K-means algorithm for the GPU by using multi-pass rendering and multi-shader program constants. The implementation on NVIDIA 5900 and NVIDIA 8500 graphics processors achieves significant performance gains across various data sizes and cluster sizes. However, the algorithms of both papers are not portable to different GPU models in the way CUDA-based approaches are.
3 Architecture of the GPU
Graphics Processing Units (GPUs) of the newest generation are powerful coprocessors, designed not only for games and other graphics-intensive applications but also for general-purpose computing (in this case, we call them GP-GPUs). From the hardware perspective, a GPU consists of a number of multiprocessors, each of which consists of a set of simple processors that operate in a SIMD fashion, i.e. all processors of one multiprocessor execute the same arithmetic or logic operation at the same time in a synchronized way, potentially operating on different data. For instance, the GPU of the newest generation, GT200 (e.g. on the graphics card GeForce GTX 280), has 30 multiprocessors, each consisting of 8 SIMD-processors, for a total of 240 processors inside one GPU. The computational power sums up to a peak performance of 933 GFLOP/s.

3.1 The Memory Model
Apart from some memory units with a special purpose in the context of graphics processing (e.g. texture memory), we have three important types of memory, as visualized in Figure 3. The shared memory (SM) is a memory unit with fast access (at the speed of register access, i.e. no delay). SM is shared among all processors of a multiprocessor. It can be used for local variables but also to exchange information between threads on different processors of the same multiprocessor. It cannot be used for information that is shared among threads on different multiprocessors. SM is fast but very limited in capacity (16 KBytes per multiprocessor). The second kind of memory is the so-called device memory (DM), which is the actual video RAM of the graphics card (also used for frame buffers etc.). DM is physically located on the graphics card (but not inside the GPU) and is significantly larger than SM (typically up to some hundreds of MBytes), but also significantly slower. In particular, memory accesses to DM cause a typical latency delay of 400-600 clock cycles (on the GT200 GPU, corresponding to 300-500 ns). The bandwidth for transferring data between DM and GPU (141.7 GB/s on GT200) is higher than that between CPU and main memory (about 10 GB/s on current CPUs). DM can be used to share information between threads
Fig. 3. Architecture of a GPU
on different multiprocessors. If some threads schedule memory accesses to contiguous addresses, these accesses can be coalesced, i.e. combined to improve the access speed. A typical cooperation pattern for DM and SM is to copy the required information from DM to SM simultaneously from different threads (if possible, using coalesced accesses), then to let each thread compute the result on SM, and finally to copy the result back to DM. The third kind of memory considered here is the main memory, which is not part of the graphics card. The GPU has no access to the address space of the CPU. The CPU can only write to or read from DM using specialized API functions. In this case, the data packets have to be transferred via the Front Side Bus and the PCI-Express Bus. The bandwidth of these bus systems is strictly limited, and therefore these special transfer operations are considerably more expensive than direct accesses of the GPU to DM or direct accesses of the CPU to main memory.

3.2 The Programming Model
The basis of the programming model of GPUs is the thread. Threads are lightweight processes which are easy to create and to synchronize. In contrast to CPU processes, neither the generation and termination of GPU threads nor context switches between different threads cause any considerable overhead. In typical applications, thousands or even millions of threads are created, for instance one thread per pixel in gaming applications. It is recommended to create a number of threads that is much higher than the number of available SIMD-processors, because context switches are also used to hide the latency delay of memory accesses: in particular, an access to the DM may cause a latency delay of 400-600 clock cycles, and during that time a multiprocessor may continue its work with other threads. The CUDA programming library [1] contains API functions to create a large number of threads on the GPU, each of which executes a function called a kernel function. The kernel functions (which are executed in parallel on the GPU) as well as the host program (which is executed sequentially on the CPU) are defined in an extended syntax of the C programming language. The kernel functions are restricted with respect to functionality (e.g. no recursion). On GPUs, the threads do not even have an individual instruction pointer; instead, an instruction pointer is shared by several threads. For this purpose, threads are grouped into so-called warps (typically 32 threads per warp). One warp is processed simultaneously on the 8 processors of a single multiprocessor (SIMD) using 4-fold pipelining (totalling 32 threads executed fully synchronously). If not all threads in a warp follow the same execution path, the different execution paths are executed in a serialized way. The number (8) of SIMD-processors per multiprocessor as well as the concept of 4-fold pipelining is constant on all current CUDA-capable GPUs. Multiple warps are grouped into thread groups (TG). It is recommended [1] to use multiples of 64 threads per TG. The different warps in a TG (as well as different warps of different TGs) are executed independently. The threads in one thread group use the same shared memory and may thus communicate and
share data via the SM. The threads in one thread group can be synchronized (i.e. all threads wait until all warps of the same group have reached that point of execution). The latency delay of the DM can be hidden by scheduling other warps of the same or a different thread group whenever one warp waits for an access to DM. To allow switching between warps of different thread groups on a multiprocessor, it is recommended that each thread use only a small fraction of the shared memory and registers of the multiprocessor [1].

3.3 Atomic Operations
In order to synchronize parallel processes and to ensure the correctness of parallel algorithms, CUDA offers atomic operations such as increment, decrement, or exchange (to name only those of the large number of atomic operations that are needed by our algorithms). Most of the atomic operations work on integer data types in device memory. However, the newest version of CUDA (Compute Capability 1.3 of the GPU GT200) even allows atomic operations in SM. If, for instance, some parallel processes share a list as a common resource with concurrent reading from and writing to the list, it may be necessary to (atomically) increment a counter for the number of list entries (which in most cases is also used as the pointer to the first free list element). Atomicity implies in this case the following two requirements: if two or more threads increment the list counter, then (1) the value of the counter after all concurrent increments must equal the value before plus the number of concurrent increment operations, and (2) each of the concurrent threads must obtain a separate result of the increment operation, which indicates the index of the empty list element to which the thread can write its information. Therefore, most atomic operations return a result after their execution. For instance, the operation atomicInc has two parameters: the address of the counter to be incremented, and an optional threshold value which must not be exceeded by the operation. The operation works as follows: the counter value at the address is read and incremented (provided that the threshold is not exceeded). Finally, the old value of the counter (before incrementing) is returned to the kernel method which invoked atomicInc. If two or more threads (of the same or different thread groups) call atomic operations simultaneously, the result of these operations is that of an arbitrary sequentialization of the concurrent operations. The operation atomicDec works in an analogous way.
The operation atomicCAS performs a Compare-and-Swap operation. It has three parameters, an address, a compare value and a swap value. If the value at the address equals the compare value, the value at the address is replaced by the swap value. In every case, the old value at the address (before swapping) is returned to the invoking kernel method.
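The two atomicity requirements for a shared list counter can be illustrated with a small CPU-side sketch in Python. This is not CUDA code: the class `AtomicInt` and the function `concurrent_append` are hypothetical names introduced here, a lock stands in for the hardware atomicity, and the optional threshold of atomicInc is deliberately left out.

```python
import threading


class AtomicInt:
    """Toy stand-in for an integer cell in device memory (hypothetical helper)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def atomic_inc(self):
        # Read, add one, and return the OLD value as one indivisible step.
        with self._lock:
            old = self._value
            self._value = old + 1
            return old

    def atomic_cas(self, compare, swap):
        # Replace the value by `swap` only if it equals `compare`;
        # the old value is returned in either case.
        with self._lock:
            old = self._value
            if old == compare:
                self._value = swap
            return old

    def load(self):
        with self._lock:
            return self._value


def concurrent_append(num_threads=8, items_per_thread=200):
    """Each thread reserves list slots via atomic_inc and writes into them."""
    counter = AtomicInt(0)
    buffer = [None] * (num_threads * items_per_thread)

    def worker(tid):
        for j in range(items_per_thread):
            slot = counter.atomic_inc()  # unique index of a free list element
            buffer[slot] = (tid, j)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.load(), buffer
```

After `concurrent_append` returns, the counter equals the total number of increments (requirement 1) and every buffer slot has been written exactly once (requirement 2), whatever order the threads ran in.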
4 An Index Structure for Similarity Queries on GPU
Many data mining algorithms for problems like classification, regression, clustering, and outlier detection use similarity queries as a building block. In many
cases, these similarity queries even represent the largest part of the computational effort of the data mining task, and therefore efficiency is of high importance here. Similarity queries are defined as follows: given is a database $D = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ of $n$ vectors from a $d$-dimensional space, and a query object $q \in \mathbb{R}^d$. We distinguish between two different kinds of similarity queries, range queries and nearest neighbor queries:

Definition 1 (Range Query). Let $\varepsilon \in \mathbb{R}_0^+$ be a threshold value. The result of the range query is the set of the following objects:

$$N_\varepsilon(q) = \{x \in D : \|x - q\| \le \varepsilon\},$$

where $\|x - q\|$ is an arbitrary distance function between two feature vectors $x$ and $q$, e.g. the Euclidean distance.

Definition 2 (Nearest Neighbor Query). The result of a nearest neighbor query is the set:

$$NN(q) = \{x \in D : \forall x' \in D : \|x - q\| \le \|x' - q\|\}.$$

Definition 2 can also be generalized to the case of the k-nearest neighbor query ($NN_k(q)$), where a number $k$ of nearest neighbors of the query object $q$ is retrieved. The performance of similarity queries can be greatly improved if a multidimensional index structure supporting the similarity search is available. Our index structure needs to be traversed in parallel for many search objects using the kernel function. Since kernel functions do not allow any recursion, and since they need to keep the storage overhead of local variables etc. small, the index structure must be kept very simple as well. To achieve a good compromise between simplicity and selectivity of the index, we propose a data partitioning method with a constant number of directory levels. The first level partitions the data set $D$ according to the first dimension of the data space, the second level according to the second dimension, and so on. Therefore, before starting the actual data mining method, some transformation technique should be applied which guarantees a high selectivity in the first dimensions (e.g. Principal Component Analysis, Fast Fourier Transform, Discrete Wavelet Transform, etc.). Figure 4 shows a simple, 2-dimensional example of a 2-level directory (plus the root node, which is considered as level 0), similar to [16,18]. The fanout of each node is 8. In our experiments in Section 8, we used a 3-level directory with fanout 16. Before starting the actual data mining task, our simple index structure must be constructed in a bottom-up way by fractionated sorting of the data: first, the data set is sorted according to the first dimension and partitioned into the specified number of quantile partitions. Then, each of the partitions is sorted individually according to the second dimension, and so on. The boundaries are stored using simple arrays which can be easily accessed in the subsequent kernel functions.
Fig. 4. Index Structure for GPU

In principle, this index construction can already be done on the GPU, because efficient sorting methods for GPUs have been proposed [10]. Since bottom-up index construction is typically not very costly compared to the data mining algorithm, our method performs this preprocessing step on the CPU. When transferring the data set from main memory into device memory in the initialization step of the data mining method, our new method additionally has to transfer the directory (i.e. the arrays in which the coordinates of the page boundaries are stored). Compared to the complete data set, the directory is always small. The most important change in the kernel functions of our data mining methods concerns the determination of the $\varepsilon$-neighborhood of some given seed object $q$, which is done by exploiting SIMD-parallelism inside a multiprocessor. In the non-indexed version, this is done by a set of threads (inside a thread group), each of which iterates over a different part of the (complete) data set. In the indexed version, one of the threads iterates in a set of nested loops (one loop for each level of the directory) over those nodes of the index structure which represent regions of the data space that are intersected by the neighborhood sphere $N_\varepsilon(q)$. In the innermost loop, we have one set of points (corresponding to a data page of the index structure) which is processed by exploiting SIMD-parallelism, as in the non-indexed version.
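As a rough sequential illustration of the fractionated sorting and of the nested-loop traversal described above, the following Python sketch builds a two-level directory and answers a range query with it. The function names, the tuple layout of the directory and the choice of two levels are assumptions made for this example; the paper's implementation uses flat boundary arrays and runs the traversal inside a CUDA kernel.

```python
import math


def split_quantiles(items, parts):
    """Split a sorted list into `parts` chunks of (almost) equal size."""
    n = len(items)
    bounds = [(i * n) // parts for i in range(parts + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(parts) if bounds[i] < bounds[i + 1]]


def build_index(points, fanout):
    """Two-level directory: level 1 partitions dimension 0, level 2 dimension 1."""
    directory = []
    for part in split_quantiles(sorted(points, key=lambda p: p[0]), fanout):
        pages = []
        for page in split_quantiles(sorted(part, key=lambda p: p[1]), fanout):
            pages.append((page[0][1], page[-1][1], page))    # (low, high, points) in dim 1
        directory.append((part[0][0], part[-1][0], pages))   # (low, high, pages) in dim 0
    return directory


def range_query(directory, q, eps):
    """All indexed points x with dist(x, q) <= eps, using one loop per directory level."""
    result = []
    for lo0, hi0, pages in directory:          # level-1 nodes
        if hi0 < q[0] - eps or lo0 > q[0] + eps:
            continue                           # node cannot intersect the eps-sphere
        for lo1, hi1, page in pages:           # level-2 nodes (data pages)
            if hi1 < q[1] - eps or lo1 > q[1] + eps:
                continue
            for x in page:                     # refinement on the data page
                if math.dist(x, q) <= eps:
                    result.append(x)
    return result
```

The pruning test is conservative: a page is skipped only if its interval in the tested dimension lies entirely outside the interval spanned by the query coordinate plus or minus eps, so no true neighbor can be lost.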
5 The Similarity Join
The similarity join is a basic operation of a database system designed for similarity search and data mining on feature vectors. In such applications, we are given a database $D$ of objects which are associated with a vector from a multidimensional space, the feature space. The similarity join determines pairs of objects which are similar to each other. The most widespread form is the $\varepsilon$-join, which determines those pairs from $D \times D$ which have a Euclidean distance of no more than a user-defined radius $\varepsilon$:

Definition 3 (Similarity Join). Let $D \subseteq \mathbb{R}^d$ be a set of feature vectors of a $d$-dimensional vector space and $\varepsilon \in \mathbb{R}_0^+$ be a threshold. Then the similarity join is the following set of pairs:

$$\mathrm{SimJoin}(D, \varepsilon) = \{(x, x') \in (D \times D) : \|x - x'\| \le \varepsilon\}.$$
If $x$ and $x'$ are elements of the same set, the join is a similarity self-join. Most algorithms, including the method proposed in this paper, can also be generalized to the case of non-self-joins in a straightforward way. Algorithms for a similarity join with nearest neighbor predicates have also been proposed. The similarity join is a powerful building block for similarity search and data mining. It has been shown that important data mining methods such as clustering and classification can be based on the similarity join. Using a similarity join instead of single similarity queries can accelerate data mining algorithms by a high factor [3].

5.1 Similarity Join without Index Support
The baseline technique to process any join operation with an arbitrary join predicate is the nested loop join (NLJ), which performs two nested loops, each enumerating all points of the data set. For each pair of points, the distance is calculated and compared to $\varepsilon$. The pseudocode of the sequential version of NLJ is given in Figure 5.

algorithm sequentialNLJ(data set D)
  for each q ∈ D do              // outer loop
    for each x ∈ D do            // inner loop: search all points x which are similar to q
      if dist(x, q) ≤ ε then
        report (x, q) as a result pair or do some further processing on (x, q)
end
Fig. 5. Sequential Algorithm for the Nested Loop Join
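For reference, Figure 5 translates almost line by line into runnable Python. The function name `sequential_nlj` and the use of the Euclidean distance via `math.dist` are choices made for this sketch; points are assumed to be tuples of coordinates.

```python
import math


def sequential_nlj(data, eps):
    """Return all ordered pairs (x, q) with dist(x, q) <= eps, as in Figure 5."""
    result = []
    for q in data:          # outer loop
        for x in data:      # inner loop: search all points x which are similar to q
            if math.dist(x, q) <= eps:
                result.append((x, q))
    return result
```

Because both loops enumerate the complete data set, the result contains every point paired with itself and every qualifying pair in both orders.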
It is easily possible to parallelize the NLJ, e.g. by creating an individual thread for each iteration of the outer loop. The kernel function then contains the inner loop, the distance calculation and the comparison. During the complete run of the kernel function, the current point of the outer loop is constant, and we call this point the query point q of the thread, because the thread operates like a similarity query in which all database points with a distance of no more than $\varepsilon$ from q are searched. The query point q is always held in a register of the processor. Our GPU allows a truly parallel execution of a number m of incarnations of the outer loop, where m is the total number of ALUs of all multiprocessors (i.e. the warp size 32 times the number of multiprocessors). Moreover, all the different warps are processed in a quasi-parallel fashion, which makes it possible to operate on one warp of threads (which is ready to run) while another warp is blocked due to the latency delay of a DM access of one of its threads. The threads are grouped into thread groups, which share the SM. In our case, the SM is used in particular to physically store, for each thread group, the current point x of the inner loop. Therefore, a kernel function first copies the current point x from the DM into the SM and then determines the distance of x to the query point q. The threads of the same warp are running perfectly
simultaneously, i.e. if these threads are copying the same point from DM to SM, this needs to be done only once (but all threads of the warp have to wait until this relatively costly copy operation is performed). However, a thread group may (and should) consist of multiple warps. To ensure that the copy operation is only performed once per thread group, it is necessary to synchronize the threads of the thread group before and after the copy operation using the API function synchronize() (written as synchronizeThreadGroup() in Figure 6). This API function blocks all threads in the same TG until all other threads (of other warps) have reached the same point of execution. The pseudocode for this algorithm is presented in Figure 6.

algorithm GPUsimpleNLJ(data set D)          // host program executed on CPU
  deviceMem float D'[][] := D[][];          // allocate memory in DM for the data set D
  #threads := n;                            // number of points in D
  #threadsPerGroup := 64;
  startThreads(simpleNLJKernel, #threads, #threadsPerGroup);   // one thread per point
  waitForThreadsToFinish();
end.

kernel simpleNLJKernel(int threadID)
  register float q[] := D'[threadID][];     // copy the point from DM into the register
                                            // and use it as query point q
                                            // index is determined by the threadID
  for i := 0 ... n − 1 do                   // this used to be the inner loop in Figure 5
    synchronizeThreadGroup();
    shared float x[] := D'[i][];            // copy the current point x from DM to SM
    synchronizeThreadGroup();               // now all threads of the thread group can work with x
    if dist(x, q) ≤ ε then
      report (x, q) as a result pair using synchronized writing
      or do some further processing on (x, q) directly in kernel
end.
Fig. 6. Parallel Algorithm for the Nested Loop Join on the GPU
If the data set does not fit into DM, a simple partitioning strategy can be applied. It must be ensured that the potential join partners of an object are within the same partition as the object itself. Therefore, overlapping partitions of size $2 \cdot \varepsilon$ can be created.

5.2 An Indexed Parallel Similarity Join Algorithm on GPU
The performance of the NLJ can be greatly improved if an index structure is available as proposed in Section 4. On sequential processing architectures, the indexed NLJ leaves the outer loop unchanged. The inner loop is replaced by an index-based search retrieving candidates that may be join partners of the current object of the outer loop. The effort of finding these candidates and refining them is often orders of magnitude smaller than that of the non-indexed NLJ. When parallelizing the indexed NLJ for the GPU, we follow the same paradigm as in the last section and create an individual thread for each point of the outer loop. It is beneficial to performance if points having a small distance to each other are collected in the same warp and thread group, because for those points, similar paths in the index structure are relevant.
After index construction, we not only have a directory in which the points are organized in a way that facilitates search; the points are now also clustered in the array, i.e. points which have neighboring addresses are also likely to be close together in the data space (at least when projecting on the first few dimensions). Both effects are exploited by our join algorithm displayed in Figure 7.
algorithm GPUindexedJoin(data set D)
  deviceMem index idx := makeIndexAndSortData(D);        // changes ordering of data points
  int #threads := |D|, #threadsPerGroup := 64;
  for i = 1 ... (#threads/#threadsPerGroup) do
    deviceMem float blockbounds[i][] := calcBlockBounds(D, blockindex);
  deviceMem float D'[][] := D[][];
  startThreads(indexedJoinKernel, #threads, #threadsPerGroup);   // one thread per data point
  waitForThreadsToFinish();
end.

algorithm indexedJoinKernel(int threadID, int blockID)
  register float q[] := D'[threadID][];                  // copy the point from DM into the register
  shared float myblockbounds[] := blockbounds[blockID][];
  for xi := 0 ... indexsize.x do
    if IndexPageIntersectsBoundsDim1(idx, myblockbounds, xi) then
      for yi := 0 ... indexsize.y do
        if IndexPageIntersectsBoundsDim2(idx, myblockbounds, xi, yi) then
          for zi := 0 ... indexsize.z do
            if IndexPageIntersectsBoundsDim3(idx, myblockbounds, xi, yi, zi) then
              for w := 0 ... IndexPageSize do
                synchronizeThreadGroup();
                shared float p[] := GetPointFromIndexPage(idx, D', xi, yi, zi, w);
                synchronizeThreadGroup();
                if dist(p, q) ≤ ε then
                  report (p, q) as a result pair using synchronized writing
end.
Fig. 7. Algorithm for Similarity Join on GPU with Index Support
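The role of the block bounds in Figure 7 can be sketched sequentially in Python. The version below is a simplification under several assumptions: the directory has only one level (pages over the first dimension), the helper names `calc_block_bounds`, `page_intersects_bounds`, `chunks` and `indexed_self_join` are invented for this example, and the loop over the query points of a group stands in for the threads of one thread group.

```python
import math


def calc_block_bounds(block_points):
    """Per-dimension (min, max) of the query points handled by one thread group."""
    dims = range(len(block_points[0]))
    return [(min(p[d] for p in block_points), max(p[d] for p in block_points)) for d in dims]


def page_intersects_bounds(bounds, dim, page_lo, page_hi, eps):
    """True if the page interval in `dim` lies within eps of the block's bounding interval."""
    lo, hi = bounds[dim]
    return page_hi >= lo - eps and page_lo <= hi + eps


def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


def indexed_self_join(points, eps, page_size=4, group_size=4):
    data = sorted(points)                    # index construction: sort on dimension 0
    pages = chunks(data, page_size)          # one-level directory of consecutive runs
    result = []
    for group in chunks(data, group_size):   # one "thread group" per block of query points
        bounds = calc_block_bounds(group)
        for page in pages:
            if not page_intersects_bounds(bounds, 0, page[0][0], page[-1][0], eps):
                continue                     # page pruned for the whole group
            for q in group:                  # on the GPU: one thread per query point q
                for p in page:
                    if math.dist(p, q) <= eps:
                        result.append((p, q))
    return result
```

Since the data are sorted before they are split into groups, the query points of a group are close together in the first dimension, so the bounding interval stays narrow and most pages are pruned for the whole group at once.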
Instead of performing an outer loop as in a sequential indexed NLJ, our algorithm now generates a large number of threads: one thread for each iteration of the outer loop (i.e. for each query point q). Since the points in the array are clustered, the corresponding query points are close to each other, and the join partners of all query points in a thread group are likely to reside in the same branches of the index as well. Our kernel method now iterates over three loops, one loop for each index level, and determines for each partition whether the point is inside the partition or at least no more distant from its boundary than $\varepsilon$. The corresponding subnode is accessed if the corresponding partition is able to contain join partners of the current point of the thread. When considering the warps, which operate in a fully synchronized way, a node is accessed whenever at least one of the query points of the warp is close enough to (or inside) the corresponding partition. For both methods, the indexed and the non-indexed nested loop join on GPU, we need to address the question of how the resulting pairs are processed. Often, for example to support density-based clustering (cf. Section 6), it is sufficient to return a counter with the number of join partners. If the application requires to
report the pairs themselves, this is easily possible by a buffer in DM which can be copied to the CPU after the termination of all kernel threads. The result pairs must be written into this buffer in a synchronized way to avoid that two threads write simultaneously to the same buffer area. The CUDA API provides atomic operations (such as atomic increment of a buffer pointer) to guarantee this kind of synchronized writing. Buffer overflows are also handled by our similarity join methods. If the buffer is full, all threads terminate and the work is resumed after the buffer is emptied by the CPU.
6 Similarity Join to Support Density-Based Clustering
As mentioned in Section 5, the similarity join is an important building block to support a wide range of data mining tasks, including classification [24], outlier detection [5], association rule mining [17], and clustering [8], [12]. In this section, we illustrate how to effectively support the density-based clustering algorithm DBSCAN [8] with the similarity join on GPU.

6.1 Basic Definitions and Sequential DBSCAN
The idea of density-based clustering is that clusters are areas of high point density, separated by areas of significantly lower point density. The point density can be formalized using two parameters, called $\varepsilon \in \mathbb{R}^+$ and $MinPts \in \mathbb{N}^+$. The central notion is the core object. A data object $x$ is called a core object of a cluster if at least $MinPts$ objects (including $x$ itself) are in its $\varepsilon$-neighborhood $N_\varepsilon(x)$, which corresponds to a sphere of radius $\varepsilon$. Formally:

Definition 4 (Core Object). Let $D$ be a set of $n$ objects from $\mathbb{R}^d$, $\varepsilon \in \mathbb{R}^+$ and $MinPts \in \mathbb{N}^+$. An object $x \in D$ is a core object if and only if $|N_\varepsilon(x)| \ge MinPts$, where $N_\varepsilon(x) = \{x' \in D : \|x' - x\| \le \varepsilon\}$.

Note that this definition of $N_\varepsilon(x)$ is equivalent to Definition 1. Two objects may be assigned to a common cluster. In density-based clustering, this is formalized by the notions of direct density reachability and density connectedness.

Definition 5 (Direct Density Reachability). Let $x, x' \in D$. $x'$ is called directly density reachable from $x$ (in symbols: $x \rhd x'$, or equivalently $x' \lhd x$) if and only if
1. $x$ is a core object in $D$, and
2. $x' \in N_\varepsilon(x)$.

If $x$ and $x'$ are both core objects, then $x \rhd x'$ is equivalent to $x \lhd x'$. The density connectedness is the transitive and symmetric closure of the direct density reachability:
Definition 6 (Density Connectedness). Two objects $x$ and $x'$ are called density connected (in symbols: $x \bowtie x'$) if and only if there is a sequence of core objects $(x_1, \ldots, x_m)$ of arbitrary length $m$ such that $x \lhd x_1 \rhd \ldots \rhd x_m \rhd x'$.

In density-based clustering, a cluster is defined as a maximal set of density connected objects:

Definition 7 (Density-based Cluster). A subset $C \subseteq D$ is called a cluster if and only if the following two conditions hold:
1. Density connectedness: $\forall x, x' \in C : x \bowtie x'$.
2. Maximality: $\forall x \in C, \forall x' \in D \setminus C : \neg (x \bowtie x')$.

The algorithm DBSCAN [8] implements the cluster notion of Definition 7 using a data structure called seed list $S$ containing a set of seed objects for cluster expansion. More precisely, the algorithm proceeds as follows:
1. Mark all objects as unprocessed.
2. Consider an arbitrary unprocessed object $x \in D$.
3. If $x$ is a core object, assign a new cluster ID $C$, and do step (4) for all elements $x' \in N_\varepsilon(x)$ which do not yet have a cluster ID:
4. (a) mark the element $x'$ with the cluster ID $C$ and (b) insert the object $x'$ into the seed list $S$.
5. While $S$ is not empty, repeat step (6) for all elements $s \in S$:
6. If $s$ is a core object, do step (7) for all elements $x' \in N_\varepsilon(s)$ which do not yet have any cluster ID:
7. (a) mark the element $x'$ with the cluster ID $C$ and (b) insert the object $x'$ into the seed list $S$.
8. If there are still unprocessed objects in the database, continue with step (2).

To illustrate the algorithmic paradigm, Figure 8 displays a snapshot of DBSCAN during cluster expansion. The light grey cluster on the left side has been processed already. The algorithm currently expands the dark grey cluster on the right side. The seed list $S$ currently contains one object, the object $x$. $x$ is a core object since there are more than $MinPts = 3$ objects in its $\varepsilon$-neighborhood ($|N_\varepsilon(x)| = 6$, including $x$ itself). Two of these objects, $x'$ and $x''$, have not been processed so far and are therefore inserted into $S$.
This way, the cluster is iteratively expanded until the seed list is empty. After that, the algorithm continues with an arbitrary unprocessed object until all objects have been processed. Since every object of the database is considered only once in step 2 or step 6 (exclusively), we have a complexity which is $n$ times the complexity of determining $N_\varepsilon(x)$ (which is linear in $n$ if there is no index structure, and sublinear or even $O(\log n)$ in the presence of a multidimensional index structure). The result of DBSCAN is determinate.
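The eight steps above can be written down as a short sequential Python sketch. It is only an illustration: the function name `dbscan`, the brute-force neighborhood computation, and the convention that objects without a cluster keep the value `None` are assumptions of this example rather than part of the paper.

```python
import math


def dbscan(data, eps, min_pts):
    """Return a list of cluster IDs (None for objects left without a cluster)."""
    n = len(data)
    # eps-neighborhoods by brute force; each object is its own neighbor
    neighbors = [[j for j in range(n) if math.dist(data[i], data[j]) <= eps] for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]   # Definition 4
    cluster = [None] * n
    next_id = 0
    for x in range(n):                      # steps 2 and 8
        if cluster[x] is not None or not is_core[x]:
            continue
        cid = next_id                       # step 3: new cluster ID
        next_id += 1
        seeds = []
        for y in neighbors[x]:              # step 4
            if cluster[y] is None:
                cluster[y] = cid
                seeds.append(y)
        while seeds:                        # step 5
            s = seeds.pop()
            if is_core[s]:                  # step 6
                for y in neighbors[s]:      # step 7
                    if cluster[y] is None:
                        cluster[y] = cid
                        seeds.append(y)
    return cluster
```

For example, two well-separated squares of four points each plus one isolated point yield two clusters, while the isolated point receives no cluster ID.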
Fig. 8. Sequential Density-based Clustering
algorithm GPUdbscanNLJ(data set D)          // host program executed on CPU
  deviceMem float D'[][] := D[][];          // allocate memory in DM for the data set D
  deviceMem int counter[n];                 // allocate memory in DM for counter
  #threads := n;                            // number of points in D
  #threadsPerGroup := 64;
  startThreads(GPUdbscanKernel, #threads, #threadsPerGroup);   // one thread per point
  waitForThreadsToFinish();
  copy counter from DM to main memory;
end.

kernel GPUdbscanKernel(int threadID)
  register float q[] := D'[threadID][];     // copy the point from DM into the register
                                            // and use it as query point q
                                            // index is determined by the threadID
  for i := 0 ... threadID do                // option 1 OR
  for i := 0 ... n − 1 do                   // option 2
    synchronizeThreadGroup();
    shared float x[] := D'[i][];            // copy the current point x from DM to SM
    synchronizeThreadGroup();               // now all threads of the thread group can work with x
    if dist(x, q) ≤ ε then
      atomicInc(counter[i]); atomicInc(counter[threadID]);   // option 1 OR
      inc counter[threadID];                                 // option 2
end.
Fig. 9. Parallel Algorithm for the Nested Loop Join to Support DBSCAN on GPU
6.2 GPU-Supported DBSCAN
To effectively support DBSCAN on GPU, we first identify the two major stages of the algorithm requiring most of the processing time:
1. Determination of the core object property.
2. Cluster expansion by computing the transitive closure of the direct density reachability relation.
The first stage can be effectively supported by the similarity join. To check the core object property, we need to count the number of objects which are within the $\varepsilon$-neighborhood of each point. Basically, this can be implemented by a self-join. However, the algorithm for the self-join described in Section 5 needs to be modified to be suitable to support this task. The classical self-join only counts the total number of pairs of data objects with a distance less than or equal to $\varepsilon$. For the core object property, we need a self-join with a counter associated with
each object. Each time the algorithm detects a new pair fulfilling the join condition, the counters of both objects need to be incremented. We propose two different variants to implement the self-join to support DBSCAN on GPU, which are displayed in pseudocode in Figure 9. Modifications over the basic algorithm for the nested loop join (cf. Figure 6) are displayed in darker color. As in the simple algorithm for the nested loop join, a separate thread with a unique threadID is created for each point q of the outer loop. Both variants of the self-join for DBSCAN operate on an array counter which stores the number of neighbors for each object. We have two options for incrementing the counters of the objects when a pair of objects (x, q) fulfills the join condition. Option 1 is to increment first the counter of x and then the counter of q using the atomic operation atomicInc() (cf. Section 3). The operation atomicInc() involves synchronization of all threads. The atomic operations are required to ensure the correctness of the result, since it is possible that different threads try to increment the counters of objects simultaneously. In clustering, we typically have many core objects, which causes a large number of synchronized operations that limit parallelism. Therefore, we also implemented option 2, which guarantees correctness without synchronized operations. Whenever a pair of objects (x, q) fulfills the join condition, we only increment the counter of point q. Point q is the point of the outer loop for which the thread has been generated, which means q is exclusively associated with the threadID. Therefore, the cell counter[threadID] can be safely incremented with the ordinary, non-synchronized operation inc(). Since no other point is associated with the same threadID as q, no collision can occur. However, note that in contrast to option 1, for each point of the outer loop, the inner loop needs to consider all other points. Otherwise results are missed.
Recall that for the conventional sequential nested loop join (cf. Figure 5) it is sufficient to consider in the inner loop only those points which have not been processed so far. Already processed points can be excluded because, if they are join partners of the current point, this has already been detected. The same holds for option 1. Because of parallelism, we cannot state which objects have already been processed. However, it is still sufficient for each object to search in the inner loop for join partners among those objects which would appear later in the sequential processing order, because all other objects are addressed by different threads. Option 2 requires checking all objects since only one counter is incremented. With sequential processing, option 2 would thus double the workload. However, as our results in Section 8 demonstrate, option 2 can pay off under certain conditions since parallelism is not limited by synchronization.
After the core object property has been determined, clusters can be expanded starting from the core objects. This second stage of DBSCAN can also be effectively supported on the GPU. For cluster expansion, it is required to compute the transitive closure of the direct density reachability relation. Recall that this is closely connected to the core object property, as all objects within the range of a core object x are directly density reachable from x. To compute the transitive closure, standard algorithms are available. The most well-known among them is
C. Böhm et al.
the Floyd-Warshall algorithm. A highly parallel variant of the Floyd-Warshall algorithm on GPU has recently been proposed [15], but this is beyond the scope of this paper.
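For orientation only, the transitive closure mentioned above can be sketched sequentially with Warshall's algorithm on a boolean matrix. This is a plain CPU illustration in Python, not the parallel GPU variant of [15]; the function name and the matrix representation of the direct density reachability relation are assumptions.

```python
def transitive_closure(reach):
    """Warshall's algorithm on a boolean matrix.

    reach[i][j] is True if object j is directly reachable from object i.
    The result contains True at [i][j] if j is reachable from i in one
    or more steps.
    """
    n = len(reach)
    closure = [row[:] for row in reach]  # work on a copy of the input
    for k in range(n):                   # allow k as an intermediate object
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = True
    return closure


if __name__ == "__main__":
    # Object 0 reaches 1 directly, and 1 reaches 2 directly.
    direct = [
        [False, True, False],
        [False, False, True],
        [False, False, False],
    ]
    for row in transitive_closure(direct):
        print(row)
```

The triple loop costs O(n³) operations for n objects, which is why a parallel formulation is attractive for large data sets.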
7 K-Means Clustering on GPU
7.1 The Algorithm K-Means
A well-established partitioning clustering method is the K-means clustering algorithm [21]. K-means requires a metric distance function in vector space. In addition, the user has to specify the number of desired clusters k as an input parameter. Usually K-means starts with an arbitrary partitioning of the objects into k clusters. After this initialization, the algorithm iteratively performs the following two steps until convergence: (1) Update centers: For each cluster, compute the mean vector of its assigned objects. (2) Re-assign objects: Assign each object to its closest center. The algorithm converges as soon as no object changes its cluster assignment during two subsequent iterations.
Figure 10 illustrates an example run of K-means for k = 3 clusters. Figure 10(a) shows the situation after random initialization. In the next step, every data point is associated with the closest cluster center (cf. Figure 10(b)). The resulting partitions represent the Voronoi cells generated by the centers. In the following step of the algorithm, the center of each of the k clusters is updated, as shown in Figure 10(c). Finally, assignment and update steps are repeated until convergence. In most cases, fast convergence can be observed.
The objective function of K-means is well defined: the algorithm minimizes the sum of squared distances of the objects to their cluster centers. However, K-means is only guaranteed to converge towards a local minimum of the objective function, and the quality of the result strongly depends on the initialization. Finding the clustering with k clusters that minimizes the objective function is actually an NP-hard problem; for details see e.g. [23]. In practice, it is therefore recommended to run the algorithm several times with different random initializations and keep the best result. For large data sets, however, often only a very limited number of trials is feasible. Parallelizing K-means on the GPU allows for a more comprehensive exploration of the search space of all potential clusterings and thus provides the potential to obtain a good and reliable clustering even for very large data sets.
Fig. 10. Sequential Partitioning Clustering by the K-means Algorithm. Panels: (a) Initialization, (b) Assignment, (c) Recalculation, (d) Termination
7.2 CUDA-K-Means
In K-means, most computing power is spent in step (2) of the algorithm, i.e. re-assignment, which involves distance computation and comparison. The number of distance computations and comparisons in K-means is O(k · i · n), where i denotes the number of iterations and n is the number of data points.
The CUDA-K-meansKernel. In K-means clustering, the cluster assignment of each data point is determined by comparing the distances between that point and each cluster center. This work is performed in parallel by the CUDA-K-meansKernel. The idea is that, instead of (sequentially) performing the cluster assignment of one single data point at a time, we start many different cluster assignments at the same time for different data points. In detail, one single thread per data point is generated, all executing the CUDA-K-meansKernel. Every thread which is generated from the CUDA-K-meansKernel (cf. Figure 11) starts with the ID of a data point x which is going to be processed. Its main tasks are to determine the distance to the nearest center and the ID of the corresponding cluster.
algorithm CUDA-K-means(data set D, int k)                  // host program executed on CPU
    deviceMem float D'[][] := D[][];                        // allocate memory in DM for the data set D
    #threads := |D|;                                        // number of points in D
    #threadsPerGroup := 64;
    deviceMem float Centroids[][] := initCentroids();       // allocate memory in DM for the initial centroids
    double actCost := ∞;                                    // initial costs of the clustering
    repeat
        prevCost := actCost;
        startThreads(CUDA-K-meansKernel, #threads, #threadsPerGroup);   // one thread per point
        waitForThreadsToFinish();
        float minDist := minDistances[threadID];            // copy the distance to the nearest centroid from DM into MM
        float cluster := clusters[threadID];                // copy the assigned cluster from DM into MM
        actCost := calculateCosts();                        // update costs of the clustering
        deviceMem float Centroids[][] := calculateCentroids();   // copy updated centroids to DM
    until |actCost − prevCost| < threshold                  // convergence
end.

kernel CUDA-K-meansKernel(int threadID)
    register float x[] := D'[threadID][];                   // copy the point from DM into the register
    float minDist := ∞;                                     // distance of x to the nearest centroid
    int cluster := null;                                    // ID of the nearest centroid (cluster)
    for i := 1 ... k do                                     // process each cluster
        register float c[] := Centroids[i][];               // copy the current centroid from DM into the register
        double dist := distance(x, c);
        if dist < minDist then
            minDist := dist;
            cluster := i;
    report(minDist, cluster);                               // report assigned cluster and distance using synchronized writing
end.
Fig. 11. Parallel Algorithm for K-means on the GPU
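The work of a single kernel thread in Figure 11 can be mirrored in a few lines of sequential Python. This is a sketch under assumptions: the names `kmeans_kernel` and `assign_all` are hypothetical, Euclidean distance is used for `distance(x, c)`, cluster IDs are 0-based here whereas the pseudocode counts from 1, and the list comprehension merely stands in for launching one GPU thread per point.

```python
import math


def kmeans_kernel(thread_id, data, centroids):
    """Work of one thread: find the nearest centroid for data[thread_id]."""
    x = data[thread_id]              # the point owned by this thread
    min_dist = math.inf              # distance of x to the nearest centroid so far
    cluster = None                   # ID of the nearest centroid so far
    for i, c in enumerate(centroids):
        d = math.dist(x, c)
        if d < min_dist:
            min_dist = d
            cluster = i
    return min_dist, cluster         # corresponds to report(minDist, cluster)


def assign_all(data, centroids):
    """Sequential stand-in for starting one thread per data point."""
    return [kmeans_kernel(t, data, centroids) for t in range(len(data))]


if __name__ == "__main__":
    data = [(0.0, 0.0), (10.0, 10.0)]
    centroids = [(1.0, 0.0), (9.0, 10.0)]
    print(assign_all(data, centroids))
```

Because each call reads only shared inputs and produces a result for its own point, the calls are independent of each other, which is what makes the one-thread-per-point mapping possible.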
A thread starts by reading the coordinates of the data point x into the register. The distance of x to its closest center is initialized to ∞ and the assigned cluster is therefore set to null. Then a loop iterates over all centers c1, c2, ..., ck and considers them as potential clusters for x. This is done by all threads in the thread group, allowing a maximum degree of intra-group parallelism. Finally, the cluster whose center has the minimum distance to the data point x is reported together with the corresponding distance value using synchronized writing.
The Main Program for CPU. Apart from initialization and data transfer from main memory (MM) to DM, the main program consists of a loop starting the CUDA-K-meansKernel on the GPU until the clustering converges. After the parallel operations are completed by all threads of the group, the following steps are executed in each cycle of the loop:
1. Copy the distance of the processed point x to the nearest center from DM into MM.
2. Copy the cluster that x is assigned to from DM into MM.
3. Update centers.
4. Copy updated centers to DM.
Pseudocode for these procedures is given in Figure 11.
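The main program loop described above can be sketched as a sequential Python routine. The sketch makes several assumptions that the pseudocode leaves open: the cost is taken to be the sum of squared distances (the objective stated in Section 7.1), an empty cluster keeps its previous centroid, and `max_iter` is an added safety bound. All function names are hypothetical, and no data transfer between MM and DM is modeled.

```python
import math


def assign(data, centroids):
    """Assignment step: (distance to nearest centroid, centroid index) per point."""
    result = []
    for x in data:
        dists = [math.dist(x, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        result.append((dists[best], best))
    return result


def update_centroids(data, assignment, centroids):
    """Update step: mean of the assigned points; an empty cluster keeps its old centroid."""
    dim = len(data[0])
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [x for x, (_, c) in zip(data, assignment) if c == i]
        if members:
            mean = tuple(sum(m[j] for m in members) / len(members) for j in range(dim))
            new_centroids.append(mean)
        else:
            new_centroids.append(tuple(old))
    return new_centroids


def kmeans_host(data, centroids, threshold=1e-9, max_iter=100):
    """Repeat assignment and update until the cost changes by less than threshold."""
    act_cost = math.inf
    assignment = []
    for _ in range(max_iter):
        prev_cost = act_cost
        assignment = assign(data, centroids)            # the part done by the GPU kernel
        act_cost = sum(d * d for d, _ in assignment)    # assumed cost: sum of squared distances
        centroids = update_centroids(data, assignment, centroids)
        if abs(act_cost - prev_cost) < threshold:       # convergence test of Figure 11
            break
    return centroids, [c for _, c in assignment], act_cost


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans_host(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Only `assign` corresponds to work that the paper moves to the GPU; cost calculation and centroid update remain on the host, as in the four steps listed above.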
8 Experimental Evaluation
To evaluate the performance of data mining on the GPU, we performed various experiments on synthetic data sets. The implementation of all variants is written in C, and all experiments are performed on a workstation with an Intel Core 2 Duo CPU E4500 2.2 GHz and 2 GB RAM, which is equipped with a Gainward NVIDIA GeForce GTX280 GPU (240 SIMD-processors) with 1 GB GDDR3 SDRAM.
8.1 Evaluation of Similarity Join on the GPU
The performance of similarity join on the GPU is validated by comparing four different variants for executing the similarity join:
1. Nested loop join (NLJ) on the CPU
2. NLJ on the CPU with index support (as described in Section 4)
3. NLJ on the GPU
4. NLJ on the GPU with index support (as described in Section 4)
For each version we determine the speedup factor as the ratio of CPU runtime to GPU runtime. For this purpose we generated three 8-dimensional synthetic data sets of various sizes (up to 10 million (m) points) with different data distributions, as summarized in Table 1. Data set DS1 contains uniformly distributed data. DS2 consists of five Gaussian clusters which are randomly distributed in feature space (see Figure 12(a)). Similar to DS2, DS3 is also composed of five Gaussian clusters, but the clusters are correlated. An illustration of data set DS3 is given in Figure 12(b). The threshold ε was selected to obtain a join result where each point was combined with one or two join partners on average.

Table 1. Data Sets for the Evaluation of the Similarity Join on the GPU
Name   Size                Distribution
DS1    3m - 10m points     uniform distribution
DS2    250k - 1m points    normal distribution, gaussian clusters
DS3    250k - 1m points    normal distribution, gaussian clusters

Fig. 12. Illustration of the data sets DS2 and DS3. Panels: (a) Random Clusters, (b) Linear Clusters

Evaluation of the Size of the Data Sets. Figure 13 displays the runtime in seconds and the corresponding speedup factors of NLJ on the CPU with/without index support and NLJ on the GPU with/without index support in logarithmic scale for all three data sets DS1, DS2 and DS3. The time needed for data transfer from CPU to the GPU and back as well as the (negligible) index construction time has been included. The tests were performed with a join selectivity of ε = 0.125 on DS1 and ε = 0.588 on DS2 and DS3. NLJ on the GPU with index support performs best in all experiments, independent of the data distribution or size of the data set. Note that, due to massive parallelization, NLJ on the GPU without index support outperforms the CPU without index by a large factor (e.g. 120 on 1m points of normally distributed data with Gaussian clusters). The GPU algorithm with index support outperforms the corresponding CPU algorithm (with index) by a factor of 25 on data set DS2. Note that, for example, the overall improvement of the indexed GPU algorithm on data set DS2 over the non-indexed CPU version is more than 6,000. These results demonstrate the potential of boosting the performance of database operations by designing specialized index structures and algorithms for the GPU.
Evaluation of the Join Selectivity. In these experiments we test the impact of the parameter ε on the performance of NLJ on GPU with index support and use the indexed implementation of NLJ on the CPU as benchmark. All experiments are performed on data set DS2 with a fixed size of 500k data points. The parameter ε is evaluated in a range from 0.125 to 0.333. Figure 14(a) shows that the runtime of NLJ on GPU with index support increases for larger ε values. However, the GPU version outperforms the CPU implementation by a large factor (cf. Figure 14(b)) that is proportional to the value of ε. In this evaluation the speedup ranges from 20 for a join selectivity of 0.125 to almost 60 for ε = 0.333.
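The speedup factors quoted for DS2 in the data set size experiment can be related to each other by a short calculation. As a sketch, write T with a subscript for the runtime of each variant (this notation is introduced here and does not appear in the original), and assume that the quoted factors refer to the same data set size:

```latex
\[
\frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU,idx}}}
  = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{CPU,idx}}}
    \cdot \frac{T_{\mathrm{CPU,idx}}}{T_{\mathrm{GPU,idx}}}
\quad\Longrightarrow\quad
\frac{T_{\mathrm{CPU}}}{T_{\mathrm{CPU,idx}}}
  > \frac{6000}{25} = 240 .
\]
```

Under that assumption, the index structure alone would account for a factor of more than 240 on the CPU, and parallelization on the GPU contributes the remaining factor of 25.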
Fig. 13. Evaluation of the NLJ on CPU and GPU with and without Index Support w.r.t. the Size of Different Data Sets. Panels: (a) Runtime on Data Set DS1, (b) Speedup on Data Set DS1, (c) Runtime on Data Set DS2, (d) Speedup on Data Set DS2, (e) Runtime on Data Set DS3, (f) Speedup on Data Set DS3. Runtime panels plot Time (sec, logarithmic scale) against Size for CPU, CPU indexed, GPU and GPU indexed; speedup panels plot the Speedup Factor against Size without and with index. Size is given in m for DS1 and in k for DS2 and DS3.
Fig. 14. Impact of the Join Selectivity on the NLJ on GPU with Index Support. Panels: (a) Runtime on Data Set DS2, (b) Speedup on Data Set DS2. Panel (a) plots Time (sec, logarithmic scale) against epsilon for CPU and GPU; panel (b) plots the Speedup Factor against epsilon.

Evaluation of the Dimensionality. These experiments provide an evaluation with respect to the dimensionality of the data. As in the experiments for the evaluation of the join selectivity, we again use the indexed implementations on both CPU and GPU and perform all tests on data set DS2 with a fixed number of 500k data objects. The dimensionality is evaluated in a range from 8 to 32. We also performed these experiments with two different settings for the join selectivity, namely ε = 0.588 and ε = 1.429. Figure 15 illustrates that NLJ on GPU outperforms the benchmark method on CPU by factors of about 20 for ε = 0.588 to approximately 70 for ε = 1.429. This order of magnitude is relatively independent of the data dimensionality. As the dimensionality is already known at compile time in our implementation, optimization techniques of the compiler have an impact on the performance of the CPU version, as can be seen especially in Figure 15(c). However, the dimensionality also affects the implementation on GPU, because higher dimensional data come along with a higher demand for shared memory. This overhead affects the number of threads that can be executed in parallel on the GPU.

Fig. 15. Impact of the Dimensionality on the NLJ on GPU with Index Support. Panels: (a) Runtime on Data Set DS2 (ε = 0.588), (b) Speedup on Data Set DS2 (ε = 0.588), (c) Runtime on Data Set DS2 (ε = 1.429), (d) Speedup on Data Set DS2 (ε = 1.429). Runtime panels plot Time (sec, logarithmic scale) against Dimensionality for CPU and GPU; speedup panels plot the Speedup Factor against Dimensionality.

8.2 Evaluation of GPU-Supported DBSCAN
As described in Section 6.2, we suggest two different variants to implement the self-join to support DBSCAN on GPU, whose characteristics are briefly reviewed in the following:
1. The counters for a pair of objects (x, q) that fulfills the join condition are incremented using an atomic operation that involves synchronization of all threads.
2. The counters are incremented without synchronization, but with duplicated workload instead.
We evaluate both options on a synthetic data set with 500k points generated as specified for DS1 in Table 1. Figure 16 displays the runtime of both options. For ε ≤ 0.6, the runtimes are of the same order of magnitude, with the synchronized variant 1 being slightly more efficient. Beyond this point, the non-synchronized variant 2 clearly outperforms variant 1 since parallelism is not limited by synchronization.

Fig. 16. Evaluation of two versions for the self-join on GPU w.r.t. the join selectivity. The plot shows Time (sec, logarithmic scale) against epsilon for the variants with and without synchronization.

8.3 Evaluation of CUDA-K-Means
To analyze the efficiency of K-means clustering on the GPU, we present experiments with respect to different data set sizes, numbers of clusters and dimensionality of the data. As benchmark we apply a single-threaded implementation of K-means on the CPU to determine the speedup of the implementation of K-means on the GPU. As the number of iterations may vary in each run of the experiments, all results are normalized to 50 iterations for both the GPU and the CPU implementation of K-means. All experiments are performed on synthetic data sets as described in detail in each of the following settings.
Evaluation of the Size of the Data Set. For these experiments we created 8-dimensional synthetic data sets of different size, ranging from 32k to 2m data points. The data sets consist of different numbers of random clusters, generated as specified for DS1 in Table 1. Figure 17 displays the runtime in seconds in logarithmic scale and the corresponding speedup factors of CUDA-K-means and the benchmark method on the CPU for different numbers of clusters. The time needed for data transfer from CPU to GPU and back has been included. The corresponding speedup factors are given in Figure 17(d). Once again, these experiments support the evidence that data mining approaches on the GPU outperform classic CPU versions by significant factors. Whereas a speedup of approximately 10 to 100 can be achieved for a relatively small number of clusters, we obtain a speedup of about 1000 for 256 clusters, which even increases with the number of data objects.

Fig. 17. Evaluation of CUDA-K-means w.r.t. the Size of the Data Set. Panels: (a) Runtime for 32 clusters, (b) Runtime for 64 clusters, (c) Runtime for 256 clusters, (d) Speedup for 32, 64 and 256 clusters. Runtime panels plot Time (sec, logarithmic scale) against Size (k) for CPU and GPU; panel (d) plots the Speedup Factor against Size (k) for k=32, k=64 and k=256.

Evaluation of the Impact of the Number of Clusters. We performed several experiments to validate CUDA-K-means with respect to the number of clusters K. Figure 18 shows the runtime in seconds of CUDA-K-means compared with the implementation of K-means on the CPU on 8-dimensional synthetic data sets that contain different numbers of clusters, ranging from 32 to 256, again together with the corresponding speedup factors in Figure 18(d). The experimental evaluation of K on a data set that consists of 32k points results in a maximum performance benefit of more than 800 compared to the benchmark implementation. For 2m points the speedup ranges from nearly 100 up to even more than 1,000 for a data set that comprises 256 clusters. In this case the calculation on the GPU takes approximately 5 seconds, compared to almost 3 hours on the CPU. We therefore conclude that, due to massive parallelization, CUDA-K-means outperforms the CPU by large factors, which even grow with K and the number of data objects n.

Fig. 18. Evaluation of CUDA-K-means w.r.t. the number of clusters K. Panels: (a) Runtime for 32k points, (b) Runtime for 500k points, (c) Runtime for 2m points, (d) Speedup for 32k, 500k and 2m points. Runtime panels plot Time (sec, logarithmic scale) against k for CPU and GPU; panel (d) plots the Speedup Factor against k.

Evaluation of the Dimensionality. These experiments provide an evaluation with respect to the dimensionality of the data. We perform all tests on synthetic data consisting of 16k data objects. The dimensionality of the test data sets varies in a range from 4 to 256. Figure 19(b) illustrates that CUDA-K-means outperforms the benchmark method K-means on the CPU by factors of 230 for 128-dimensional data to almost 500 for 8-dimensional data. On the GPU and the CPU, the dimensionality affects possible compiler optimization techniques, like loop unrolling, as already shown in the experiments for the evaluation of the similarity join on the GPU. In summary, the results of this section demonstrate the high potential of boosting the performance of complex data mining techniques by designing specialized index structures and algorithms for the GPU.

Fig. 19. Impact of the Dimensionality of the Data Set on CUDA-K-means. Panels: (a) Runtime, plotting Time (sec, logarithmic scale) against Dimensionality for CPU and GPU, (b) Speedup, plotting the Speedup Factor against Dimensionality.
9 Conclusions
In this paper, we demonstrated how Graphics Processing Units (GPUs) can effectively support highly complex data mining tasks. In particular, we focused on clustering. With the aim of finding a natural grouping of an unknown data set, clustering certainly is among the most widespread data mining tasks, with countless applications in various domains. We selected two well-known clustering algorithms, the density-based algorithm DBSCAN and the iterative algorithm K-means, and proposed algorithms illustrating how to effectively support clustering on GPU. Our proposed algorithms are adapted to the special environment of the GPU, which is most importantly characterized by extreme parallelism at low cost. A single GPU consists of a large number of processors. As building blocks for effective support of DBSCAN, we proposed a parallel version of the similarity join and an index structure for efficient similarity search. Going beyond the primary scope of this paper, these building blocks are applicable to support a wide range of data mining tasks, including outlier detection, association rule mining and classification. To illustrate that not only local density-based clustering can be efficiently performed on GPU, we additionally proposed a parallelized version of K-means clustering. Our extensive experimental evaluation emphasizes the potential of the GPU for high-performance data mining. In our ongoing work, we develop further algorithms to support more specialized data mining tasks on GPU, including for example subspace and correlation clustering and medical image processing.
References
1. NVIDIA CUDA Compute Unified Device Architecture - Programming Guide (2007)
2. Bernstein, D.J., Chen, T.-R., Cheng, C.-M., Lange, T., Yang, B.-Y.: Ecm on graphics cards. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479, pp. 483–501. Springer, Heidelberg (2009)
3. Böhm, C., Braunmüller, B., Breunig, M.M., Kriegel, H.-P.: High performance clustering based on the similarity join. In: CIKM, pp. 298–305 (2000)
4. Böhm, C., Noll, R., Plant, C., Zherdin, A.: Index-supported similarity join on graphics processors. In: BTW, pp. 57–66 (2009)
5. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: Lof: Identifying density-based local outliers. In: SIGMOD Conference, pp. 93–104 (2000)
6. Cao, F., Tung, A.K.H., Zhou, A.: Scalable clustering using graphics processors. In: Yu, J.X., Kitsuregawa, M., Leong, H.-V. (eds.) WAIM 2006. LNCS, vol. 4016, pp. 372–384. Springer, Heidelberg (2006)
7. Catanzaro, B.C., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: ICML, pp. 104–111 (2008)
8. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
9. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data mining: Towards a unifying framework. In: KDD, pp. 82–88 (1996)
10. Govindaraju, N.K., Gray, J., Kumar, R., Manocha, D.: Gputerasort: high performance graphics co-processor sorting for large database management. In: SIGMOD Conference, pp. 325–336 (2006)
11. Govindaraju, N.K., Lloyd, B., Wang, W., Lin, M.C., Manocha, D.: Fast computation of database operations using graphics processors. In: SIGMOD Conference, pp. 215–226 (2004)
12. Guha, S., Rastogi, R., Shim, K.: Cure: An efficient clustering algorithm for large databases. In: SIGMOD Conference, pp. 73–84 (1998)
13. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47–57 (1984)
14. He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational joins on graphics processors. In: SIGMOD, pp. 511–524 (2008)
15. Katz, G.J., Kider, J.T.: All-pairs shortest-paths for large graphs on the gpu. In: Graphics Hardware, pp. 47–55 (2008)
16. Kitsuregawa, M., Harada, L., Takagi, M.: Join strategies on kd-tree indexed relations. In: ICDE, pp. 85–93 (1989)
17. Koperski, K., Han, J.: Discovery of spatial association rules in geographic information databases. In: Egenhofer, M.J., Herring, J.R. (eds.) SSD 1995. LNCS, vol. 951, pp. 47–66. Springer, Heidelberg (1995)
18. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: Str: A simple and efficient algorithm for r-tree packing. In: ICDE, pp. 497–506 (1997)
19. Lieberman, M.D., Sankaranarayanan, J., Samet, H.: A fast similarity join algorithm using graphics processing units. In: ICDE, pp. 1111–1120 (2008)
20. Liu, W., Schmidt, B., Voss, G., Müller-Wittig, W.: Molecular dynamics simulations on commodity gpus with cuda. In: Aluru, S., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) HiPC 2007. LNCS, vol. 4873, pp. 185–196. Springer, Heidelberg (2007)
21. Macqueen, J.B.: Some methods of classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
22. Manavski, S., Valle, G.: Cuda compatible gpu cards as efficient hardware accelerators for smith-waterman sequence alignment. BMC Bioinformatics 9 (2008)
23. Meila, M.: The uniqueness of a good optimum for k-means. In: ICML, pp. 625–632 (2006)
24. Plant, C., Böhm, C., Tilg, B., Baumgartner, C.: Enhancing instance-based classification with local density: a new algorithm for classifying unbalanced biomedical data. Bioinformatics 22(8), 981–988 (2006)
25. Shalom, S.A.A., Dash, M., Tue, M.: Efficient k-means clustering using accelerated graphics processors. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 166–175. Springer, Heidelberg (2008)
26. Szalay, A., Gray, J.: 2020 computing: Science in an exponential world. Nature 440, 413–414 (2006)
27. Tasora, A., Negrut, D., Anitescu, M.: Large-scale parallel multi-body dynamics with frictional contact on the graphical processing unit. Proc. of Inst. Mech. Eng. Journal of Multi-body Dynamics 222(4), 315–326
Context-Aware Data and IT Services Collaboration in E-Business
Khouloud Boukadi¹, Chirine Ghedira², Zakaria Maamar³, Djamal Benslimane², and Lucien Vincent¹
¹ Ecole des Mines, Saint Etienne, France
² University Lyon 1, France
³ Zayed University, Dubai, U.A.E.
{boukadi,Vincent}@emse.fr, {cghedira,dbenslim}@liris.cnrs.fr, [email protected]
Abstract. This paper discusses the use of services in the design and development of adaptable business processes, which should let organizations quickly react to changes in regulations and needs. Two types of services are adopted, namely Data and Information Technology. A data service is primarily used to hide the complexity of accessing distributed and heterogeneous data sources, while an information technology service is primarily used to hide the complexity of running requests that cross organizational boundaries. The combination of both services takes place under the control of another service, which is denoted by service domain. A service domain orchestrates and manages data and information technology services in response to the events that arise and changes that occur. This happens because service domains are sensitive to context. Policies and aspect-oriented programming principles support the exercise of packaging data and information technology services into service domains as well as making service domains adapt to business changes.
Keywords: service, service adaptation, context, aspect-oriented programming, policy.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 91–115, 2009. © Springer-Verlag Berlin Heidelberg 2009
1 Introduction
With the latest development of technologies for knowledge management on the one hand, and techniques for project management on the other hand, both coupled with the widespread use of the Internet, today’s enterprises are under pressure to adjust their know-how and enforce their best practices. These enterprises have to be more focused on their core competencies and hence have to seek the support of other peers through partnership to carry out their non-core competencies. The success of this partnership depends on how business processes are designed, as these processes should be loosely coupled and capable of crossing organizational boundaries.
Since the inception of the Service-Oriented Architecture (SOA) paradigm along with its multiple implementation technologies such as Jini services and Web services, the focus of the industry community has been on providing tools that would allow seamless and flexible application integration within and across organizational boundaries. Indeed, SOA offers solutions to interoperability, adaptability, and scalability challenges that today’s enterprises have to tackle. The objective, here, is to let enterprises collaborate by putting their core services together, which leads to the creation of new applications that should be responsive to changes in business requirements and regulations. Nevertheless, looking at enterprise applications from a narrow perspective, which consists of services and processes only, has somehow overlooked the data that these applications use in terms of input and output. Data identification and integration are left to a later stage of the development cycle of these applications, which is not very convenient when these data are spread over different sources including relational databases, silos of data-centric homegrown or packaged applications, and XML files, to cite just a few [1]. As a result, data identification and integration turn out to be tedious for SOA application developers: bits and pieces of data need to be retrieved/updated from/in heterogeneous data sources with different interfaces and access methods. This situation has undermined SOA benefits and forced the SOA community to recognize the importance of adopting a data-oriented view of services. This has resulted in the emergence of the concept of data services.
In this paper, we look into ways of exposing IT applications and data sources as services in compliance with SOA principles. To this end, we treat a service as either an IT Service (ITS) or a Data Service (DS) and identify the necessary mechanisms that would let ITSs and DSs first, work hand-in-hand during the integration exercise of enterprise applications and second, engage in a controlled way in high-level functionalities to be referred to as Service Domains (SDs). Basically, a SD orchestrates ITSs and DSs in order to provide ready-to-use high-level functionalities to users.
By ready-to-use we mean that service publication, selection, and combination of fine-grained ITSs and DSs are already complete. We define ITSs as active components that make changes in the environment using the update operations that empower them, and DSs as passive components that return data only (consultation) and thus do not impact the environment. In this paper we populate this environment with specific elements, which we denote by Business Objects (BOs). The dynamic nature of BOs (e.g., new BOs are made available, some cease to exist without prior notice, etc.) and the business processes that are built upon these BOs require that SDs be flexible and sensitive to all changes in an enterprise’s requirements and regulations. We enrich the specification of SDs with contextual details such as the execution status of each participating service (DS or ITS), the type of failure along with the corrective actions, etc. This enrichment is illustrated with the de facto standard for service integration, namely the Business Process Execution Language (BPEL). BPEL specifies a business process behavior through automated process integration both within and between organizations. We complete the enrichment of BPEL with context in compliance with Aspect-Oriented Programming (AOP) principles in terms of aspect injection, activation, and execution, and through a set of policies. While context and policies are used separately in different SOA initiatives [2, 3], we examine in this paper their role in designing and developing SDs. This role is depicted using a multi-level architecture that supports the orchestration of DSs and ITSs in response to changes detected in the context. Three levels are identified, namely executive, business, and resource. The role of policies and context in this architecture is as follows:
Context-Aware Data and IT Services Collaboration in E-Business
• The trend towards context-aware, adaptive, and on-demand computing requires that SDs should respond to changes in the environment. This could happen by letting SDs sense the environment and take actions.
• Policies manage and control the participation of DSs and ITSs in SDs to guarantee free-of-conflicts SDs. Conflicts could be related to non-sharable resources and semantic mismatches. Several types of policies will be required so that the particularities of DSs and ITSs are taken into account.
The rest of the paper is organized as follows. Section 2 defines some concepts and introduces a running example. Section 3 presents the multi-level architecture for service (ITSs and DSs) collaboration and outlines the specification of these services. Section 4 introduces the adaptation of services based on aspects as well as the role of policies first, in managing the participation of DSs and ITSs in SDs and second, in controlling the aspect injection within SDs. Prior to concluding in Section 6, related work is reported in Section 5.
2 Background
2.1 Definitions
IT Service. The term IT service is used very often nowadays, even though not always with the same meaning. Existing definitions range from the very generic and all-inclusive to the very specific and restrictive. Some authors [4, 5] define an IT service as a software application accessible to other applications over the Web. Another definition is provided by [6], which says that an IT service is provided by an IT system or the IT department (respectively an external IT service provider) to support business processes. The characteristics of IT services can vary significantly. They can comprise single software components as well as bundles of software components, infrastructure elements, and additional services. These additional services are usually information services, consulting services, training services, problem solving services, or modification services. They are provided by operational processes (IT service processes) within the IT department or the external service provider [7]. In this paper, we consider that IT services should be responsible for representing and implementing business processes in compliance with SOA principles. An IT service corresponds to a functional representation of a real-life business activity having a meaningful effect to end-users. Current practices suggest that IT services could be obtained by applying IT SOA methods such as SAP NetWeaver and IBM WebSphere. These methods bundle IT software and infrastructure and offer them as Web services with standardized and well-defined interfaces. We refer to Web services that result from the application of such methods as enterprise IT Web services or simply IT services.
Data Service. According to a recent report from Forrester Research, “an information service (i.e., data service) provides a simplified, integrated view of real-time, high-quality information about a specific business entity, such as a customer or product.
It can be provided by middleware or packaged as an individual software component. The information that it provides comes from a diverse set of information resources, including
K. Boukadi et al.
operational systems, operational data stores, data warehouses, content repositories, collaboration stores, and even streaming sources in advanced cases” [8]. Another definition suggests that a data service is “a form of Web service, optimized for the real-time data integration demands of SOA. Data services virtualize data to decouple physical and logical locations and therefore avoid unnecessary data replication. Data services abstract complex data structures and syntax. Data services federate disparate data into useful composites. Data services also support data integration across both SOA and non-SOA applications” [9]. Data services can be seen as a new class of services that sits between service-based applications and enterprises’ data sources. By doing so, the complexity of accessing these sources is minimized, which lets application developers focus on the application logic of the solutions to develop. These data sources are the basis of different business objects such as customer, order, and invoice.
Business Object. A business object is “the representation of a thing active in the business domain, including at least its business name and definition, attributes, behavior, relationships, and constraints” (OMG’s Business Object Management Special Interest Group). Another definition by the Business Object Management Architecture (jeffsutherland.com/oopsla97/marshall.html) suggests that a business object could be defined through the concepts of purpose, process, resource, and organization. A purpose is about the rationale of a business. A process illustrates how this purpose is reached through a series of dependent activities. A resource is a computing platform upon which the activities of a process are executed. Finally, an organization manages resources in terms of maintenance, access rights, etc.
Policies. Policies are primarily used to express the actions to take in response to the occurrence of some events.
According to [10], policies are “information which can be used to modify the behavior of a system”. Another definition suggests that policies are “external, dynamically modifiable rules and parameters that are input to a system so that this latter can then adjust to administrative decisions and changes in the execution environment” [5]. In the Web services field, policies are treated as rules and constraints that specify and control the behavior of a Web service upon invocation or participation in composition. For example, a policy determines when a Web service can be invoked, what constraints are put on the inputs a Web service expects, how a Web service can be substituted by another in case of failure, etc. According to [11], policies could be at two levels. At the higher level, policies monitor the execution progress of a Web service. At the lower level, policies address issues like how Web services communicate and what information is needed to enable comprehensive data exchange.
Context. Context “… is not simply the state of a predefined environment with a fixed set of interaction resources. It is part of a process of interacting with an ever-changing environment composed of reconfigurable, migratory, distributed, and multi-scale resources” [12]. In the field of Web services, context facilitates the development and deployment of context-aware Web services. Standard Web service descriptions are then enriched with context details, and new frameworks to support this enrichment are developed [13].
Aspect Oriented Programming. AOP is a paradigm that captures and modularizes concerns that crosscut a software system into modules called Aspects. Aspects can be
integrated dynamically into a system using the dynamic weaving principle [14]. In AOP, the unit of modularity is the aspect, which contains code fragments (known as advice) and location descriptions (known as pointcuts) that identify where to plug in these code fragments. The points that pointcuts can select are called join points. The most popular aspect language is the Java-based AspectJ [15].
2.2 Running Example/Motivating Scenario
Our running example is about a manufacturer of plush toys that gets extremely busy with orders during Christmas time. When an order is received, the first step consists of requesting from suppliers the different components that contribute to the production of the plush toys as per an agreed time frame. When the necessary components are received, the assembly operations begin. Finally, the manufacturer selects a logistic company to deliver these products by the due date. In this scenario, the focus is on the delivery service only. Let us assume an inter-enterprise collaboration is established between the manufacturer (service consumer) and a logistic enterprise (service provider). The latter delivers parcels from the manufacturer’s warehouse to a specific location (Fig. 1 - step (i)). If there are no previous interactions between these two bodies, the logistic enterprise verifies the shipped merchandise. Upon verification approval, the putting merchandise in parcels service is immediately invoked. This service uses a data service known as parcel service, which checks the number of parcels to deliver. The putting merchandise in parcels service is followed by the delivery price and computing delivery price data services. The delivery price data service retrieves the delivery price that corresponds to the manufacturer order based on the different enterprise business objects it has access to such as toy (e.g., size of toy), customer (e.g., discount for regular customers), and parcel (e.g., size of parcel).
Finally, the merchandise is transported to the specified location by the delivery due date. The delivery service is considered as a SD that orchestrates four IT services and two data services: picking merchandise, verifying merchandise, putting merchandise in parcels, delivery price, computing delivery price, and delivering merchandise. Fig. 1 depicts a graph-based orchestration schema (for instance a BPEL process) of the delivery service.
Fig. 1. The delivery service internal process
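As a rough illustration of the delivery SD described above, the following Python sketch lists the six participating services with their kind and derives the sequence invoked for one request. It is not the BPEL process of Fig. 1: the `Service` class, the `plan_delivery` function, and the assumption that the enumeration order is also the execution order are ours.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    kind: str  # "ITS" (IT service) or "DS" (data service)


# Assumption: the order of enumeration in the running example is the execution order.
DELIVERY_SD = [
    Service("picking merchandise", "ITS"),
    Service("verifying merchandise", "ITS"),
    Service("putting merchandise in parcels", "ITS"),
    Service("delivery price", "DS"),
    Service("computing delivery price", "DS"),
    Service("delivering merchandise", "ITS"),
]


def plan_delivery(previous_interactions: bool) -> List[str]:
    """Return the ordered service names for one delivery request.

    Verification only runs when the two enterprises have not interacted before.
    """
    return [
        s.name
        for s in DELIVERY_SD
        if not (s.name == "verifying merchandise" and previous_interactions)
    ]
```

Calling `plan_delivery(False)` yields all six services, whereas `plan_delivery(True)` skips the verification step.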
The inter-enterprise collaboration between the manufacturer and the logistic enterprise raises the importance of establishing dynamic (not pre-established) contacts since different logistic enterprises exist. To this end, the orchestration schema of the SD should be aware of the contexts of both the manufacturer and the nature of the collaboration. Additional details can easily affect the progress of any type of collaboration. For example, if the manufacturer is located outside the country, some security controls should be added and the price calculation should be reviewed. Thus, the SD orchestration schema should be enhanced with contextual information that triggers changes in the process components (ITSs and DSs) in a timely manner. Likewise, environmental context such as weather conditions (e.g., snow storm, heavy rain) may affect the IT service "putting merchandise in parcels". Consequently, several actions should be anticipated to avoid the deterioration of the merchandise, for instance using metal boxes instead of regular cardboard boxes. Besides, the participation of the different ITSs and DSs in the delivery SD should be managed and controlled in order to guarantee that the obtained SD is free-of-conflicts. ITSs and DSs belong to different IT departments, each with its own characteristics, rules, and constraints. As a result, effective mechanisms that ensure and regulate the interaction of ITSs and DSs are required. These mechanisms should also capture changes in context and guarantee the adaptation of service domains’ behaviors in order to accommodate the situation in which they are going to operate.
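The two context-triggered changes mentioned above (bad weather and a manufacturer located abroad) can be sketched as a small rule function. This is only an illustration: the context keys `weather` and `manufacturer_abroad`, the `BAD_WEATHER` set, and the function name are hypothetical and not part of the proposed architecture.

```python
from typing import Dict, List

# Weather conditions named in the running example.
BAD_WEATHER = {"snow storm", "heavy rain"}


def adapt_delivery(context: Dict[str, object]) -> List[str]:
    """Map detected context to the adaptations suggested in the running example."""
    actions: List[str] = []
    if context.get("weather") in BAD_WEATHER:
        # Protect the merchandise handled by "putting merchandise in parcels".
        actions.append("use metal boxes instead of cardboard boxes")
    if context.get("manufacturer_abroad", False):
        actions.append("add security controls")
        actions.append("review price calculation")
    return actions
```

A context without any of these keys yields an empty list, i.e., the orchestration schema is left unchanged.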
3 Multi-level Architecture for Service Collaboration
In this section, the multi-level architecture that supports the orchestration of DSs and ITSs during the exercise of developing SDs is presented in terms of concepts, duties per layer, and service specifications.
3.1 Service Domain Concept
The rationale of the service domain concept is to abstract at a higher level the integration of a large number of ITSs and DSs. A service domain is built upon the Web service concept (in fact, it uses existing standards such as WSDL, SOAP, and UDDI) and enhances it. It does not define new application programming interfaces or standards, but hides the complexity of this integration by facilitating service deployment and self-management. Fig. 2 illustrates the idea of developing inter-enterprise business processes using several service domains. A SD involves ITSs and DSs in two types of orchestration schemas: vertical and horizontal. In the former, a SD controls a set of ITSs, which themselves control a set of DSs. In the latter, a SD controls both ITSs and DSs at the same time. These two schemas offer the possibility of applying different types of control over the services, whether data or IT. Businesses have to address different issues, so they should be given the opportunity to do so in different ways. The participation of DSs and ITSs in either type of orchestration is controlled through a set of policies, which guarantee that SDs are free-of-conflicts. More details on policy use are given in subsections 4.2 and 4.3.2.
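The difference between the vertical and horizontal orchestration schemas can be sketched with three small data classes. The class and method names below are illustrative assumptions, not part of the service domain concept itself; the sketch only captures who controls whom.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataService:
    name: str


@dataclass
class ITService:
    name: str
    # DSs controlled by this ITS (used in the vertical schema).
    data_services: List[DataService] = field(default_factory=list)


@dataclass
class ServiceDomain:
    name: str
    it_services: List[ITService] = field(default_factory=list)
    # DSs controlled directly by the SD (used in the horizontal schema).
    data_services: List[DataService] = field(default_factory=list)

    def schema(self) -> str:
        """Horizontal if the SD controls DSs directly, vertical otherwise."""
        return "horizontal" if self.data_services else "vertical"

    def controlled_directly(self) -> List[str]:
        """Names of the services that the SD itself controls."""
        return [s.name for s in self.it_services] + [d.name for d in self.data_services]
```

In the vertical case the SD only sees its ITSs, and each ITS carries its own list of DSs; in the horizontal case the DSs appear next to the ITSs in the SD.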
Fig. 2. Inter-enterprise collaboration based on service domains
According to the running example, the delivery SD orchestrates four IT services and two data services: picking merchandise, verifying merchandise, putting merchandise in parcels, delivery price, computing delivery price, and delivering merchandise. Keeping these ITSs and DSs in one place facilitates manageability and avoids extra composition work on the client side as well as exposing non-significant services like "Verifying merchandise" on the enterprise side.
Fig. 3. Multi-layer architecture
The multi-level architecture in Fig. 3 operates in a top-down way. It starts from the executive layer and goes through the resource and business layers. These layers are described in detail in the following.
3.2 Roles and Duties per Layer
Executive layer. In this layer, a SD consists of an Entry Module (EM), a Context Manager Module (CMM), a Service Orchestration Module (SOM), and an Aspect Activator Module (AAM). In Fig. 3, CMM, SOM, and AAM provide external interfaces to the SD. The EM is SOAP-based to receive users’ requests and return responses. In addition to these requests, the EM supports the administration of a SD. For example, an administrator can send a register command to add a new ITS to a given SD after signing up this ITS in the corresponding ITS registry. The register command can also be used to add a new orchestration schema to the orchestration schemas registry. When the EM receives a user’s request, it screens the orchestration schemas registry to select a suitable orchestration schema for this request and identify the best ITSs and DSs. The selection of this schema and of these services takes into account the customer context (detailed in Section 4.1). Afterwards, the selected orchestration schema is delivered to the SOM, which is basically an orchestration engine based on BPEL [16]. The SOM presents an external interface called Execution Control Interface (ECI) that lets users obtain information about the status of a SD that is under execution. This interface is very useful in case of external collaboration as it ensures the monitoring of the SD progress. This is a major difference between a SD and a regular Web service. In fact, with the ECI a SD is based on the glass box principle, which is opposite to the black box principle. In the glass box, the SD way-of-doing is visible to the environment and mechanisms are provided to monitor the execution progress of the SD.
In contrast, a Web service is seen as a black-box piece of functionality: it is described by its message interface and has no internal process structure that is visible to its environment. Finally, the last external interface, known as Context Detection Interface (CDI), is used by the CMM to detect and catch changes in context so that SD adaptability is guaranteed. This happens by selecting and injecting the right aspect with respect to the current context change. To this end, a SD uses the AAM to identify a suitable aspect for the current situation and to inject this aspect into the BPEL process.
Resource Layer. It consists of two sub-layers, namely sources and service instances.
• The sources sub-layer is populated with different registries, namely orchestration schemas, ITS, DS, context, and aspect. The orchestration schemas registry consists of a set of abstract processes based on ITSs and DSs, which are at a later stage implemented as executable processes. In addition, this sub-layer includes a set of data sources that DSs use for functioning. The processing of the requests submitted to these data sources is subject to access privileges that could be set by different bodies in the enterprise such as security administrators. The content of the data sources evolves over time following data source addition, withdrawal, or modification. According to the running example, customer and inventory databases are examples of data sources.
• The service instances sub-layer is populated with a set of service instances that originate from ITSs and DSs. On the one hand, DS instances collect data from BOs and not from data sources. This shields DSs from semantic issues and changes in data sources and lets them prepare data from disparate BOs that are scattered across the enterprise. The data that a DS collects could have different recipients including SDs, ITSs, or other DSs. Basically, a DS crawls the business-objects level looking for the BOs it needs. A DS consults the states that the BOs are currently in to collect the data that are reported in these states. This collection depends on the access rights (either public or private) that are put on data; some data might not be available to DSs. According to the running example, a DS could be developed to track the status of the parcels included in the delivery process (i.e., number of used parcels). This DS would have to access the order BO and, through an ITS, update the parcel BO. If the order BO is now in the orderChecked state, the DS knows the parcels that are confirmed for inclusion in the delivery and hence interacts with the right ITS to update the parcel BO. This is not the case if this BO is still in the orderUpdated state; some items might not have been confirmed yet. On the other hand, ITS instances implement business processes that characterize enterprises’ day-to-day activities. ITSs need the help of DSs and BOs for the data they produce and host, respectively. Update here means making a BO take on a new state, which is afterwards reflected on some data sources by updating their respective data. This is not the case with DSs, which consult BOs only.
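The way a DS consults the state of a BO, subject to access rights, can be sketched as follows. The state names orderChecked and orderUpdated come from the running example; the `BusinessObject` class, the `parcel_count` key, and the rule that unlisted data are private are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BusinessObject:
    name: str
    state: str
    data: Dict[str, object] = field(default_factory=dict)
    # Access right per data item: "public" or "private" (assumed private if unlisted).
    access: Dict[str, str] = field(default_factory=dict)

    def read(self, key: str) -> Optional[object]:
        """Consultation only: return a data item if it is public, else None."""
        if self.access.get(key, "private") != "public":
            return None
        return self.data.get(key)


def confirmed_parcels(order: BusinessObject) -> Optional[object]:
    """DS sketch: report the parcel count only once the order has been checked."""
    if order.state != "orderChecked":
        return None  # e.g., orderUpdated: some items may not be confirmed yet
    return order.read("parcel_count")
```

The DS never changes the BO; any update of the parcel BO would be left to an ITS.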
Business Layer. It (1) tracks the BOs that are developed and deployed according to the profile of the enterprise and (2) identifies the capabilities of each BO. In [17] we suggest that BOs should exhibit a goal-driven behavior instead of just responding to stimuli. The objective is to let BOs (i) screen the data sources that have the data they need, (ii) resolve data conflicts in case they arise, and (iii) inform other BOs about their capabilities of data source access and data mediation. As mentioned earlier, order, parcel, and customer are examples of BOs. These BOs are related to each other, e.g., an order is updated only upon inspection of customer record and order status. The data sources that these BOs access could be customer and inventory databases.
3.3 Layer Dependencies
Executive, resource, and business layers are connected to each other through a set of dependencies. We distinguish two types of dependencies: intra-layer dependencies and inter-layer dependencies.
1. Intra-layer dependencies: Within the resource layer two types of intra-layer dependencies are identified:
• Type and instance dependency: we differentiate between the ITS types that are published in the IT services registry, which are groups of similar (in terms of functionality) IT services, and the actual IT service instances that are made available for invocation. To illustrate the complexity of the dependencies that arise, we suggest hereafter a simple illustration of the number of IT service instances that could be obtained out of ITSs. Let us consider $ITS_t = \{ITS_{t_1}, \ldots, ITS_{t_\alpha}\}$ a set of $\alpha$ IT service types and $ITS_i = \{ITS_{i_1}, \ldots, ITS_{i_\beta}\}$ a set of $\beta$ service instances that exist in the service instances sub-layer. The mapping of $ITS_t$ onto $ITS_i$ is surjective and one-to-many. Assuming each IT service type $ITS_{t_j}$ has $N$ instantiations, $\beta = N \times \alpha$. The same dependency exists between DS types in the sources sub-layer and data service instances in the service instances sub-layer.
• Composition dependency involves orchestration schemas in two sub-layers, namely sources and service instances. Composition illustrates today’s business processes, which are generally developed using distributed and heterogeneous modules, for instance services. For an enterprise business process, it is critical to identify the services that are required and the data that these services require, make sure that these services collaborate, and develop strategies in case conflicts occur.
2. Inter-layer dependencies:
• Access dependency involves the service instances sub-layer and the business layer. Current practices expose services directly to data sources, which could hinder the reuse opportunities of these services. The opposite is adopted here by making services interact with BOs for their data needs. For a service, it is critical to identify the BOs it needs, comply with the access privileges of these BOs, and identify the next services that it will interact with after completing an orchestration process. Access dependency could be of different types, namely on-demand, periodic, or event-driven. In the on-demand case, services submit requests to BOs when needed. In the periodic case, services submit requests to BOs according to a certain agreed-upon plan. Finally, in the event-driven case, services submit requests to BOs in response to some event.
• Invocation dependency involves the business layer and the sources sub-layer. An invocation implements the technical mechanisms that allow a BO to access the available data sources in terms of consultation or update (see Fig. 4). A given BO includes a set of operations:
o A set of read methods, which provide various ways to retrieve and return one or more instances of the data included in a data source.
o A set of write methods, responsible for updating (inserting, modifying, deleting) one or more instances of the data included in the data sources.
o A set of navigation methods, responsible for traversing relationships from one data source to one or more data of a second data source. For example, the Customer BO can have two navigation methods, getDelivOrder and getElecOrder, to fetch for a given customer the delivery orders from a delivery database and the electronic orders from an electronic order database.
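The three groups of BO operations listed above can be sketched for the Customer BO. Only the method names getDelivOrder and getElecOrder come from the text; the in-memory dictionaries and lists that stand in for the customer, delivery, and electronic order databases, as well as the `customer_id` key, are assumptions for illustration.

```python
from typing import Dict, List, Optional


class CustomerBO:
    """Sketch of a BO with read, write, and navigation methods."""

    def __init__(self, customers: Dict[str, dict],
                 delivery_db: List[dict], elec_db: List[dict]) -> None:
        self._customers = customers      # stand-in for the customer database
        self._delivery_db = delivery_db  # stand-in for the delivery database
        self._elec_db = elec_db          # stand-in for the electronic order database

    def read(self, customer_id: str) -> Optional[dict]:
        """Read method: return a copy of one customer record, if any."""
        record = self._customers.get(customer_id)
        return dict(record) if record is not None else None

    def write(self, customer_id: str, **fields: object) -> None:
        """Write method: insert or modify one customer record."""
        self._customers.setdefault(customer_id, {}).update(fields)

    def getDelivOrder(self, customer_id: str) -> List[dict]:
        """Navigation method: delivery orders of a given customer."""
        return [o for o in self._delivery_db if o.get("customer_id") == customer_id]

    def getElecOrder(self, customer_id: str) -> List[dict]:
        """Navigation method: electronic orders of a given customer."""
        return [o for o in self._elec_db if o.get("customer_id") == customer_id]
```

Services would call these methods instead of querying the underlying data sources directly, which is the point of the access dependency.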
Fig. 4. Invocation dependency between the business objects and the data sources
3.4 Services Specifications
3.4.1 Data Service Specification
DSs come along with a good number of benefits that would smooth the development of enterprise SDs:
Access unification: data related to a BO might be scattered across independent data sources that could present three kinds of heterogeneities:
o Model heterogeneity: each data source has its own data model or data format (relational tables, WSDL with XML schema, XML documents, flat files, etc.).
o Interface heterogeneity: each type of data source has its own programming interface: JDBC/SQL for relational databases, REST/SOAP for Web services, file I/O calls, and custom APIs for packaged or homegrown data-centric applications (like BAPI for SAP).
The adoption of DSs relieves SOA application developers from having to directly cope with the first two forms of heterogeneity. That is, in the field of Web services all data sources are described using WSDL and invoked via REST or SOAP calls (which means having the same interface), and all data are in XML form and described using XML Schema (which means having the same data model).
Reuse and agility: the added value of SOA to application development is reuse and agility, but without flexibility at the data tier, this added value could quickly erode. Instead of relying on non-reusable proprietary code to access and manipulate data in monolithic application silos, DSs can be used and reused in multiple business processes. This simplifies the development and maintenance of service-oriented applications and introduces easy-to-use capabilities to use information in dynamic and real-time processes.
To define DSs, we took into account the interactions that should take place between the DSs and BOs. DSs are given access to BOs for consultation purposes. Each DS represents a specific data-driven request whose satisfaction requires the participation of several BOs. Table 1 suggests an example of the itemStatusOrder DS, whose role is to confirm the status of the items to include in a customer's order.
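A minimal sketch of what an itemStatusOrder DS could look like, assuming an input (an order identifier), an output (a status per item), and a read-only method over two BOs. This is not the structure of Table 1: the function signature, the dictionaries standing in for an order BO and an item BO, and the "unknown" default are hypothetical.

```python
from typing import Dict, List


def item_status_order(order_id: str,
                      order_bo: Dict[str, List[str]],
                      item_bo: Dict[str, str]) -> Dict[str, str]:
    """Input: an order identifier. Output: the status of each item in that order.

    Method: consult the order BO for the item list, then the item BO for each
    status. Both BOs are only read, never updated.
    """
    items = order_bo.get(order_id, [])
    return {item: item_bo.get(item, "unknown") for item in items}
```

For an order holding two items of which only one has a recorded status, the result maps the first item to that status and the second to "unknown".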
Table 1. DS service structure
In the above structure, the following arguments are used:
1. Input argument identifies the elements that need to be submitted to a DS. These elements could be obtained from different parties such as users and other DSs.
2. Output argument identifies the elements that a DS returns after its processing is complete.
3. Method argument identifies the actions that a DS implements in response to the access requests it runs over the different BOs.
Because DSs could request sensitive data from BOs, we suggest in Section 4.2.1 that appropriate policies should be developed so that data misuse cases are avoided. We refer to these policies as privacy policies.
3.4.2 IT Service Specification
For the IT service specification, we follow the one proposed by Papazoglou and Heuvel in [18], who specify an IT service as (1) a structural specification that defines service types, messages, and port types, (2) a behavioral specification that defines service operations, effects, and side effects of service operations, and (3) a policy specification that defines the policy assertions and constraints on the service.
Table 2. IT service structure
Based on this specification, we propose in our work a set of policies as follows:
− Business policies correspond to policy specification.
− Behavior policy corresponds to structural specification.
− Privacy policies correspond to behavioral specification.
4 Services Collaboration
4.1 Context-Aware Orchestration
The concept of context appears in many disciplines as meta-information that characterizes the specific situation of an entity, describes a group of conceptual entities, partitions a knowledge base into manageable sets, or acts as a logical construct to facilitate reasoning services [19]. The categorization of context is critical for the development of adaptable applications. Context includes implicit and explicit inputs. For example, user context can be deduced in an implicit way by the service provider, such as in a pervasive environment using physical or software sensors. Explicit context is determined precisely by the entities that the context involves. Nevertheless, despite the various attempts to suggest a context categorization, there is no agreed-upon categorization. Relevant information differs from one domain to another and depends on its effective use [20]. In this paper, we propose an OWL-based context categorization in Fig. 5. This categorization is dynamic as new sub-categories can be added at any time. Each context definition belongs to a certain category, which can be related to provider, customer, and collaboration.
Fig. 5. Ontology for categories of context
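The dynamic nature of this categorization, with three fixed top-level categories and sub-categories that can be added at any time, can be sketched in plain Python. This is not OWL and not the ontology of Fig. 5; the `ContextCategories` class and its methods are hypothetical names used only to illustrate the idea.

```python
from typing import Dict, Set


class ContextCategories:
    """Three fixed top-level categories with extensible sub-categories."""

    def __init__(self) -> None:
        self._sub: Dict[str, Set[str]] = {
            "provider": set(),
            "customer": set(),
            "collaboration": set(),
        }

    def add_subcategory(self, category: str, name: str) -> None:
        """Attach a new sub-category to an existing top-level category."""
        if category not in self._sub:
            raise KeyError(f"unknown top-level category: {category}")
        self._sub[category].add(name)

    def category_of(self, name: str) -> str:
        """Return the top-level category that a sub-category belongs to."""
        for category, subs in self._sub.items():
            if name in subs:
                return category
        raise KeyError(name)
```

New sub-categories are registered at run time with `add_subcategory`, while attempts to attach one to an unknown top-level category are rejected.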
In the following, we explain the different concepts that constitute our ontology-based model for context categorization:
− Provider-related context deals with the conditions under which providers can offer their SDs. Examples are performance attributes that measure the quality of a service: time, cost, QoS, and reputation. These attributes are used to model the competition between providers.
− Customer-related context represents the set of available information and metadata used by service providers to adapt their services. For example, a customer profile characterizes a user.
− Collaboration-related context represents the context of the business opportunity. We identify three sub-categories: location, time, and business domain. The location and time represent the geographical location and the period of time within which the business opportunity should be accomplished.
4.2 Policy Specification
As stated earlier, policies are primarily used first, to govern the collaboration between ITSs, DSs, and SDs and second, to reinforce specific aspects of this collaboration such as when an ITS accepts to take part in a SD and when a DS rejects a data request from an ITS because of a risk of access right violation. Because of the variety of these aspects, we decompose policies into different types and dissociate policies from the business logic that services implement. Any change in a policy should only “slightly” affect a service’s business logic and vice versa.
4.2.1 Types of Policies
Policies might be imposed by different types of initiators such as the service itself, the service provider, and the user who plans to use the service [21].
− Service driven policy is defined by the individual organizations that offer services. This description is not enough.
− Service flow driven policy is defined by the organizations offering a composite Web service.
− Customer driven policy is meant for future consumers of services.
Generally, a user has various preferences in selecting a particular service, and these preferences have to be taken into account during composition and during other steps such as selection and execution. For example, if two providers offer two services with the same functionality, the user would like to consider the cheaper one.
Policies are used in different application domains such as telecommunication and learning, to cite just a few, which supports the rationale of developing different types of policies. In this paper, we suggest the following types based on some of our previous works [11, 22]:
− Business policy defines the constraints that restrict the completion of a business process and determines how this process should be executed according to users’ requirements and organizations’ internal regulations. For example, a car loan
application needs to be treated within 48 hours and a bank account should maintain a minimum balance.
− Behavior policy supports the decisions that a service (ITS or DS) has to make when it receives a request from a SD to be part of the orchestration schema that is associated with this SD. In [22], we defined three behaviors, namely permission, dispensation, and restriction, which we continue to use in this paper. Additional details on these behaviors are given later.
− Privacy policy safeguards against the cases of data misuse by different parties, with focus here on DSs and ITSs that interact with BOs. For example, an ITS needs to have the necessary credentials to submit an update request to a BO. The credentials of an ITS could be based on its history of submitting similar requests and on its reputation level.
Fig. 6 illustrates how the three behaviors of a service (DS or ITS) are related to each other based on the execution outcome of behavior policies [22]. In this figure, dispensation(P) and dispensation(R) stand for dispensation related to permission and related to restriction, respectively. In addition, engagement(+) and engagement(-) stand for positive and negative engagement in a SD, respectively.
• Permission: a service accepts to take part in a service domain upon validation of its current commitments in other service domains.
• Restriction: a service does not wish to take part in a service domain for various reasons such as inappropriate rewards or lack of computing resources.
• Dispensation means that a service breaks either a permission or a restriction of engagement in a service domain. In the former case, the service refuses to engage despite the positive permission that is granted. This could be due to the unexpected breakdown of a resource upon which the service performance was scheduled. In the latter case, the service does engage despite the restrictions that are detected. The restrictions are overridden because of the priority level of the business scenario that the service domain implements, which requires an immediate handling of this scenario.
[Figure: decision flow with yes/no branches over Permission, Dispensation(P), Restriction, and Dispensation(R), ending in Engagement(+) or Engagement(-)]
Fig. 6. Behaviors associated with a service
In [11], Maamar et al. report that several types of policy specification languages exist. The selection of a policy specification language is guided by some requirements that need to be satisfied [23]: expressiveness to support the wide range of policy requirements arising in the system being managed, simplicity to ease the policy definition tasks for people with various levels of expertise, enforceability to ensure a mapping of policy specification into concrete policies for various platforms, scalability to guarantee adequate performance, and analyzability to allow reasoning about and over policies. In this paper, the Web Services Policy Language (WSPL) is adopted. WSPL syntax is based on the OASIS eXtensible Access Control Markup Language (XACML) standard (www.oasis-open.org/committees/download.php/2406/oasis-xacml-1.0.pdf). Listing 1 suggests a specification of a behavior policy with focus on privacy in WSPL. It shows an example of an ITS that checks the minimum age and income of a person prior to approving a car loan application.
Listing. 1. A behavior policy specification for an ITS
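The check that this listing describes can be illustrated independently of the policy language. The following is a minimal Python sketch and not the WSPL specification itself; the thresholds (18 years, an annual income of 20,000) and all identifiers are hypothetical, since the text does not state concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoanApplicant:
    age: int
    annual_income: float


# Hypothetical thresholds; the source gives no concrete values.
MIN_AGE = 18
MIN_ANNUAL_INCOME = 20_000.0


def loan_policy_permits(applicant: LoanApplicant) -> bool:
    """Return True only if both the minimum-age and minimum-income constraints hold."""
    return applicant.age >= MIN_AGE and applicant.annual_income >= MIN_ANNUAL_INCOME


if __name__ == "__main__":
    print(loan_policy_permits(LoanApplicant(age=30, annual_income=45_000.0)))
    print(loan_policy_permits(LoanApplicant(age=17, annual_income=45_000.0)))
```

Keeping the thresholds outside the function mirrors the idea that a policy is declared separately from the service logic that evaluates it.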
Listing 2 suggests a specification of a business policy in WSPL. It shows an example of a DS that checks the possibility of taking part in a service domain. In addition to the arguments that form WSPL-defined policies, we added further arguments for the purpose of tracking the execution of these policies. These additional arguments are as follows:
− Purpose: describes the rationale of developing a policy P.
− Monitoring authority: identifies the party that checks the applicability of a policy P so that the outcomes of this policy are enforced. A service provider or a policy developer are examples of such parties.
Listing. 2. A business policy specification for a DS
− Scope (local or global): identifies the parties that are involved in the execution of a policy P. “Local” means that the policy involves a specific service, whereas “global” means that the policy involves different services.
− Side-effect: describes the policies that could be triggered following the completion of policy P.
− Restriction: limits the applicability of a policy P according to different factors such as time (e.g., business hours) and location (e.g., departments affected by policy P performance).
4.3 Service Domain Adaptability Using Aspects
In the following, we describe how we define and implement a context-adaptive service domain using AOP.
4.3.1 Rationale of AOP
Our use of AOP is based on two arguments. First, AOP enables the modularization of crosscutting concerns, which is crucial to separate context information from the business logic. For example, in the Delivery Service Domain, an aspect related to the calculation of extra fees could be defined in case there is a change in the delivery date. Second, AOP promotes the dynamic weaving principle: aspects are activated and deactivated at runtime. Consequently, a BPEL process can be dynamically altered upon request. For the needs of SD adaptation, we suggest the following improvements to existing AOP techniques: runtime activation of aspects in the BPEL process to enable dynamic adaptation according to context changes, and aspect selection to enable customer-specific contextualization of the Service Domain.
4.3.2 Using Policies to Express Contexts
Modeling context is a crucial issue that needs to be addressed to assist context-aware applications. By context modeling we mean the language that will be used to define both service and enterprise collaboration contexts. Since there is a diversity of contextual information, we find several context modeling languages such as ConteXtML [24], contextual schemas [25], CxBR (context-based reasoning) [26], and CxG (contextual graphs) [27].
These languages provide the means for defining context in specific application domains such as pervasive and mobile computing. All these representations have
strengths and weaknesses. As stated in [28], lack of generality is the most frequent drawback: usually, each representation is suited to only a specific type of application and expresses a narrow vision of the context. Consequently, these representations offer little or no support for defining context in Web-service-based collaboration scenarios. In this paper, we model the different types of context based on policies. The relation between context and policies is captured in the definitions below:
Definition 1. A service context Ctxt is a pair (Ctxt-name, P) where Ctxt-name corresponds to the context name derived from the context ontology (Fig) and P is the policy related to the given context Ctxt. Let P-set = {P1, P2, …, Pn} denote the set of policies and SCx = {Cx1, Cx2, …, Cxn} the set of context properties related to a particular ITS or DS. We express the mapping between ITS or data service contexts and policies with the mapping function MFs: SCx → P-set, which gives the policies related to a given ITS or data service.
Definition 2. A customer context Custxt is a pair (Custxt-name, P) where Custxt-name corresponds to the context name derived from the context ontology (Fig) and P is the policy related to the given context Custxt. Let P-set = {P1, P2, …, Pn} denote the set of policies and CCx = {Cx1, Cx2, …, Cxn} the set of context properties related to a particular customer. As in Definition 1, we define a mapping function that retrieves the set of policies related to a given customer context: MFc: CCx → P-set.
In these definitions, context is described with policies. Consequently, to express context we first need to express policies. We introduce the specification of context (customer, collaboration, and service contexts) in WSPL. Introducing the context concept in WSPL comes from the need to specify certain constraints that can depend on the environment in which the customer, the service, and the business collaboration are operational. For instance, a customer context that depicts a security requirement can be specified as follows.
Listing. 3. A customer context specified as a policy
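Definitions 1 and 2 can also be read as a small data model: a context is a (name, policy) pair, and the mapping functions MFs and MFc look up the policies attached to a set of context properties. The sketch below is one possible Python rendering under that reading; the class names, the helper functions, and the example values ("security", "SecuredTransaction", and so on) are hypothetical and do not come from the listings.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Policy:
    name: str


@dataclass(frozen=True)
class Context:
    """A context in the sense of Definitions 1 and 2: a (name, policy) pair."""
    name: str
    policy: Policy


def build_mapping(contexts: Iterable[Context]) -> Dict[str, Set[Policy]]:
    """Represent a mapping function (MFs or MFc) as a lookup from context name to policies."""
    mapping: Dict[str, Set[Policy]] = {}
    for ctx in contexts:
        mapping.setdefault(ctx.name, set()).add(ctx.policy)
    return mapping


def policies_for(mapping: Dict[str, Set[Policy]], context_names: Iterable[str]) -> Set[Policy]:
    """Apply the mapping to a set of context properties and collect the related policies."""
    result: Set[Policy] = set()
    for name in context_names:
        result |= mapping.get(name, set())
    return result


if __name__ == "__main__":
    # Hypothetical customer contexts and policies, for illustration only.
    customer_contexts = [
        Context("security", Policy("SecuredTransaction")),
        Context("loyalty", Policy("CreditCardAccepted")),
    ]
    mfc = build_mapping(customer_contexts)
    print(sorted(p.name for p in policies_for(mfc, ["security"])))
```

Unknown context names simply map to an empty policy set here; a real implementation might prefer to reject them.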
4.3.3 Controlled Aspect Injection through Policies
We show how policies and AOP can work hand in hand. Policies related to customer and collaboration contexts are used to control the aspect injection within an SD. An SD provides a set of adaptation actions that are context dependent. We implement these actions as a set of aspects in order to avoid invasive code in the functional service implementation. An aspect includes a pointcut that matches a given ITS or
data service and one or more advices. These advices refer to the context-dependent adaptation actions of this service. Advices are defined as Java methods and pointcuts are specified in XML format. Our implementation approach for the controlled aspect injection through policies is presented in Fig. 7. In this figure, the Aspect Activator Module, previously presented in Fig. 3, includes the Aspect Manager Module (AMM), the Matching Module (MM), and the Weaver Module (WM).
− The AMM is responsible for adding new aspects to a corresponding aspect registry. In addition, the AMM can deal with a new advice implementation, which could be added to this registry. The aspect registry contains the method names of the different advices related to a given ITS or data service.
− The MM is the cornerstone of the proposed aspect injection approach. It receives matching requests from the AMM and returns one or a list of matched aspects.
− The WM is based on an AOP mechanism known as weaving. The WM performs run-time weaving, which consists of injecting an advice implementation into the core logic of an ITS or data service.
The control of an aspect injection into a DS or ITS is as follows. Once a context-dependent ITS or data service operation is reached, the Context Manager Module sends the AAM the service’s ID and the ID of its context-dependent operation. Then, the AMM identifies the set of aspects that can be applied to the ITS or DS based on the information sent by the Context Manager Module (i.e., the service’s ID and the operation’s ID) (actions 1 and 2). The set of aspects as well as the customer policies are transmitted to the Matching Module, which returns the aspects that match the customer and collaboration policies. The Matching Module is based on a matching algorithm and uses a domain ontology. Finally, the WM integrates the advice implementation into the core logic of the service. By doing so, the service will execute the appropriate aspect in response to the current context (customer and collaboration contexts).
Fig. 7. Controlled aspect injection through policies
For illustration purposes, consider a payment ITS that is aware of past interactions with customers. For loyal customers, credit card payment is accepted, but a bank transfer is required for new customers. Hence, the payment operation depends on the customer context, i.e., loyal or new. The context-dependent behaviors of the payment ITS are exposed as a set of aspects. Three of them are depicted in Listing 4.
///Aspect 1
///Aspect 2
///Aspect 3
Listing. 4. The three aspects related to the payment ITS
For example, the advice of Aspect 1 is expressed as a Java class, which is executed instead of the operation captured by the pointcut (line 9). The join point, where the advice is woven, is the payment operation (line 10). The pointcut is expressed as a condition: if Customer = "Loyal" (i.e., "past interaction = Yes"), the advice uses the credit card number to perform the customer payment. Consider now the customer-related context described previously, which specifies a security requirement. Based on this requirement, when executing the payment service, the Matching Module will determine that only Aspect 3, with secured transaction, should be applied. This aspect is then transmitted to the Weaver Module in order to be injected into the payment service.
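The registry lookup, matching, and weaving steps described in this section can be sketched as follows. This is an illustrative Python sketch, not the Java/XML implementation of the paper: matching is reduced to a plain set-inclusion test instead of the ontology-based matching algorithm, weaving is modeled as replacing the operation with the advice, and the role of Aspect 2 (bank transfer for new customers) as well as all identifiers and property labels are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Request = Dict[str, str]
Operation = Callable[[Request], str]


@dataclass(frozen=True)
class Aspect:
    name: str
    service_id: str              # pointcut: the service the aspect applies to
    operation_id: str            # pointcut: the context-dependent operation
    properties: FrozenSet[str]   # hypothetical labels describing what the advice offers
    advice: Operation


def pay_by_credit_card(request: Request) -> str:
    return f"credit card payment for {request['customer']}"


def pay_by_bank_transfer(request: Request) -> str:
    return f"bank transfer payment for {request['customer']}"


def pay_secured(request: Request) -> str:
    return f"secured payment for {request['customer']}"


# Aspect registry (hypothetical content modeled on the payment example).
ASPECT_REGISTRY: List[Aspect] = [
    Aspect("Aspect 1", "payment", "pay", frozenset({"loyal"}), pay_by_credit_card),
    Aspect("Aspect 2", "payment", "pay", frozenset({"new"}), pay_by_bank_transfer),
    Aspect("Aspect 3", "payment", "pay", frozenset({"secured"}), pay_secured),
]


def candidate_aspects(service_id: str, operation_id: str) -> List[Aspect]:
    """AMM step: select the aspects registered for a service operation."""
    return [a for a in ASPECT_REGISTRY
            if a.service_id == service_id and a.operation_id == operation_id]


def match(aspects: List[Aspect], required: FrozenSet[str]) -> List[Aspect]:
    """MM step (simplified): keep aspects whose properties cover the policy requirements."""
    return [a for a in aspects if required <= a.properties]


def weave(original: Operation, aspect: Aspect) -> Operation:
    """WM step: return an operation in which the advice runs instead of the original."""
    def woven(request: Request) -> str:
        return aspect.advice(request)
    return woven


def default_payment(request: Request) -> str:
    return f"default payment for {request['customer']}"


if __name__ == "__main__":
    matched = match(candidate_aspects("payment", "pay"), frozenset({"secured"}))
    pay = weave(default_payment, matched[0])
    print(matched[0].name, "->", pay({"customer": "c42"}))
```

With a customer policy that requires a secured transaction, only Aspect 3 survives the matching step, which corresponds to the outcome described in the text.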
5 Related Work
We identify two types of work related to our proposal: on the one hand, proposals that come from the data engineering field and offer approaches for data service modeling and development; on the other hand, proposals that focus specifically on the adaptation of ITSs (Web services).
Data services & the SOA software industry. Data services have gained considerable attention from SOA software industry leaders over the last three years. Many products are currently offered or being developed to make the creation of Data services easier than ever, to cite a few: AquaLogic by BEA Systems [29], Astoria by Microsoft [30], MetaMatrix by Red Hat [31], Composite Software [9], Xcalia [32], and IBM [33]. These products integrate the enterprise’s data sources and provide uniform access to data through Data services. As a representative example,
BEA’s AquaLogic data service is a collection of functions that all have a common output schema, accept different sets of parameters, and are implemented via individual XQuery expressions. In a simplified example, a Data Service exports a set of functions returning Customer objects, where one function takes as input the customer’s last name, another one her city and state, and so on. AquaLogic exports these Data services to SOA application developers as Data Web Services, where functions become operations. Microsoft’s Astoria project, also known as ADO.NET Data Services, is a REST-based framework that allows releasing data via flexible data services and well-known industry standards (JSON and Atom). As opposed to message-oriented frameworks like SOAP-based services, REST-based services use basic HTTP requests (GET, POST, PUT, and DELETE) to perform CRUD (Create, Read, Update, and Delete) operations. Such query patterns allow navigating through data, following the links established with the data schema. For example, /Customers('PKEY1')/Orders(1)/Employees returns the employees that created sales order 1 for the customer with a key of 'PKEY1' (source: http://msdn.microsoft.com/en-us/library/cc907912.aspx). In addition, most commercial database products incorporate mechanisms to export database functionalities as Data Web services. Representative examples are the IBM Document Access Definition Extension (DADX) technology (DB2 XML Extender1) and the Native XML Web Services for Microsoft SQL Server 2005 [34]. DADX is part of the IBM DB2 XML Extender, an XML/relational mapping layer, and facilitates the development of Web services on top of relational databases that can, among other things, execute SQL queries and retrieve relational data as XML.
Web services adaptation. Regarding the adaptation of Web services to context changes [35, 36], many research efforts are ongoing.
In the proposed work, we focus specifically on the adaptation of a process. Some research efforts from the workflow community address the need for adaptability. They focus on formal methods to make the workflow process able to adapt to changes in the environment conditions. For example, the authors of [37] propose eFlow with several constructs to achieve adaptability. They use parallel execution of multiple equivalent services and the notion of a generic service that can be replaced by a specific set of services at runtime. However, adaptability remains insufficient and vendor specific. Moreover, many adaptation triggers considered by workflow adaptation, such as infrastructure changes, are not relevant for Web services because services hide all implementation details and only expose interfaces described in terms of types of exchanged messages and message exchange patterns. In addition, the authors of [38] extend existing process modeling languages to add context-sensitive regions (i.e., parts of the business process that may have different behaviors depending on context). They also introduce context change patterns as a means to identify the contextual situations (and especially context change situations) that may have an impact on the behavior of a business process. Furthermore, they propose a set of transformation rules that allow generating a BPEL-based business process from a context-sensitive business process. However, the context change patterns that regulate the context changes are specific to their running example, with no emphasis on proposing more generic patterns.
1 Go online to http://www-306.ibm.com/software/data/db2/extenders/xmlext/
There are a few works using aspect-based adaptability in BPEL. In [39], the authors present an aspect-oriented extension to BPEL, AO4BPEL, which allows dynamically adaptable BPEL orchestration. The authors combine business rules modeled as aspects with a BPEL orchestration engine. When implementing rules, the choice of the pointcut depends only on the activities (invoke, reply, or sequence). Business rules in this work are very simple and do not express a pragmatic adaptability constraint such as context change in our case. Another work [40] proposes a policy-driven adaptation and dynamic specification of aspects to enable instance-specific customization of the service composition. However, the authors do not mention how they present the aspect advices or how they handle the pointcuts.
6 Conclusion
In this paper, we presented a multi-level architecture that supports the design and development of a high-level type of service known as Service Domain, which orchestrates a set of related ITSs and DSs. Service Domain enhances the Web service concept to tackle the challenges that E-Business collaboration poses. In addition, to address enterprise adaptability to context changes, we made Service Domain sensitive to context. We enhanced BPEL execution with AOP mechanisms. We have shown that AOP enables crosscutting and context-sensitive logic to be factored out of the service orchestration and modularized into aspects. Last but not least, we illustrated the role of policies and context in a Service Domain. Different types of policies were proposed and then used, first, to manage the participation of DSs and ITSs in SDs and, second, to control aspect injection within the SD. In terms of future work, we plan to complete the SD multi-level architecture and conduct a complete empirical study of our approach.
References
1. Carey, M., et al.: Integrating enterprise information on demand with XQuery. XML Journal 2(6/7) (2003)
2. Yang, S.J.H., et al.: A new approach for context aware SOA. In: Proc. e-Technology, e-Commerce and e-Service, EEE 2005, pp. 438–443 (2005)
3. Gorton, S., et al.: StPowla: SOA, Policies and Workflows, pp. 351–362 (2007)
4. Arsanjani, A.: Service-oriented modeling and architecture (2004), http://www.ibm.com/developerworks/library/ws-soa-design1/
5. Erl, T.: Service-Oriented Architecture (SOA): Concepts, Technology, and Design, p. 792. Prentice Hall, Englewood Cliffs (2005)
6. Huang, Y., et al.: A Service Management Framework for Service-Oriented Enterprises. In: Proceedings of the IEEE International Conference on E-Commerce Technology (2004)
7. Braun, C., Winter, R.: Integration of IT Service Management into Enterprise Architecture. In: Proc. 22nd Annual ACM Symposium on Applied Computing, SAC 2007 (2007)
8. Gilpin, M., Yuhanna, N.: Information-As-A-Service: What’s Behind This Hot New Trend? (2007), http://www.forrester.com/Research/Document/Excerpt/0,7211,41913,00.html
9. Composite Software: SOA Data Services Solutions, technical report (2008), http://compositesoftware.com/solutions/soa.html
10. Lupu, E., Sloman, M.: Conflicts in Policy-Based Distributed Systems Management. IEEE Transactions on Software Engineering 25(6) (1999)
11. Maamar, Z., et al.: Using policies to manage composite Web services. IT Professional 8(5) (2006)
12. Coutaz, J., et al.: Context is key. Communications of the ACM 48(3) (2005)
13. Keidl, M., Kemper, A.: A Framework for Context-Aware Adaptable Web Services (Demonstration). In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 826–829. Springer, Heidelberg (2004)
14. AOP, Aspect-Oriented Software Development (2007), http://www.aosd.net
15. AspectJ, The AspectJ Programming Guide (2007), http://dev.eclipse.org/viewcvs/indextech.cgi/~checkout~aspectj-home/doc/progguide/index.html
16. Andrews, T., et al.: Business Process Execution Language for Web Services (2003), http://www.ibm.com/developerworks/library/specification/ws-bpel/
17. Maamar, Z., Sutherland, J.: Toward Intelligent Business Objects. Communications of the ACM 43(10)
18. Papazoglou, M.P., Heuvel, W.-J.v.d.: Service-oriented design and development methodology. International Journal of Web Engineering and Technology (IJWET) 2(4), 412–442 (2006)
19. Benslimane, D., Arara, A., Falquet, G., Maamar, Z., Thiran, P., Gargouri, F.: Contextual Ontologies: Motivations, Challenges, and Solutions. In: Yakhno, T., Neuhold, E.J. (eds.) ADVIS 2006. LNCS, vol. 4243, pp. 168–176. Springer, Heidelberg (2006)
20. Mostefaoui, S.K., Mostefaoui, G.K.: Towards A Contextualisation of Service Discovery and Composition for Pervasive Environments. In: Proc. Workshop on Web-services and Agent-based Engineering (2003)
21. Dan, A.: Use of WS-Agreement in Job Submission (September 2004)
22. Maamar, Z., et al.: Towards a context-based multi-type policy approach for Web services composition. Data & Knowledge Engineering 62(2) (2007)
23. Damianou, N., Dulay, N., Lupu, E.C., Sloman, M.: The Ponder policy specification language. In: Sloman, M., Lobo, J., Lupu, E.C. (eds.) POLICY 2001. LNCS, vol. 1995, pp. 18–38. Springer, Heidelberg (2001)
24. Ryan, N.: ConteXtML: Exchanging contextual information between a Mobile Client and the FieldNote Server, http://www.cs.kent.ac.uk/projects/mobicomp/fnc/ConteXtML.html
25. Turner, R.M.: Context-mediated behavior for intelligent agents. Human-Computer Studies 48(3), 307–330 (1998)
26. Gonzales, A.J., Ahlers, R.: Context-based representation of intelligent behavior in training simulations. International Transactions of the Society for Computer Simulation, 153–166 (1999)
27. Brezillon, P.: Context-based modeling of operators’ Practices by Contextual Graphs. In: Proc. 14th Mini Euro Conference in Human Centered Processes (2003)
28. Bucur, O., et al.: What Is Context and How Can an Agent Learn to Find and Use it When Making Decisions? In: Proc. International Workshop of Central and Eastern Europe on Multi-Agent Systems, pp. 112–121 (2005)
29. Carey, M.: Data delivery in a service-oriented world: the BEA AquaLogic data services platform. In: Proc. 2006 ACM SIGMOD International Conference on Management of Data (2006)
30. Microsoft Corporation: ADO.NET Data Services (also known as Project Astoria) (2007), http://astoria.mslivelabs.com/
31. Red Hat: MetaMatrix Enterprise Data Services Platform (2007), http://www.redhat.com/jboss/platforms/dataservices/
32. Xcalia Inc.: Xcalia Data Access Services (2009), http://www.xcalia.com/products/xcalia-xdasdata-access-service-SDO-DAS-data-integration-through-web-services.jsp
33. Williams, K., Daniel, B.: SOA Web Services - Data Access Service. Java Developer’s Journal (2006)
34. Microsoft, Native XML Web services for Microsoft SQL Server (2005), http://msdn2.microsoft.com/en-us/library/ms345123.aspx
35. Maamar, Z., et al.: Towards a context-based multi-type policy approach for Web services composition. Data & Knowledge Engineering 62(2), 327–351 (2007)
36. Bettini, C., et al.: Distributed Context Monitoring for the Adaptation of Continuous Services. World Wide Web 10(4), 503–528 (2007)
37. Casati, F., Ilnicki, S., Jin, L., Krishnamoorthy, V., Shan, M.-C.: Adaptive and Dynamic Service Composition in eFlow. In: Wangler, B., Bergman, L.D. (eds.) CAiSE 2000. LNCS, vol. 1789, p. 13. Springer, Heidelberg (2000)
38. Modafferi, S., et al.: A Methodology for Designing and Managing Context-Aware Workflows. In: Mobile Information Systems II, pp. 91–106 (2005)
39. Charfi, A., Mezini, M.: AO4BPEL: An Aspect-oriented Extension to BPEL. World Wide Web 10(3), 309–344 (2007)
40. Erradi, A., et al.: Towards a Policy-Driven Framework For Adaptive Web Services Composition. In: Proceedings of the International Conference on Next Generation Web Services Practices 2005, pp. 33–38 (2005)
Facilitating Controlled Tests of Website Design Changes Using Aspect-Oriented Software Development and Software Product Lines
Javier Cámara1 and Alfred Kobsa2
1 Department of Computer Science, University of Málaga, Campus de Teatinos, 29071 Málaga, Spain
[email protected]
2 Dept. of Informatics, University of California, Irvine, Bren School of Information and Computer Sciences, Irvine, CA 92697, USA
[email protected]
Abstract. Controlled online experiments in which envisaged changes to a website are first tested live with a small subset of site visitors have proven to predict the effects of these changes quite accurately. However, these experiments often require expensive infrastructure and are costly in terms of development effort. This paper advocates a systematic approach to the design and implementation of such experiments in order to overcome the aforementioned drawbacks by making use of Aspect-Oriented Software Development and Software Product Lines.
1 Introduction
During the past few years, e-commerce on the Internet has experienced remarkable growth. For online vendors like Amazon, Expedia, and many others, creating a user interface that maximizes sales is therefore crucially important. Different studies [11,10] revealed that small changes to the user interface can cause surprisingly large differences in the number of purchases made, and even minor differences in sales can make a big difference in the long run. Therefore, interface modifications must not be taken lightly but should be carefully planned.
Experience has shown that it is very difficult for interface designers and marketing experts to foresee how users react to small changes in websites. The behavioral difference that users exhibit at Web pages with minimal differences in structure or content quite often deviates considerably from all plausible predictions that designers had initially made [22,30,27]. For this reason, several techniques have been developed by industry that use actual user behavior to measure the benefits of design modifications [17]. These techniques for controlled online experiments on the Web can help to anticipate users’ reactions without putting a company’s revenue at risk. This is achieved by implementing and studying the effects of modifications on a tiny subset of users rather than testing new ideas directly on the complete user base.
Although the theoretical foundations of such experiments have been well established, and interesting practical lessons compiled in the literature [16], the infrastructure required to implement such experiments is expensive in most cases and does not support a systematic approach to experimental variation. Rather, the support for each test is usually crafted for specific situations. In this work, we advocate a systematic approach to the design and implementation of such experiments based on Software Product Lines [7] and Aspect-Oriented Software Development (AOSD) [12].
Section 2 provides an overview of the different techniques involved in online tests, and Section 3 points out some of their shortcomings. Section 4 describes our systematic approach to the problem, giving a brief introduction to software product lines and AOSD. Section 5 introduces a prototype tool that we developed to test the feasibility of our approach. Section 6 compares our proposal to currently available solutions, and Section 7 presents some conclusions and future work.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 116–135, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Controlled Online Tests on the Web: An Overview
The underlying idea behind controlled online tests of a Web interface is to create one or more different versions of it by incorporating new or modified features, and to test each version by presenting it to a randomly selected subset of users in order to analyze their reactions. User response is measured along an overall evaluation criterion (OEC) or fitness function, which indicates the performance of the different versions or variants. A simple yet common OEC in e-commerce is the conversion rate, that is, the percentage of site visits that result in a purchase. OECs may however also be very elaborate, and consider different factors of user behavior. Controlled online experiments can be classified into two major categories, depending on the number of variables involved:
Fig. 1. Checkout screen: variants A (original, left) and B (modified, right)1
1 © 2007 ACM, Inc. Included by permission.
– A/B, A/B/C, ..., A/../N split testing: These tests compare one or more variations of a single site element or factor, such as a promotional offer. Site developers can quickly see which variation of the factor is most persuasive and yields the highest conversion rates. In the simplest case (A/B test), the original version of the interface is served to 50% of the users (A or Control Group), and the modified version is served to the other 50% (B or Treatment Group2). While A/B tests are simple to conduct, they are often not very informative. For instance, consider Figure 1, which depicts the original version and a variant of a checkout example taken from [11].3 This variant has been obtained by modifying 9 different factors. While an A/B test tells us which of two alternatives is better, it does not yield reliable information on how combinations of the different factors influence the performance of the variant.
– Multivariate testing: A multivariate test can be viewed as a combination of many A/B tests, whereby all factors are systematically varied. Multivariate testing extends the effectiveness of online tests by allowing the impact of interactions between factors to be measured. A multivariate test can, e.g., reveal that two interface elements yield an unexpectedly high conversion rate only when they occur together, or that an element that has a positive effect on conversion loses this effect in the presence of other elements.
The execution of a test can be logically separated into two steps, namely (a) the assignment of users to the test, and to one of the subgroups for each of the interfaces to be tested, and (b) the subsequent selection and presentation of this interface to the user. The implementation of online tests partly blurs the two different steps.
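The conversion rate introduced at the beginning of this section as a simple OEC can be computed per variant as shown below. This is a minimal Python sketch with hypothetical visit and purchase counts; it only compares raw rates and deliberately leaves out any test of statistical significance, which a real experiment would need.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VariantStats:
    visits: int
    purchases: int


def conversion_rate(stats: VariantStats) -> float:
    """Percentage of site visits that resulted in a purchase."""
    if stats.visits == 0:
        return 0.0
    return 100.0 * stats.purchases / stats.visits


def best_variant(results: Dict[str, VariantStats]) -> str:
    """Return the name of the variant with the highest conversion rate."""
    return max(results, key=lambda name: conversion_rate(results[name]))


if __name__ == "__main__":
    # Hypothetical counts for an A/B test, for illustration only.
    results = {
        "A": VariantStats(visits=1000, purchases=30),
        "B": VariantStats(visits=1000, purchases=50),
    }
    for name, stats in results.items():
        print(name, conversion_rate(stats))
    print("best:", best_variant(results))
```

A more elaborate OEC would replace `conversion_rate` with a function over several behavioral measures, while the comparison step could stay the same.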
The assignment of users to different subgroups is generally randomized, but different methods exist, such as:
– Pseudo-random assignment with caching: uses a pseudo-random number generator coupled with some form of caching in order to preserve consistency between sessions (i.e., a user should be assigned to the same interface variant on successive visits to the site); and
– Hash and partitioning: assigns a unique user identifier that is stored either in a database or in a cookie. The entire set of identifiers is then partitioned, and each partition is assigned to a variant. This second method is usually preferred due to scalability problems with the first method.
Three implementation methods are being used for the selection and presentation of the interface to the user:
– Traffic splitting: In order to generate the different variants, different implementations are created and placed on different physical or virtual servers. Then, by using a proxy or a load balancer which invokes the randomization algorithm, a user’s traffic is diverted to the assigned variant.
– Server-side selection: All the logic which invokes the randomization algorithm and produces the different variants for users is embedded in the code of the site.
– Client-side selection: Assignment and generation of variants is achieved through dynamic modification of each requested page at the client side using JavaScript.
2 In reality, the treatment group will only comprise a tiny fraction of the users of a website, so as to keep losses low if the conversion rate of the treatment version should turn out to be poorer than that of the existing version.
3 Eisenberg reports that Interface A resulted in 90% fewer purchases, probably because potential buyers who had no promotion code were put off by the fact that others could get lower prices.
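The hash and partitioning assignment mentioned above can be sketched as follows. The sketch assumes that a stable user identifier is already available (for example from a cookie); the use of SHA-256, the `experiment` label, and the equal-sized partitions are illustrative choices that are not prescribed by the text.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: str, variants: Sequence[str], experiment: str = "exp-1") -> str:
    """Map a user identifier deterministically to one of the variants.

    The identifier is hashed together with a (hypothetical) experiment label,
    and the hash space is split into len(variants) equal partitions.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, ["A", "B"]))
```

Because the result depends only on the identifier and the experiment label, a returning user is served the same variant without any per-user cache, which is the scalability advantage noted for this method. Serving the treatment to only a small fraction of users would require unequal partitions instead of the modulo split used here.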
3 Problems with Current Online Test Design and Implementation
The three implementation methods discussed above entail a number of disadvantages, which are a function of the choices made at the architectural level and not of the specific characteristics of an online experiment (such as the chosen OEC or the interface features being modified):
– Traffic splitting: Although traffic splitting does not require any changes to the code in order to produce the different user assignments to variants, the implementation of this approach is relatively expensive. The website and the code for the measurement of the OEC have to be replicated n times, where n is the number of tested combinations of different factors (number of possible variants). In addition to the complexity of creating each variant for the test manually by modifying the original website’s code (impossible in the case of multivariate tests involving several factors), there is also a problem associated with the hardware required for the execution of the test. If physical servers are used, a fleet of servers will be needed so that each of the tested variants is hosted on one of them. Likewise, if virtual servers are used, the amount of system resources required to accommodate the workload will easily exceed the capacity of the physical server, requiring the use of several servers and complicating the supporting infrastructure.
– Server-side selection: Extensive code modification is required if interface selection and presentation are performed at the server side. Not only do randomization and user assignment have to be embedded in the code, but branching logic also has to be added in order to produce the different interfaces corresponding to the different combinations of variants. In addition, the code may become unnecessarily complex, particularly if different combinations of factors are to be considered at the same time when tests are being run concurrently. However, if these problems are solved, server-side selection is a powerful alternative which allows deep modifications to the system and is cheap in terms of supporting infrastructure.
– Client-side selection: Although client-side selection is to some extent easier to implement than server-side selection, it suffers from the same shortcomings. In addition, the features subject to experimentation are far more
J. Cámara and A. Kobsa
limited (e.g., modifications which go beyond the mere interface are not possible, JavaScript must be enabled in the client browser, execution is error-prone, etc.).
Independent of the chosen form of implementation, substantial support for systematic online experimentation at a framework level is urgently needed. The framework will need to support the definition of the different factors and their possible combinations at the test design stage, and their execution at runtime. Being able to evolve a site safely by keeping track of each variant’s performance, as well as maintaining a record of the different experiments, is far preferable to executing isolated tests on an ad-hoc basis.
4 A Systematic Approach to Online Test Design and Implementation
To overcome the various limitations described in the previous section, we advocate a systematic approach to the development of online experiments. For this purpose, we rely on two different foundations: (i) software product lines provide the means to properly model the variability inherent in the design of the experiments, and (ii) aspect-oriented software development (AOSD) helps to reduce the effort and cost of implementing the variants of the test by capturing variation factors in aspects. The use of AOSD will also help in presenting variants to users, as well as simplifying user assignment and data collection. By combining these two foundations we aim at supplying developers with the necessary tools to design tests in a systematic manner, enabling the partial automation of variant generation and the complete automation of test deployment and execution.
4.1 Test Design Using Software Product Lines
Software Product Line models describe all requirements or features in the potential variants of a system. In this work, we use a feature-based model similar to the models employed by FODA [13] or FORM [14]. This model takes the form of a lattice of parent-child relationships which is typically quite large. Single systems or variants are then built by selecting a set of features from the model. Product line models allow the definition of the directly reusable (DR) or mandatory features which are common to all possible variants, and three types of discriminants or variation points, namely:
– Single adaptors (SA): a set of mutually exclusive features from which only one can be chosen when defining a particular system.
– Multiple adaptors (MA): a list of alternatives which are not mutually exclusive. At least one must be chosen.
– Options (O): a single optional feature that may or may not be included in a system definition.
Facilitating Controlled Tests of Website Design Changes
F1(MA) The cart component must include a checkout screen.
– F1.1(SA) There must be an additional “Continue Shopping” button present.
  • F1.1.1(DR) The button is placed on top of the screen.
  • F1.1.2(DR) The button is placed at the bottom of the screen.
– F1.2(O) There must be an “Update” button placed under the quantity box.
– F1.3(SA) There must be a “Total” present.
  • F1.3.1(DR) Text and amount of the “Total” appear in different boxes.
  • F1.3.2(DR) Text and amount of the “Total” appear in the same box.
– F1.4(O) The screen must provide discount options to the user.
  • F1.4.1(DR) There is a “Discount” box present, with amount in a box next to it on top of the “Total” box.
  • F1.4.2(DR) There is an “Enter Coupon Code” input box present on top of “Shipping Method”.
  • F1.4.3(DR) There must be a “Recalculate” button left of “Continue Shopping.”
Fig. 2. Feature model fragment corresponding to the checkout screen depicted in Figure 1
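The fragment in Figure 2 can be written down as a small tree structure. The Python sketch below is one possible encoding, not the representation used by FODA, FORM, or the tool described later; the names Kind, Feature, walk and checkout are assumptions introduced here for illustration, and the feature descriptions are shortened.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Kind(Enum):
    DR = "directly reusable (mandatory)"
    SA = "single adaptor (mutually exclusive alternatives)"
    MA = "multiple adaptor (at least one alternative)"
    O = "option (may or may not be included)"


@dataclass
class Feature:
    ident: str
    kind: Kind
    text: str
    children: List["Feature"] = field(default_factory=list)

    def walk(self):
        """Yield this feature and all of its descendants, parents first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Encoding of the Figure 2 fragment, with the labels used in the figure.
checkout = Feature("F1", Kind.MA, "Checkout screen", [
    Feature("F1.1", Kind.SA, '"Continue Shopping" button', [
        Feature("F1.1.1", Kind.DR, "Button on top of the screen"),
        Feature("F1.1.2", Kind.DR, "Button at the bottom of the screen"),
    ]),
    Feature("F1.2", Kind.O, '"Update" button under the quantity box'),
    Feature("F1.3", Kind.SA, '"Total" display', [
        Feature("F1.3.1", Kind.DR, "Text and amount in different boxes"),
        Feature("F1.3.2", Kind.DR, "Text and amount in the same box"),
    ]),
    Feature("F1.4", Kind.O, "Discount options", [
        Feature("F1.4.1", Kind.DR, '"Discount" box'),
        Feature("F1.4.2", Kind.DR, '"Enter Coupon Code" input box'),
        Feature("F1.4.3", Kind.DR, '"Recalculate" button'),
    ]),
])

feature_ids = [f.ident for f in checkout.walk()]
```

A variant then corresponds to a subset of feature_ids; whether a given subset is admissible depends on the discriminants attached to the features.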
In order to define the different interface variants that are present in an online test, we specify all common interface features as DR features in a product line model. Varying elements are modeled using discriminants. Different combinations of interface features will result in different interface variants. An example of such a feature model is given in Figure 2, which shows a fragment of a definition of some of the commonalities and discriminants of the two interface variants depicted in Figure 1.
Variants can be created manually by the test designer through the selection of the desired interface features in the feature model, or automatically by generating all the possible combinations of feature selections. Automatic generation is especially interesting in the case of multivariate testing. However, it is worth noting that not all combinations of feature selections are valid. For instance, if we intend to generate a variant which includes F1.3.1 in our example, that same selection cannot include F1.3.2 (single adaptor). Likewise, if F1.4 is selected, it is mandatory to include F1.4.1-F1.4.3 in the selection. These restrictions are introduced by the discriminants used in the product line model. If the restrictions are not satisfied, we have generated an invalid variant that should not be presented to users. Therefore, generating all possible feature combinations for a multivariate test is not enough for our purposes.
Fortunately, the feature model can be easily translated into a logical expression by using features as atomic propositions and discriminants as logical connectors. The logical expression of a feature model is the conjunction of the logical expressions for each of the sub-graphs in the lattice and is achieved using logical AND. If Gi and Gj are the logical expressions for two different sub-graphs, then the logical expression for the lattice is:
$$G_i \wedge G_j$$
Parent-child dependency is expressed using a logical AND as well. If $a_i$ is a parent requirement and $a_j$ is a child requirement such that the selection of $a_j$ is dependent on $a_i$, then $a_i \wedge a_j$. If $a_i$ also has other children $a_k \ldots a_z$, then:
$$a_i \wedge (a_k \wedge \ldots \wedge a_z)$$
The logical expression for a single adaptor discriminant is exclusive OR. If $a_i$ and $a_j$ are features such that $a_i$ is mutually exclusive to $a_j$, then $a_i \oplus a_j$. Multiple adaptor discriminants correspond to logical OR. If $a_i$ and $a_j$ are features such that at least one of them must be chosen, then $a_i \vee a_j$. The logical expression for an option discriminant is a bi-conditional⁴. If $a_i$ is the parent of another feature $a_j$, then the relationship between the two features is $a_i \leftrightarrow a_j$. Table 1 summarizes the relationships and logical definitions of the model. The general expression for a product line model is
$$G_1 \wedge G_2 \wedge \ldots \wedge G_n$$
where $G_i$ is $a_i \; R \; a_j \; R \; a_k \; R \ldots R \; a_n$ and $R$ is one of $\wedge$, $\vee$, $\oplus$, or $\leftrightarrow$. The logical expression for the checkout example feature model shown in Figure 2 is:
$$F1 \wedge ( F1.1 \wedge (F1.1.1 \oplus F1.1.2) \vee F1.2 \vee F1.3 \wedge (F1.3.1 \oplus F1.3.2) \vee F1.4 \leftrightarrow (F1.4.1 \wedge F1.4.2 \wedge F1.4.3) )$$
By instantiating all the feature variables in the expression to true if selected, and false if unselected, we can generate the set of possible variants and then test their validity using the algorithm described in [21]. A valid variant is one for which the logical expression of the complete feature model evaluates to true.

Table 1. Feature model relations and equivalent formal definitions

Feature Model Relation    Formal Definition
Sub-graph                 $G_i \wedge G_j$
Dependency                $a_i \wedge a_j$
Single adaptor            $a_i \oplus a_j$
Multiple adaptor          $a_i \vee a_j$
Option                    $a_i \leftrightarrow a_j$
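The relations in Table 1 can be checked mechanically. The following Python sketch is not the algorithm of [21]; it simply enumerates truth assignments by brute force. To keep it short, it covers only two sub-expressions that appear in the checkout formula, $F1.3 \wedge (F1.3.1 \oplus F1.3.2)$ and $F1.4 \leftrightarrow (F1.4.1 \wedge F1.4.2 \wedge F1.4.3)$, and conjoins them as if they were two sub-graphs. That grouping, and all function names, are assumptions made for illustration.

```python
from itertools import product

FEATURES = ["F1.3", "F1.3.1", "F1.3.2", "F1.4", "F1.4.1", "F1.4.2", "F1.4.3"]


def xor(a, b):
    """Single adaptor: exactly one of the two alternatives is selected."""
    return a != b


def iff(a, b):
    """Option (bi-conditional): both sides have the same truth value."""
    return a == b


def is_valid(sel):
    """Evaluate the two sub-expressions and conjoin them (G_i AND G_j)."""
    total = sel["F1.3"] and xor(sel["F1.3.1"], sel["F1.3.2"])
    discount = iff(sel["F1.4"],
                   sel["F1.4.1"] and sel["F1.4.2"] and sel["F1.4.3"])
    return total and discount


def selection(*chosen):
    """Build an assignment in which only the listed features are true."""
    return {name: name in chosen for name in FEATURES}


def valid_variants():
    """Enumerate every assignment and keep those satisfying the formula."""
    for values in product([False, True], repeat=len(FEATURES)):
        sel = dict(zip(FEATURES, values))
        if is_valid(sel):
            yield sel
```

Under these assumptions, selection("F1.3", "F1.3.1", "F1.3.2") is rejected because both alternatives of the single adaptor are chosen, and a selection that contains F1.4 but not all of F1.4.1 to F1.4.3 is rejected by the bi-conditional. Exhaustive enumeration grows exponentially with the number of features, so it is only practical for small models.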
Manual selection can also benefit from this approach since the test administrator can be guided in the process of feature selection by pointing out inconsistencies in the resulting variant as features are selected or unselected. Figure 3 depicts the feature selections for variants A and B of our checkout example. In the feature model, mandatory features are represented with black circles, whereas options are represented with white circles. White triangles express alternatives (single adaptors), and black triangles multiple adaptors.
4 $a_i \leftrightarrow a_j$ is true when $a_i$ and $a_j$ have the same value.
[Figure 3 panels: Variant A (Original) and Variant B. Each panel shows the feature tree (F1) Checkout Screen with (F1.1) Continue Button, children (F1.1.1) Placed Top and (F1.1.2) Placed Bottom; (F1.2) Update Button; (F1.3) Total Display, children (F1.3.1) Split Box and (F1.3.2) Same Box; and (F1.4) Discount, children (F1.4.1) Discount Box, (F1.4.2) Coupon Code Box and (F1.4.3) Recalculate Button.]
Fig. 3. Feature selections for the generation of variants A and B from Figure 1
As regards automatic variant generation, we must bear in mind that full factorial designs (i.e., testing every possible combination of interface features) provide the greatest amount of information about the individual and joint impacts of the different factors. However, obtaining a statistically meaningful number of cases for this type of experiment takes time, and handling a huge number of variants aggravates this situation. In our approach, the combinatorial explosion in multivariate tests is dealt with by bounding the parts of the hierarchy which descend from an unselected feature. This avoids the generation of all the variations derived from that specific part of the product line. In addition, our approach does not confine the test designer to a particular selection strategy. It is possible to integrate any optimization method for reducing the complexity of full factorial designs, such as hill climbing strategies like the Taguchi approach [28].
4.2 Case Study: Checkout Screen
Continuing with the checkout screen example described in Section 1, we introduce a simplified implementation of the shopping cart in order to illustrate our approach. We define a class ‘shopping cart’ (Cart) that allows for the addition and removal of different items (see Figure 4). This class contains a number of methods that render the different elements in the cart at the interface level, such as printTotalBox() or printDiscountBox(). These are private class methods called from within the public method printCheckoutTable(), which is intended
[Figure 4 class diagram]
General: +printHeader(), +printBanner(), +printMenuTop(), +printMenuBottom()
Cart: -shippingmethod, -subtotal, -tax, -total; +addItem(), +removeItem(), -printDiscountBox(), -printTotalBox(), -printCouponCodeBox(), -printShippingMethodBox(), -recalculateButton(), -continueShoppingButton(), +printCheckoutTable(), +doCheckout()
Item: -Id, -name, -price
User: -name, -email, -username, -password
Multiplicities shown on the associations: 1 and *, 1 and 1, 1 and 1.
Fig. 4. Classes involved in the shopping cart example
to render the main body of our checkout screen. A user’s checkout is completed when doCheckout() is invoked. On the other hand, the General class contains auxiliary functionality, such as representing common elements of the site (e.g., headers, footers and menus).
4.3 Implementing Tests with Aspects
Aspect-Oriented Software Development (AOSD) is based on the idea that systems are better programmed by separately specifying their different concerns (areas of interest), using aspects and a description of their relations with the rest of the system. Those specifications are then automatically woven (or composed) into a working system. This weaving process can be performed at different stages of the development, ranging from compile-time to run-time (dynamic weaving) [26]. The dynamic approach (Dynamic AOP or d-AOP) implies that the virtual machine or interpreter running the code must be aware of aspects and control the weaving process. This represents a remarkable advantage over static AOP approaches, considering that aspects can be applied and removed at run-time, modifying application behaviour during the execution of the system in a transparent way. With conventional programming techniques, programmers have to explicitly call methods available in other component interfaces in order to access their functionality, whereas the AOSD approach offers implicit invocation mechanisms for behavior in code whose writers were unaware of the additional concerns (obliviousness). This implicit invocation is achieved by means of join points. These are regions in the dynamic control flow of an application (method calls or executions, exception handling, field setting, etc.) which can be intercepted by an aspect-oriented program by using pointcuts (predicates which allow the quantification of join points) to match with them. Once a join point has been matched, the program can run the code corresponding to the new behavior
(advices) typically before, after, instead of, or around (before and after) the matched join point.
In order to test and illustrate our approach, we use PHP [25], one of the predominant programming languages in Web-based application development. It is an easy-to-learn language specifically designed for the Web, and has excellent scaling capabilities. Among the variety of AOSD options available for PHP, we have selected phpAspect [4], which is to our knowledge the most mature implementation so far, providing AspectJ⁵-like syntax and abstractions. Although there are other popular languages and platforms available for Web application development (Java Servlets, JSF, etc.), most of them provide similar abstractions and mechanisms. In this sense, our proposal is technology-agnostic and easily adaptable to other platforms.
Aspects are especially suited to overcome many of the issues described in Section 3. They are used for different purposes in our approach, which are described below.
Variant implementation. The different alternatives that have been used so far for variant implementation have important disadvantages, which we discussed in Section 3. These drawbacks include the need to produce different versions of the system code either by replicating and modifying it across several servers, or by using branching logic on the server or client sides. Using aspects instead of the traditional approaches offers the advantage that the original source code does not need to be modified, since aspects can be applied as needed, resulting in different variants. In our approach, each feature described in the product line is associated with one or more aspects which modify the original system in a particular way. Hence, when a set of features is selected, the appropriate variant is obtained by weaving the set of aspects associated with the selected features into the base code⁶, modifying the original implementation.
To illustrate how these variations are achieved, consider for instance the features labeled F1.3.1 and F1.3.2 in Figure 2. These two features are mutually exclusive and state that in the total box of the checkout screen, text and amount should appear in different boxes or in the same box, respectively. In the original implementation (Figure 1.A), text and amount appeared in different boxes, and hence there is no need to modify the behavior if F1.3.1 is selected. When F1.3.2 is selected though, we merely have to replace the behavior that renders the total box (implemented in the method Cart.printTotalBox()). We achieve this by associating an appropriate aspect with this feature. In Listing 1, by defining a pointcut that intercepts the execution of the total box rendering method, and applying an around-type advice, we are able to replace the method through which this particular element is being rendered at the interface. This approach to the generation of variants results in better code reusability (especially in multivariate testing) as well as reduced costs and efforts, since
5 AspectJ [9,15] is the de-facto standard in aspect-oriented programming languages.
6 That is, the code of the original system.
Listing 1. Rendering code replacement aspect
aspect replaceTotalBox{
    pointcut render:exec(Cart::printTotalBox(*));
    around(): render{
        /* Alternative rendering code */
    }
}
developers do not have to replicate or generate complete variant implementations. In addition, this approach is safer and cleaner since the system logic does not have to be temporarily (or manually) modified, thus avoiding the resulting risks in terms of security and reliability. Finally, not only interface modifications such as the ones depicted in Figure 1, but also backend modifications are easier to perform, since aspect technology allows a behavior to be changed even if it is scattered throughout the system code.
The practical implications of using AOP for this purpose can be easily seen in an example. Consider for instance Amazon’s recommendation algorithm, which is invoked in many places throughout the website such as its general catalog pages, its shopping cart, etc. Assume that Amazon’s development team wonders whether an alternative algorithm that they developed would perform better than the original. With traditional approaches they could modify the source code only by (i) replicating the code on a different server and replacing all the calls⁷ made to the recommendation algorithm, or (ii) including a condition contingent on the variant that is being executed in each call to the algorithm. Using aspects instead enables us to write a simple statement (pointcut) to intercept every call to the recommendation algorithm throughout the site, and replace it with the call to the new algorithm.
Experimenting with variants may require going beyond mere behavior replacement though. This means that any given variant may require for its implementation the modification of data structures or method additions to some classes. Consider for instance a test in which developers want to monitor how customers react to discounts on products in a catalog. Assume that discounts can be different for each product and that the site has not initially been designed to include any information on discounts, i.e., this information needs to be introduced somewhere in the code.
To solve this problem we can use inter-type declarations. Aspects can declare members (fields, methods, and constructors) that are owned by other classes. These are called inter-type members. As can be observed in Listing 2, we introduce an additional discount field in our item class, and also a getDiscountedPrice() method which will be used whenever the discounted price of an item is to be retrieved. Note that we need to
7 In the simplest case, only the algorithm’s implementation would be replaced. However, modifications on each of the calls may also be required, e.g., due to differences in the signature with respect to the original algorithm’s implementation.
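Listing 2. Item discount inter-type declarations
aspect itemDiscount{
    /* Field added to the Item class by the aspect */
    private Item::$discount;
    /* Method added to the Item class by the aspect */
    public function Item::getDiscountedPrice(){
        return ($this->price - $this->discount);
    }
}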
introduce a new method, because it should still be possible to retrieve the original, non-discounted price.
Data Collection and User Interaction. The code in charge of measuring and collecting data for the experiment can also be written as aspects in a concise manner. Consider a new experiment with our checkout example in which we want to calculate how much customers spend on average when they visit our site. To this end, we need to add up the amount of money spent on each purchase. One way to implement this functionality is again inter-type declarations.
Listing 3. Data collection aspect
aspect accountPurchase{
    /* Database reference owned by the aspect */
    private $dbtest;
    pointcut commitTrans:exec(Cart::doCheckout(*));
    /* Inter-type method added to the Cart class */
    function Cart::accountPurchase(DBManager $db){
        $db->insert($this->getUserName(), $this->total);
    }
    around($this): commitTrans{
        /* Record the purchase only if the intercepted checkout succeeds */
        if (proceed()){
            $this->accountPurchase($thisAspect->dbtest);
        }
    }
}
When the aspect in Listing 3 intercepts the method that completes a purchase (Cart.doCheckout()), the associated advice inserts the sales amount into a database that collects the results from the experiment (but only if the execution of the intercepted method succeeds, which is represented by proceed() in the advice). It is worth noting that while the database reference belongs to the aspect, the method used to insert the data belongs to the Cart class. Aspects permit the easy and consistent modification of the methods that collect, measure, and synthesize the OEC from the gathered data to be presented to the test administrator in order to be analyzed. Moreover, data collection procedures do not need to be replicated across the different variants, since the system will weave this functionality across all of them.
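For readers without a phpAspect installation, the two uses of around advice in this section (replacing a rendering method as in Listing 1, and recording a purchase only when the intercepted call succeeds as in Listing 3) can be imitated in plain Python by wrapping methods at run time. This is only an analogy under our own assumptions: the around helper, the simplified Cart class and the in-memory sales_log list are hypothetical stand-ins, and Python function wrapping is not an aspect weaver.

```python
import functools


class Cart:
    """Simplified stand-in for the Cart class of Figure 4."""

    def __init__(self, user_name, total):
        self.user_name = user_name
        self.total = total

    def print_total_box(self):
        # Original rendering: text and amount in different boxes.
        return "<td>Total</td><td>%.2f</td>" % self.total

    def do_checkout(self):
        # Pretend that the purchase succeeds for non-empty carts.
        return self.total > 0


def around(cls, method_name, advice):
    """Replace cls.method_name by advice(proceed, self, *args, **kwargs)."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        def proceed():
            return original(self, *args, **kwargs)
        return advice(proceed, self, *args, **kwargs)

    setattr(cls, method_name, wrapper)


# Analogue of Listing 1: the advice never calls proceed(), so the
# original rendering code is replaced entirely.
def same_box_total(proceed, cart):
    return "<td>Total: %.2f</td>" % cart.total


# Analogue of Listing 3: run the original checkout, log only on success.
sales_log = []


def account_purchase(proceed, cart):
    succeeded = proceed()
    if succeeded:
        sales_log.append((cart.user_name, cart.total))
    return succeeded


around(Cart, "print_total_box", same_box_total)
around(Cart, "do_checkout", account_purchase)
```

As in the aspect version, the Cart class itself is left untouched; removing the two around calls restores the original behavior.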
User Assignment. Rather than implementing user assignment in a proxy or load balancer that routes requests to different servers, or including it in the implementation of the base system, we experimented with two different alternatives of aspect-based server-side selection:
– Dynamic aspect weaving: A user routing module acts as an entry point to the base system. This module assigns the user to a particular variant by looking up what aspects have to be woven to produce the particular variant to which the current user had been assigned. The module then incorporates these aspects dynamically upon each request received by the server, flexibly producing variants in accordance with the user’s assignment. Although this approach is elegant and minimizes storage requirements, it does not scale well. Having to weave a set of aspects (even if they are only a few) on the base system upon each request to the server is very demanding in computational terms, and prone to errors in the process.
– Static aspect weaving: The different variants are computed offline, and each of them is uploaded to the server. In this case the routing module just forwards the user to the corresponding variant stored on the server (the base system is treated just like another variant for the purpose of the experiment). This method does not slow down the operation of the server and is a much more robust approach to the problem. The only downside of this alternative is that the code corresponding to the different variants has to be stored temporarily on the server (although this is a minor inconvenience since usually the amount of space required is negligible compared to the average server storage capacity). Furthermore, this alternative is cheaper than traffic splitting, since it does not require the use of a fleet of servers nor the modification of the system’s logic. This approach still allows one to spread the different variants across several servers in case of high traffic load.
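With static weaving, the routing module only has to map each user to one of the precomputed variants and keep that mapping stable across requests. The Python sketch below shows one possible way to do this by hashing a user identifier. The hashing scheme, the function name assign_variant, the experiment label and the variant names are all assumptions for illustration, since the text does not prescribe a particular randomization algorithm.

```python
import hashlib


def assign_variant(user_id, variants, experiment="checkout-test"):
    """Deterministically map a user to one of the deployed variants.

    The same user always gets the same variant for a given experiment,
    and users are spread roughly evenly over the variants.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Hypothetical names of statically woven variants; the base system is
# treated as just another variant.
VARIANTS = ["base", "variant_b"]
```

A routing module could call assign_variant(user_id, VARIANTS) on each request and forward the user to the directory holding that variant. Including the experiment label in the hash keeps assignments in different experiments independent of each other.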
5 Tool Support
The approach for online experiments on websites that we presented in this article has been implemented in a prototype tool called WebLoom. It includes a graphical user interface to build and visualize feature models that can be used as the structure upon which controlled experiments on a website can be defined. In addition, the user can write aspect code which can be attached to the different features. Once the feature model and associated code have been built, the tool supports both automatic and manual variant generation, and is able to deploy aspect code which lays out all the necessary infrastructure to perform the designed test on a particular website. The prototype has been implemented in Python, using the wxWidgets toolkit for the development of the user interface. It both imports and exports simple feature models described in an XML format specific to the tool. The prototype tool’s graphical user interface is divided into three main working areas:
Fig. 5. WebLoom displaying the product line model depicted in Figure 2
– Feature model. This is the main working area where the feature model can be specified (see Figure 5). It includes a toolbar for the creation and modification of discriminants and a code editor for associated modifications. This area also allows the selection of features in order to generate variants.
– Variant management. Variants generated on the site model area can be added or removed from the current test, renamed or inspected. A compilation of the description of all features contained in a variant is automatically presented to the user based on feature selections when the variant is selected (Figure 6, bottom).
– Overall Estimation Criteria. One or more OEC to measure in the experiments can be defined in this section. Each OEC is labeled so that it can be identified later on, and the associated code for gathering and processing data is directly defined by the test administrator.
In Figure 7, we can observe the interaction with our prototype tool. The user enters a description of the potential modifications to be performed on the website, in order to produce the different variants under WebLoom’s guidance. This results in a basic feature model structure, which is then enriched with code associated with the aforementioned modifications (aspects). Once the feature model is complete, the user can freely select a number of features using the interface,
Fig. 6. Variant management screen in WebLoom
[Figure 7 labels: stages 1. Design, 2. Aspect Code Generation, 3. Aspect Weaving; design steps 1.a Specify Feature Model, 1.b Add Feature Code, 1.c Define Variants 1..n (by Selecting Features), 1.d Define OECs; other elements: Designer, WebLoom, Aspect Code for Variants 1..n, Data Collection Aspect Code, System Logic, Weaver, Test Implementation.]
Fig. 7. Operation of WebLoom
and take snapshots of the current selections in order to generate variants. These variants are automatically checked for validity before being incorporated into the variant collection. Alternatively, the user can ask the tool to generate all the valid variants for the current feature model and then remove the ones which are not interesting for the experiment. Once all necessary input has been received, the tool gathers the code for each particular variant to be tested in the experiment, by collecting all the aspects associated with the features that were selected for the variant. It then invokes
the weaver to produce the actual variant code for the designed test by weaving the original system code with the collection of aspects produced by the tool.
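The gathering step described above amounts to a lookup from the selected features to their attached aspects, followed by one weaver invocation per variant. A minimal Python sketch follows, assuming a hypothetical mapping FEATURE_ASPECTS from feature identifiers to aspect file names and a weave callable supplied by the caller. Neither reflects WebLoom's actual internals, which the text does not detail.

```python
# Hypothetical association of features with aspect source files.
# Features that match the base system (for example F1.3.1) need none.
FEATURE_ASPECTS = {
    "F1.3.2": ["replaceTotalBox.aspect"],
    "F1.4": ["discountBox.aspect", "couponCodeBox.aspect"],
}


def aspects_for(variant_features, feature_aspects=FEATURE_ASPECTS):
    """Collect, without duplicates and in a stable order, the aspects
    attached to the features selected for one variant."""
    collected = []
    for feature in sorted(variant_features):
        for aspect in feature_aspects.get(feature, []):
            if aspect not in collected:
                collected.append(aspect)
    return collected


def build_variants(variants, base_dir, weave):
    """Call weave(base_dir, aspects) once per variant and return the
    results keyed by variant name.

    `weave` stands in for the external weaver invocation.
    """
    return {name: weave(base_dir, aspects_for(features))
            for name, features in variants.items()}
```

A variant whose features all match the base system yields an empty aspect list, which corresponds to treating the unmodified site as one more variant.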
6 Related Work
Software product lines and feature-oriented design and programming have already been successfully applied in the development of Web applications, to significantly boost productivity by exploiting commonalities and reusing as many assets (including code) as possible. For instance, Trujillo et al. [29] present a case study of Feature Oriented Model Driven Design (FOMDD) on a product line of portlets (Web portal components). In this work, the authors expressed variations in portlet functionality as features, and synthesized portlet specifications by composing them conveniently. Likewise, Petersson and Jarzabek [24] present an industrial case study in which their reuse technique XVCL was incrementally applied to generate a Web architecture from the initial code base of a Web portal. The authors describe the process that led to the development of the Web Portal product line. Likewise, aspect-oriented software development has been previously applied to the development of Web applications. Valderas et al. [31] present an approach for dealing with crosscutting concerns in Web applications from requirements to design. Their approach aims at decoupling requirements that belong to different concerns. These are separately modeled and specified using the task-based notation, and later integrated into a unified requirements model that is the source of a model-to-model and model-to-code generation process yielding Web application prototypes that are built from task descriptions. Although the aforementioned approaches meet their purpose of boosting productivity by taking advantage of commonalities, and of easing maintenance by properly encapsulating crosscutting concerns, they do not jointly exploit the advantages of both approaches. Moreover, although they are situated in the context of Web application development, they are not well suited to the specific characteristics of online test design and implementation which have been described in previous sections. 
The idea of combining software product lines and aspect-oriented software development techniques does already have some tradition in software engineering. In fact, Lee et al. [18] present some guidelines on how feature-oriented analysis and aspects can be combined. Likewise, Loughran and Rashid [19] propose framed aspects as a technique and methodology that combines AOSD, frame technology, and feature-oriented domain analysis in order to provide a framework for implementing fine-grained variability. In [20], they extend this work to support product line evolution using this technique. Other approaches such as [32] aim at implementing variability, and the management and tracing of requirements for implementation by integrating model-driven and aspect-oriented software development. The AMPLE project [1] takes this approach one step further along the software lifecycle and maintenance, aiming at traceability during product line evolution. In the particular context of Web applications, Alférez and
Suesaowaluk [8] introduce an aspect-oriented product line framework to support the development of software product lines of Web applications. This framework is similarly aimed at identifying, specifying, and managing variability from requirements to implementation.
Although both the aforementioned approaches and our own proposal employ software product lines and aspects, there is a key difference in the way these elements are used. The earlier approaches are concerned with the general process of system construction by identifying and reusing aspect-oriented components, whereas our approach deals with the specific problem of online test design and implementation, where different versions of a Web application with a limited lifespan are generated to test user behavioral response. Hence, our framework is intended to generate lightweight aspects which are used as a convenient means for the transient modification of parts of the system. In this sense, it is worth noting that system and test designs and implementations are completely independent of each other, and that aspects are only involved as a means to generate system variants, but are not necessarily present in the original system design. In addition, our approach provides automatic support for the generation of all valid variants within the product line, and does not require the modification of the underlying system, which stays online throughout the whole online test process.
To the best of our knowledge, no research has so far been reported on treating online test design and implementation in a systematic manner. A number of consulting firms have already specialized in analyzing companies’ Web presence [2,6,3]. These firms offer ad-hoc studies of Web retail sites with the goal of achieving higher conversion rates. Some of them use proprietary technology that is usually focused on the statistical aspects of the experiments, requiring significant code refactoring for test implementation⁸.
Finally, SiteSpect [5] is a software package which takes a proxy-based approach to online testing. When a Web client makes a request to the Web server, it is first received by the software and then forwarded to the server (this is used to track user behavior). Likewise, responses with content are also routed through the software, which injects the HTML code modifications and forwards the modified responses to the client. Although the manufacturers claim that it does not matter whether content is generated dynamically or statically by the server since modifications are performed by replacing pieces of the generated HTML code, we find this approach adequate for trivial changes to a site only, and not very suitable for user data collection and measurement. Moreover, no modifications can be applied to the logic of the application. These shortcomings severely impair this method which is not able to go beyond simple visual changes to the site.
7 Concluding Remarks
In this paper, we presented a novel and systematic approach to the development of controlled online tests for the effects of webpage variants on users, based on software product lines and aspect-oriented software development. We also described how the drawbacks of traditional approaches, such as high costs and development effort, can be overcome with our approach. We believe that its benefits are especially valuable for the specific problem domain that we address. On the one hand, testing is performed on a regular basis for websites in order to continuously improve their conversion rates. On the other hand, a very high percentage of the tested modifications are usually discarded since they do not improve the site performance. As a consequence, a lot of effort is lost in the process. We believe that WebLoom will save Web developers time and effort by reducing the amount of work they have to put into the design and implementation of online tests. Although there is a wide range of choices available for the implementation of Web systems, our approach is technology-agnostic and most likely deployable to different platforms and languages. However, we observed that in order to fully exploit the benefits of this approach, a website should first be checked as to whether its implementation meets the modularity principle. This is of special interest at the presentation layer, where user interface component placement, user interface style elements, event declarations and application logic traditionally tend to be mixed up [23]. Regarding future work, a first perspective aims at enhancing our basic prototype with additional WYSIWYG extensions for its graphical user interface. Specifically, developers should be enabled to immediately see the effects that code modifications and feature selections will have on the appearance of their website. This is intended to help them deal with variant generation in a more effective and intuitive manner. A second perspective is refining the variant validation process so that variation points in feature models that are likely to cause significant design variations can be identified, thus reducing the variability.

Footnote 8: It is, however, not easy to thoroughly compare these techniques from an implementation point of view, since firms tend to be quite secretive about them.
References
1. Ample project, http://www.ample-project.net/
2. Offermatica, http://www.offermatica.com/
3. Optimost, http://www.optimost.com/
4. phpAspect: Aspect oriented programming for PHP, http://phpaspect.org/
5. Sitespect, http://www.sitespect.com
6. Vertster, http://www.vertster.com/
7. Software product lines: practices and patterns. Addison-Wesley Longman Publishing Co., Boston (2001)
8. Alférez, G.H., Suesaowaluk, P.: An aspect-oriented product line framework to support the development of software product lines of web applications. In: SEARCC 2007: Proceedings of the 2nd South East Asia Regional Computer Conference (2007)
9. Colyer, A., Clement, A., Harley, G., Webster, M.: Eclipse AspectJ: Aspect-Oriented Programming with AspectJ and the Eclipse AspectJ Development Tools. Pearson Education, Upper Saddle River (2005)
10. Eisenberg, B.: How to decrease sales by 90 percent, http://www.clickz.com/1588161
11. Eisenberg, B.: How to increase conversion rate 1,000 percent, http://www.clickz.com/showPage.html?page=1756031
12. Filman, R.E., Elrad, T., Clarke, S., Aksit, M. (eds.): Aspect-Oriented Software Development. Addison-Wesley, Reading (2004)
13. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, S.: Feature-oriented domain analysis (FODA) feasibility study. Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University (November 1990)
14. Kang, K.C., Kim, S., Lee, J., Kim, K., Shin, E., Huh, M.: FORM: A feature-oriented reuse method with domain-specific reference architectures. Ann. Software Eng. 5, 143–168 (1998)
15. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An Overview of AspectJ. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 327–353. Springer, Heidelberg (2001)
16. Kohavi, R., Henne, R.M., Sommerfield, D.: Practical Guide to Controlled Experiments on the Web: Listen to your Customers not to the HIPPO. In: Berkhin, P., Caruana, R., Wu, X. (eds.) Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, pp. 959–967. ACM, New York (2007)
17. Kohavi, R., Round, M.: Front Line Internet Analytics at Amazon.com (2004), http://ai.stanford.edu/~ronnyk/emetricsAmazon.pdf
18. Lee, K., Kang, K.C., Kim, M., Park, S.: Combining feature-oriented analysis and aspect-oriented programming for product line asset development. In: SPLC 2006: Proceedings of the 10th International Software Product Line Conference, Washington, DC, USA, pp. 103–112. IEEE Computer Society, Los Alamitos (2006)
19. Loughran, N., Rashid, A.: Framed aspects: Supporting variability and configurability for AOP. In: Bosch, J., Krueger, C. (eds.) ICSR 2004. LNCS, vol. 3107, pp. 127–140. Springer, Heidelberg (2004)
20. Loughran, N., Rashid, A., Zhang, W., Jarzabek, S.: Supporting product line evolution with framed aspects. In: Lorenz, D.H., Coady, Y. (eds.) ACP4IS: Aspects, Components, and Patterns for Infrastructure Software, March, pp. 22–26
21. Mannion, M., Cámara, J.: Theorem proving for product line model verification. In: van der Linden, F.J. (ed.) PFE 2003. LNCS, vol. 3014, pp. 211–224. Springer, Heidelberg (2004)
22. McGlaughlin, F., Alt, B., Usborne, N.: The power of small changes tested (2006), http://www.marketingexperiments.com/improving-website-conversion/power-small-change.html
23. Mikkonen, T., Taivalsaari, A.: Web applications – spaghetti code for the 21st century. In: Dosch, W., Lee, R.Y., Tuma, P., Coupaye, T. (eds.) Proceedings of the 6th ACIS International Conference on Software Engineering Research, Management and Applications, SERA 2008, Prague, Czech Republic, pp. 319–328. IEEE Computer Society, Los Alamitos (2008)
24. Pettersson, U., Jarzabek, S.: Industrial experience with building a web portal product line using a lightweight, reactive approach. In: Wermelinger, M., Gall, H. (eds.) Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Lisbon, Portugal, pp. 326–335. ACM, New York (2005)
25. PHP: Hypertext preprocessor, http://www.php.net/
26. Popovici, A., Frei, A., Alonso, G.: A Proactive Middleware Platform for Mobile Computing. In: Endler, M., Schmidt, D. (eds.) Middleware 2003. LNCS, vol. 2672. Springer, Heidelberg (2003)
27. Roy, S.: 10 Factors to Test that Could Increase the Conversion Rate of your Landing Pages (2007), http://www.wilsonweb.com/conversion/suman-tra-landing-pages.htm
28. Taguchi, G.: The role of quality engineering (Taguchi Methods) in developing automatic flexible manufacturing systems. In: Proceedings of the Japan/USA Flexible Automation Symposium, Kyoto, Japan, July 9-13, pp. 883–886 (1990)
29. Trujillo, S., Batory, D.S., Díaz, O.: Feature oriented model driven development: A case study for portlets. In: Proceedings of the 30th International Conference on Software Engineering (ICSE 2007), Leipzig, Germany, pp. 44–53. IEEE Computer Society, Los Alamitos (2007)
30. Usborne, N.: Design choices can cripple a website (2005), http://alistapart.com/articles/designcancripple
31. Valderas, P., Pelechano, V., Rossi, G., Gordillo, S.E.: From crosscutting concerns to web systems models. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds.) WISE 2007. LNCS, vol. 4831, pp. 573–582. Springer, Heidelberg (2007)
32. Voelter, M., Groher, I.: Product line implementation using aspect-oriented and model-driven software development. In: SPLC 2007: Proceedings of the 11th International Software Product Line Conference, Washington, DC, USA, pp. 233–242. IEEE Computer Society, Los Alamitos (2007)
Frontiers of Structured Business Process Modeling
Dirk Draheim
Central IT Services Department, University of Innsbruck
[email protected]
Abstract. In this article we investigate to what extent a structured approach can be applied to business process modelling. We try to contribute to a better understanding of the driving forces behind business process specifications.
1 Introduction
Isn’t it compelling to apply the structured programming arguments to the field of business process modelling? Our answer to this question is ‘no’. The principle of structured programming emerged in the computer science community. From today’s perspective, the discussion of structured programming had the characteristics of a maturing process rather than those of a debate, although there have also been some prominent sceptical comments on the unrestricted validity of the structured programming principle. Structured programming is a well-established design principle in the field of program design, just as the third normal form is in the field of database design. It is commonly accepted that structured programming is better than unstructured programming (or, let’s say, structurally unrestricted programming), and this is what is taught as foundational knowledge in many standard curricula of software engineering study programmes. With respect to business process modelling, in practice, you find huge business process models that are arbitrary nets. How come? Is it somehow due to a lack of knowledge transfer from the programming language community to the information system community? For computer scientists, it might be tempting to state that structured programming is a proven concept and that it is therefore necessary to eventually promote a structured business process modelling discipline; however, care must be taken. In this article, we want to contribute to the understanding of how far a structured approach can be applied to business process modelling and to what extent such an approach is naive. We attempt to clarify that the arguments of structured programming are about the pragmatics of programming and that they often relied on evidence in the past. Consequently, our reasoning is at the level of the pragmatics of business process modelling.
We try to avoid getting lost in superficial comparisons of modelling language constructs and instead try to understand the core problems of structuring business process specifications. As an example, so to speak as a taster of our discussion, we anticipate one of our arguments here, which is subtle but important: there are some diagrams expressing behaviour that cannot be transformed into a structured diagram expressing the same behaviour solely in terms of the same primitives as the original structurally unrestricted diagram. These are all those diagrams that contain a loop which is exited via more than one exit point. This is a known result from the literature, encountered [1] by Corrado Böhm and Giuseppe Jacopini, proven for a special case [5] by Donald E. Knuth and Robert W. Floyd, and proven in general [6] by S. Rao Kosaraju.

A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 136–155, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Basic Definitions
In this section we explain the notions of program, structured program, flowchart, D-flowchart, structured flowchart, business process model and structured business process model as used in this article. The section is mainly about syntactical issues. You might want to skip this section and use it as a reference; however, you should at least glance over the formation rules of structured flowcharts defined in Fig. 1, which are also the basis for structured business process modelling. In the course of this article, programs are imperative programs with go-to-statements, i.e., they consist of basic statements, sequences, case constructs, loops and go-to-statements. Structured programs are those programs that abstain from go-to-statements. In programs with go-to-statements, loops do not add to the expressive power; in the presence of go-to-statements, loops are syntactic sugar. Flowcharts correspond to programs. Flowcharts are directed graphs whose nodes are basic activities, decision points or join points. A directed cycle in a flowchart can be interpreted as a loop or as the usage of a go-to-statement. In general flowcharts it is allowed to place join points arbitrarily, which makes it possible to create spaghetti structures, i.e., arbitrary jump structures, just as the go-to-statement allows for the creation of spaghetti code. It is a matter of taste whether to make decision and join points explicit nodes or not. If you strictly use decision and join points, the basic activities always have exactly one incoming and one outgoing edge. In concrete modelling languages like event-driven process chains, there are usually some more constraints, e.g., a constraint on decision points not to have more than one incoming edge or a constraint on join points not to have more than one outgoing edge. If you allow basic activities to have more than one incoming edge, you do not need join points any more. Similarly, you can get rid of a decision point by directly connecting its several branches as outgoing edges to a basic activity and labelling them with appropriate flow conditions. For example, in formcharts [3] we have chosen the option not to use explicit decision and join points. The discussion in this article is independent of the detailed question of having explicit or implicit decision and join points, because both concepts are interchangeable. Therefore, in this article, we feel free to use both options.
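The graph view of flowcharts described above can be made concrete with a small Python sketch. It is only an illustration: the names `Flowchart` and `activities_are_linear` and the encoding of nodes and edges are assumptions made here, not notation from this article. The sketch checks the stated property that, with explicit decision and join points, every basic activity has exactly one incoming and one outgoing edge.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Flowchart:
    # Maps each node to its kind: "activity", "decision" or "join"
    # (a hypothetical encoding chosen for this sketch).
    kinds: dict
    # Directed edges as (source, target) pairs.
    edges: list = field(default_factory=list)


def activities_are_linear(chart, start=None, end=None):
    """True if every activity has exactly one incoming and one outgoing edge.

    The start activity may lack an incoming edge and the end activity
    may lack an outgoing edge.
    """
    indeg = Counter(target for _, target in chart.edges)
    outdeg = Counter(source for source, _ in chart.edges)
    for node, kind in chart.kinds.items():
        if kind != "activity":
            continue
        want_in = 0 if node == start else 1
        want_out = 0 if node == end else 1
        if indeg[node] != want_in or outdeg[node] != want_out:
            return False
    return True


# "WHILE alpha DO A; B;" with an explicit join point j before the decision.
explicit = Flowchart(
    kinds={"j": "join", "alpha": "decision", "A": "activity", "B": "activity"},
    edges=[("j", "alpha"), ("alpha", "A"), ("A", "j"), ("alpha", "B")],
)

# Implicit join: activity A receives two incoming edges (from S and from alpha).
implicit = Flowchart(
    kinds={"S": "activity", "A": "activity", "alpha": "decision", "B": "activity"},
    edges=[("S", "A"), ("A", "alpha"), ("alpha", "A"), ("alpha", "B")],
)

print(activities_are_linear(explicit, end="B"))
print(activities_are_linear(implicit, start="S", end="B"))
```

The first chart satisfies the property, while the second one does not, because dropping the join point moves the second incoming edge onto activity A.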
2.1 D-Charts
It is possible to define formation rules for a restricted class of flowcharts that correspond to structured programs. In [6] these diagrams are called Dijkstra-flowcharts, or D-flowcharts for short, named after Edsger W. Dijkstra. Figure 1 summarizes the semi-formal formation rules for D-flowcharts.
[Figure 1: diagrams of the formation rules (i) basic activity, (ii) sequence, (iii) case, (iv) do-while and (v) repeat-until.]
Fig. 1. Semi-formal formation rules for structured flowcharts
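The five formation rules can be read as the constructors of an abstract syntax tree. The following Python sketch is one possible encoding under stated assumptions: the class names, the two-branch form of the case construct, the reading of rule (iv) as a loop that tests its condition before the body, and the `decide` callback that answers decision points are all choices made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Activity:       # rule (i): basic activity
    name: str


@dataclass
class Seq:            # rule (ii): sequence of two structured flowcharts
    first: object
    second: object


@dataclass
class Case:           # rule (iii): case, assumed here to have two branches
    cond: str
    yes: object
    no: object


@dataclass
class While:          # rule (iv): condition is tested before the body
    cond: str
    body: object


@dataclass
class RepeatUntil:    # rule (v): condition is tested after the body
    body: object
    cond: str


def run(node, decide, trace):
    """Execute a structured flowchart and record the visited activities.

    `decide(cond)` is a caller-supplied function that answers a decision
    point with True or False.
    """
    if isinstance(node, Activity):
        trace.append(node.name)
    elif isinstance(node, Seq):
        run(node.first, decide, trace)
        run(node.second, decide, trace)
    elif isinstance(node, Case):
        run(node.yes if decide(node.cond) else node.no, decide, trace)
    elif isinstance(node, While):
        while decide(node.cond):
            run(node.body, decide, trace)
    elif isinstance(node, RepeatUntil):
        while True:
            run(node.body, decide, trace)
            if decide(node.cond):
                break
    return trace


# "WHILE alpha DO A; B;" with alpha answering yes, yes, no.
demo = Seq(While("alpha", Activity("A")), Activity("B"))
answers = iter([True, True, False])
print(run(demo, lambda cond: next(answers), []))  # ['A', 'A', 'B']
```

Because every constructor nests complete sub-flowcharts, a tree built this way cannot express a jump into or out of the middle of a loop, which is exactly the restriction that distinguishes structured flowcharts from general ones.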
Actually, the original definition of D-flowcharts in [6] consists of the formation rules (i) to (iv), with one formation rule for each programming language construct of a minimal structured imperative programming language with basic statements, sequences, case-constructs and while-loops, where basic activities in the flowchart correspond to basic statements in the programming language. We have added a formation rule (v) for the representation of repeat-until-loops and call flowcharts resulting from rules (i) to (v) structured flowcharts in the sequel. The flowchart in Fig. 2 is not a structured flowchart, i.e., it cannot be derived from the formation rules in Fig. 1. The flowchart in Fig. 2 can be interpreted as consisting of a repeat-until-loop exited via the α-decision point and followed by further activities ‘C’ and ‘D’. In this case, the β-decision point can lead to a branch that jumps into the repeat-until-loop in addition to the regular loop entry point via activity ‘A’, which infringes the structured programming and structured modelling principle and gives rise to a spaghetti structure. This way, the flowchart in Fig. 2 visualizes the program in Listing 1.

[Figure 2: flowchart with the activities A, B, C and D and the decision points α and β.]
Fig. 2. Example flowchart that is not a D-flowchart

Listing 1
01 REPEAT
02   A;
03   B;
04 UNTIL alpha;
05 C;
06 IF beta THEN GOTO 03;
07 D;
The flowchart in Fig. 2 can also be interpreted as consisting of a while-loop exited via the β-decision point, where the while-loop is surrounded by a preceding activity ‘A’ and a succeeding activity ‘D’. In this case, the α-decision point can lead to a branch that jumps out of the while-loop in addition to the regular loop exit via the β-decision point, which again infringes the structured modelling principle. This way, the flowchart in Fig. 2 visualizes the program in Listing 2.

Listing 2
01 A;
02 REPEAT
03   B;
04   IF NOT alpha THEN GOTO 01;
05   C;
06 UNTIL NOT beta;
07 D;
Flowcharts are visualizations of programs. In general, a flowchart can be interpreted ambiguously as the visualization of several different program texts, because, for example, an edge from a decision point to a join point can be interpreted as a go-to-statement on the one hand or as the back branch from an exit point of a repeat-until loop to the start of the loop on the other hand. Structured flowcharts are visualizations of structured programs. Loops in structured programs and structured flowcharts enjoy the property that they have exactly one entry point and exactly one exit point. Whereas the entry point and the exit point of a repeat-until loop are different, the entry point and exit point of a while-loop are the same, so that a while-loop in a structured flowchart has exactly one contact point. That might be the reason that structured flowcharts that use only while-loops instead of repeat-until loops appear more normalized. Similarly, in a structured program and flowchart every case-construct has exactly one entry point and one exit point. In general, additional entry and exit points can be added to loops and case constructs by the usage of go-to-statements in programs and by the usage of arbitrary decision points in flowcharts. In structured flowcharts, decision points are introduced as part of the loop constructs and part of the case construct. In structured programs and flowcharts, loops and case-constructs are strictly nested along the lines of the derivation of their abstract syntax tree. Business process models extend flowcharts with further modelling elements like a parallel split, parallel join or non-deterministic choice. Basically, we discuss the issue of structuring business process models in terms of flowcharts in this article, because flowcharts actually are business process model diagrams, i.e., flowcharts form a subset of business process models. Like the constructs in the formation rules of Fig. 1, further business process modelling elements can also be introduced in a structured manner, with the result of having again only such diagrams that are strictly nested in terms of their looping and branching constructs. For example, in such a definition the parallel split and the parallel join would not be introduced separately but as belonging to a parallel modelling construct.

2.2 A Notion of Equivalence for Business Processes
Bisimilarity has been defined formally in [15] as an equivalence relation for infinite automaton behaviour, i.e., process algebra [12,13]. Bisimilarity expresses that two processes are equal in terms of their observable behaviour. Observable behaviour is the appropriate notion for the comparison of automatic processes. The semantics of a process can also be understood as the opportunities of one process interacting with another process. Observable behaviour and experienced opportunities are different viewpoints on the semantics of a process; however, whichever viewpoint is chosen, it does not change the basic concept of bisimilarity. Business processes can be fully automatic; however, business processes can also be descriptions of human actions and can therefore also be rather a protocol of possible steps undertaken by a human. We therefore choose to explain bisimilarity in terms of the opportunities of an actor, or, as a metaphor, from the perspective of a player who uses the process description as a game board, which fits neatly with the notions of simulation and bisimulation, i.e., bisimilarity. In general, two processes are bisimilar if, starting from the start node, they reveal the same opportunities and each pair of same opportunities leads again to bisimilar processes. More formally, bisimilarity is defined on labelled transition systems as the existence of a bisimulation, which is a relation that enjoys the aforementioned property, i.e., nodes related by the bisimulation lead via the same opportunities to nodes that are again, i.e., recursively, related by the bisimulation. In the non-structured models in this article the opportunities are edges leading out of an activity and the two edges leading out of a decision point. For our purposes, bisimilarity can be characterized by the rules in Fig. 3.
[Figure 3: rules (i) to (iii), characterizing when two business process model fragments are bisimilar.]
Fig. 3. Characterization of bisimilarity for business process models
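For finite labelled transition systems, the recursive definition of bisimilarity given above can be checked mechanically. The following Python sketch is a naive illustration and not an algorithm from this article: it starts from the relation that contains all pairs of states and repeatedly removes pairs that violate the bisimulation condition. The dictionary encoding of a transition system and the function name `bisimilar` are assumptions made here.

```python
def bisimilar(lts1, s1, lts2, s2):
    """Naive check whether state s1 of lts1 and state s2 of lts2 are bisimilar.

    Each transition system maps every state to a set of
    (label, successor) pairs; states without moves map to an empty set.
    """
    # Start with all pairs and shrink towards the greatest bisimulation.
    rel = {(p, q) for p in lts1 for q in lts2}
    changed = True
    while changed:
        changed = False
        for (p, q) in list(rel):
            # Every move of p must be matched by an equally labelled move of q.
            forth = all(
                any(a == b and (p2, q2) in rel for (b, q2) in lts2[q])
                for (a, p2) in lts1[p]
            )
            # Every move of q must be matched by an equally labelled move of p.
            back = all(
                any(a == b and (p2, q2) in rel for (a, p2) in lts1[p])
                for (b, q2) in lts2[q]
            )
            if not (forth and back):
                rel.discard((p, q))
                changed = True
    return (s1, s2) in rel


# A loop and the same loop unfolded once offer the same opportunities.
loop = {"s": {("A", "s"), ("B", "t")}, "t": set()}
unfolded = {
    "u0": {("A", "u1"), ("B", "v")},
    "u1": {("A", "u0"), ("B", "v")},
    "v": set(),
}

# Choosing between b and c after a differs from choosing already at a.
late_choice = {"p0": {("a", "p1")}, "p1": {("b", "p2"), ("c", "p3")},
               "p2": set(), "p3": set()}
early_choice = {"q0": {("a", "q1"), ("a", "q2")}, "q1": {("b", "q3")},
                "q2": {("c", "q4")}, "q3": set(), "q4": set()}

print(bisimilar(loop, "s", unfolded, "u0"))
print(bisimilar(late_choice, "p0", early_choice, "q0"))
```

The first pair is bisimilar, which mirrors the unfolding of loops used when restructuring diagrams; the second pair is the standard example of two systems with the same traces that are nevertheless not bisimilar.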
3 Resolving Arbitrary Jump Structures
Have a look at Fig. 4. Like Fig. 2, it shows a business process model that is not a structured business process model. The business process described by the business process model in Fig. 4 can also be described in the style of a program text, as we have done in Listing 3. In this interpretation the business process model consists of a while-loop followed by a further activity ‘B’, a decision point that might branch back into the while-loop, and eventually an activity ‘C’. Alternatively, the business process can also be described by structured business process models. Fig. 5 shows two examples of such structured business process models, and Listings 4 and 5 show the corresponding program text representations that are visualized by the business process models in Fig. 5.
Listing 3
01 WHILE alpha DO
02   A;
03 B;
04 IF beta THEN GOTO 02;
05 C;
[Figure 4: business process model with the activities A, B and C and the decision points α and β.]
Fig. 4. Example business process model that is not structured
[Figure 5: two business process models, (i) and (ii), with the activities A, B and C and the decision points α and β.]
Fig. 5. Structured business process models that replace a non-structured one
The business process models in Figs. 4 and 5 and the program texts in Listings 3, 4 and 5 describe the same business process. They describe the same business process because they are bisimilar, i.e., in terms of their nodes, which are, basically, activities and decision points, they describe the same observable behaviour or, respectively, the same opportunities to act for an actor; we have explained the notion of equality and the more precise approach of bisimilarity in more detail in Sect. 2.2. The derivation of the business process models in Fig. 5 from the formation rules given in Fig. 1 can be understood by the reader by having a look at their abstract syntax trees; the tree of model (ii) appears as tree ψ in Fig. 6. The proof that the process models in Figs. 4 and 5 are bisimilar is left to the reader as an exercise. The reader is also invited to find structured business process models that are less complex than the ones given in Fig. 5, where complexity is an informal concept that depends heavily on the perception and opinion of the modeller. For example, the model (ii) in Fig. 5 results from an immediate simple attempt to reduce the complexity of the model (i) in Fig. 5 by eliminating the ‘A’-activity which follows the α-decision point and connecting the succeeding ‘yes’-branch of the α-decision point directly back with the ‘A’-activity preceding the decision point, i.e., by reducing a while-loop-construct with a preceding statement to a repeat-until-construct. Note that the model (i) in Fig. 5 has been gained from the model in Fig. 4 by straightforwardly unfolding it behind the β-decision point as much as necessary to yield a structured description of the business process. In what sense the transformation from model (i) to model (ii) in Fig. 5 has lowered complexity, and whether it has lowered the complexity actually or rather superficially, has to be discussed in the sequel. In due course, we will also discuss another structured business process model with auxiliary logic that is oriented towards identifying repeat-until-loops in the original process descriptions.
Listing 4
01 WHILE alpha DO
02   A;
03 B;
04 WHILE beta DO BEGIN
05   A;
06   WHILE alpha DO
07     A;
08   B;
09 END;
10 C;

Listing 5
01 WHILE alpha DO
02   A;
03 B;
04 WHILE beta DO BEGIN
05   REPEAT
06     A;
07   UNTIL NOT alpha;
08   B;
09 END;
10 C;
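The claim that Listings 3, 4 and 5 describe the same behaviour can be spot-checked by simulation. The following Python sketch is an illustration under stated assumptions, not part of the original listings: the class `Oracle` and the functions `listing3`, `listing4` and `listing5` are names introduced here, the go-to-statement of Listing 3 is emulated with a program counter, and decision points are answered from a finite list that yields 'no' once it is exhausted so that every run terminates. Comparing traces for all short answer sequences is a finite sanity check, not a bisimilarity proof.

```python
from itertools import product


class Oracle:
    """Answers decision points from a fixed list and records what happens."""

    def __init__(self, answers):
        self.answers = list(answers)
        self.trace = []

    def ask(self, name):
        # Once the prepared answers run out, every decision is answered 'no'.
        value = self.answers.pop(0) if self.answers else False
        self.trace.append((name, value))
        return value

    def act(self, name):
        self.trace.append(name)


def listing3(o):
    pc = 1                       # program counter emulating the line numbers
    while True:
        if pc == 1:              # 01 WHILE alpha DO
            pc = 2 if o.ask("alpha") else 3
        elif pc == 2:            # 02 A; (loop body, then back to the test)
            o.act("A")
            pc = 1
        elif pc == 3:            # 03 B;
            o.act("B")
            pc = 4
        elif pc == 4:            # 04 IF beta THEN GOTO 02;
            pc = 2 if o.ask("beta") else 5
        else:                    # 05 C;
            o.act("C")
            return o.trace


def listing4(o):
    while o.ask("alpha"):
        o.act("A")
    o.act("B")
    while o.ask("beta"):
        o.act("A")
        while o.ask("alpha"):
            o.act("A")
        o.act("B")
    o.act("C")
    return o.trace


def listing5(o):
    while o.ask("alpha"):
        o.act("A")
    o.act("B")
    while o.ask("beta"):
        while True:              # REPEAT A; UNTIL NOT alpha;
            o.act("A")
            if not o.ask("alpha"):
                break
        o.act("B")
    o.act("C")
    return o.trace


def same_traces(max_len=6):
    """Compare the three programs on all answer sequences up to max_len."""
    for n in range(max_len + 1):
        for answers in product([True, False], repeat=n):
            t3 = listing3(Oracle(answers))
            if not (t3 == listing4(Oracle(answers)) == listing5(Oracle(answers))):
                return False
    return True


print(same_traces())
```

Since the three programs are deterministic once the answers to the decision points are fixed, equal traces on all tested answer sequences are consistent with the bisimilarity claim, although they do not replace the proof that is left to the reader.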
The above remark on the vagueness of the notion of complexity is not just a side remark or disclaimer but is at the core of the discussion. If the complexity of a model is a cognitive issue, it would be a straightforward approach to let people vote on which of the models is more complex. If there is a sufficiently precise method to test whether a person has understood the semantics of a process specification, this method can be exploited by testing groups of people that have been given different kinds of specifications of the same process and concluding from the test results which of the process specifications should be considered more complex. Such an approach relies on the preciseness of the semantics and eventually on the quality of the test method. It is a real challenge to search for a definition of the complexity of models or their representations. What we expect is that less complexity has something to do with better quality, and before we undertake efforts in defining the complexity of models we should first understand the possibilities of measuring the quality of models. The usual categories in which modellers and programmers often judge the complexity of models, like understandability or readability, are vague concepts themselves. Other categories like maintainability or reusability are more telling than understandability or readability but still vague. Of course, we can define metrics for the complexity of diagrams. For example, it is possible to define that the number of activity nodes used in a business process model increases the complexity of a model. The problem with such metrics is that it follows immediately that the models in Fig. 5 are more complex than the model in Fig. 4. Actually, this is what we believe.
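The activity-count metric mentioned above is easy to evaluate on the program texts of Listings 3, 4 and 5. The following Python sketch is only an illustration; the function name `activity_count` and the convention that activities are exactly the single upper-case letters in the program text are assumptions made here.

```python
import re

LISTING_3 = "WHILE alpha DO A; B; IF beta THEN GOTO 02; C;"
LISTING_4 = ("WHILE alpha DO A; B; WHILE beta DO BEGIN A; "
             "WHILE alpha DO A; B; END; C;")
LISTING_5 = ("WHILE alpha DO A; B; WHILE beta DO BEGIN REPEAT A; "
             "UNTIL NOT alpha; B; END; C;")


def activity_count(program):
    """Count activity statements, taken to be single upper-case letters."""
    words = re.findall(r"[A-Za-z]+", program)
    return sum(1 for word in words if len(word) == 1 and word.isupper())


for name, text in [("Listing 3", LISTING_3),
                   ("Listing 4", LISTING_4),
                   ("Listing 5", LISTING_5)]:
    print(name, activity_count(text))
```

Under this metric the non-structured program of Listing 3 scores 3, while the structured programs of Listings 4 and 5 score 6 and 5, which is the ranking that the paragraph above derives for the corresponding diagrams.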
4 Immediate Arguments for and against Structure
We believe that the models in Fig. 5 are more complex than the model in Fig. 4. A structured approach to business process models would make us believe that structured models are somehow better than non-structured models, in the same way that the structured programming approach believes that structured programs are somehow better than non-structured programs. So either less complexity need not always be better, or the credo of the structured approach must be loosened to a rule of thumb, i.e., the belief that structured models are in general better than non-structured models, despite some exceptions like our current example. An argument in favour of the structured approach could be that our current example is simply too small, i.e., that the aforementioned exceptions are made of small models or, to say it differently, that the arguments of a structured approach become valid for models beyond a certain size. We do not think so. We rather believe that our discussion scales, i.e., that the arguments that we will give in the sequel also work, or work even better, for larger models. We want to approach these questions more systematically. In order to do so, we need to answer why we believe that the models in Fig. 5 are more complex than the model in Fig. 4. Of course, the immediate answer is simply that they are larger and therefore harder to grasp, i.e., a very direct cognitive argument. But there is another important argument why we believe this. The model in Fig. 4 shows an internal reuse that the models in Fig. 5 do not show. The crucial point is the reuse of the loop consisting of the ‘A’-activity and the α-decision point in Fig. 4. We need to delve into this important aspect and will actually do this later. First, we want to discuss the dual question, which is of equal importance, i.e., we must also try to understand, or try to answer, the question of why modellers and programmers might find that the models in Fig. 5 are less complex than the model in Fig. 4.
A standard answer to this latter question could typically be that the edge from the β-decision point to the ‘A’-activity in Fig. 4 is an arbitrary jump, i.e., a spaghetti, whereas the diagrams in Fig. 5 do not show any arbitrary jumps or spaghetti phenomena. But the question is whether this vague argument can be made more precise. A structured diagram consists of strictly nested blocks. All blocks of a structured diagram form a tree-like structure according to their nesting, which corresponds also to the derivation tree in terms of the formation rules of Fig. 1. The crucial point is that each block can be considered a semantic capsule from the viewpoint of its context. This means that once the semantics of a block is understood by the analyst studying the model, the analyst can forget about the inner modelling elements of the block. This is not so for diagrams in general. This has been the argument of looking from outside onto a block in the case that a modeller wants to know its semantics in order to understand the semantics of the context where it is utilized. Also, the dual scenario can be convincing. If an analyst is interested in understanding the semantics of a block, he can do this in terms of the inner elements of the block only. Once the analyst has identified the block, he can forget about its context in order to understand it. This is not so easy in a non-structured language. When passing an element, in general you do not know where you end up when following the several paths behind it. It is also possible to subdivide a non-structured diagram into chunks that are smaller than the original diagram and that make sense to understand as capsules. For example, this can be done, if possible, by transforming the diagram into a structured one, in which you will find regions of your original diagram. However, it is extra effort to do this partitioning. With the current set of modelling elements, i.e., those introduced by the formation rules in Fig. 1, all this can be seen particularly easily, because each block has exactly one entry point, i.e., one edge leading into it. Fortunately, standard building blocks found in process modelling would have one entry point in a structured approach. If you also have, in general, blocks with more than one entry point, it would make the discussion interesting. The above argument would not be completely infringed. Blocks still are capsules, with a semantics that can be understood locally with respect to their appearance in a strictly nested structure of blocks. The scenario itself remains neat and tidy; the difference lies in the fact that a block with more than one entry has a particularly complex semantics in a certain sense. The semantics of a block with more than one entry is manifold, e.g., the semantics of a block with two entries is threefold.
Given that, in general, we also have concurrency phenomena in a business process model, the semantics of a block with two entry points, i.e., its behaviour or opportunities, must be understood for the case that the block is entered via one or the other entry point and for the case that the block is entered via both simultaneously. But this is actually not a problem; it just means a more sophisticated semantics and more documentation. Despite a more complex semantics, a block with multiple entries still remains an anchor in the process of understanding a business process model, because it is possible, e.g., to understand the model from inside to outside, strictly following the tree-like nesting, which is a canonical way to understand the diagram, i.e., a way that is always defined. It is also always possible to understand the diagram sequentially from the start node to the end node in a controlled manner. The case constructs make such sequential proceeding complex, because they open alternative paths in a tree-like manner. The advantage of a structured diagram with respect to case-constructs is that each of the alternative paths that are spawned is again a block, and it is therefore possible to understand its semantics in isolation from the other paths. This is not so in a non-structured diagram, in general, where you might have arbitrary jumps between the alternative paths. Similarly, when analyzing a structured diagram in a sequential manner, you do not get into arbitrary loops and therefore have to deal with a minimized risk of losing track.
D. Draheim
The possibility of blocks with more than one entry point immediately reminds us of the discussion within the business process community on multiple versus unique entry points for business processes in a setting of hierarchical decomposition. The relationship between blocks in a flat structured language and sub-diagrams in a hierarchical approach, and how they play together in a structured approach, is an important strand of discussion that we will come back to in due course. For the time being, we just want to point out the relationship between the discussion we just had on blocks with multiple entries and the discussion on sub-diagrams with multiple entries. An argument against sub-diagrams with multiple entries is that they are more complex. Opponents of this argument would say that it is not a real argument, because the complexity of the semantics, i.e., its aforementioned manifoldness, must be described anyhow. With sub-diagrams that may have no more than one entry point, you would need to introduce a manifoldness of diagrams, each with a single entry point. We do not discuss here how to transform a given diagram with multiple entries into a manifoldness of diagrams; all we want to remark here is that it easily becomes complicated because of the necessity to handle the aforementioned concurrency phenomena appropriately. Eventually it turns out to be a problem of transforming the diagram together with its context, i.e., transforming a set of diagrams and sub-diagrams with possibly multiple entry points into another set of diagrams and sub-diagrams with only unique entry points. Defenders of diagrams with unique entry points would state that it is better to have a manifoldness of such diagrams than a single diagram with multiple entries, because the manifoldness of diagrams better documents the complexity of the semantics of the modelled scenario.
For a better comparison of the discussed models against the above statements, we have redrawn the diagram from Fig. 4 and diagram (ii) from Fig. 5 in Fig. 6, together with the blocks they are made of and their abstract syntax tree or quasi-abstract syntax tree, respectively. The diagram of Fig. 4 appears on the left of Fig. 6 as diagram Φ, and diagram (ii) from Fig. 5 appears on the right as diagram Ψ. Accordingly, the left abstract syntax tree φ in Fig. 6 corresponds to the diagram from Fig. 4 and the right abstract syntax tree ψ corresponds to diagram (ii) from Fig. 5. Blocks are surrounded by dashed lines in Fig. 6. If you proceed in understanding the model Φ in Fig. 6, you first have to understand a while-loop that encompasses the ‘A’-activity, i.e., the block labelled with number ‘5’ in model Φ. After that, you are not done with that part of the model. Later, after the β-decision point, you are branched back to the ‘A’-activity and you have to understand the loop it belongs to once more, this time in a different manner, i.e., as a repeat-until loop, the block labelled with number ‘1’ in model Φ. It is possible to argue that, in some sense, this makes the model Φ harder to read than model Ψ. To say it differently, it is possible to view model Ψ as an instruction manual on how to read the model Φ. Actually, model Ψ is a bloated version of model Φ. It contains some modelling elements of model Φ redundantly; however, it enjoys the property that each modelling element has to be understood only in the context of one block and
Frontiers of Structured Business Process Modeling
Fig. 6. Block-structured versus arbitrary business process model
its encompassing blocks. We can restate these arguments a bit more formally by analyzing the abstract syntax trees φ and ψ in Fig. 6. Blocks in Fig. 6 correspond to constructs that can be generated by the formation rules in Fig. 1. The abstract syntax tree ψ is an alternative presentation of the nesting of blocks in model Ψ. A node stands for a block and for the corresponding construct according to the formation rules. The graphical model Φ cannot be derived from the formation rules in Fig. 1. Therefore it does not possess an abstract syntax tree in which each node represents a unique graphical block and a construct at the same time. The tree φ shows the problem. You can match the region labelled ‘1’ in model Φ as a block against the while-loop rule (iv) and you can subsequently match the region labelled ‘2’ against the sequence rule (iii). But then you get stuck. You can form a further do-while loop with rule (iv) out of the β-decision point and block ‘2’ as in model Ψ, but the resulting graphical model cannot be interpreted as a part of model Φ any more. This is because the edge from activity ‘B’ to the β-decision point graphically serves both as input branch to the decision point and as back branch to the decision point. This graphical problem is resolved in the abstract syntax tree φ by reusing the activity ‘B’ in the node that corresponds to node ‘5’ in tree ψ in forming a sequence according to rule (ii), with the result that the tree φ is actually no tree any longer. Similarly, the reuse of the modelling elements in forming node ‘6’ in the abstract syntax tree φ visualizes the double interpretation of this graphical region as both a do-while loop and a repeat-until loop.
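The observation that φ "is actually no tree any longer" can be illustrated with a small sketch. It stores a nesting structure as a mapping from each construct to the elements it contains and checks whether any element is contained in more than one construct. Both example structures and all node names are hypothetical and only loosely modelled on the discussion above; they are not the trees of Fig. 6. The check covers only the one-parent property, not connectedness or acyclicity.

```python
def parents(children):
    """Map each element to the list of constructs that contain it."""
    result = {}
    for node, kids in children.items():
        for kid in kids:
            result.setdefault(kid, []).append(node)
    return result

def shared_nodes(children):
    """Elements that occur in more than one construct (reused elements)."""
    return sorted(n for n, ps in parents(children).items() if len(ps) > 1)

def is_tree(children):
    """True if no element is reused, i.e., every element has one parent."""
    return not shared_nodes(children)

# Hypothetical structured model: repeated elements exist as separate copies.
structured = {
    "seq": ["loop1", "B", "loop2"],
    "loop1": ["alpha", "A"],
    "loop2": ["beta", "seq2"],
    "seq2": ["loop3", "B_copy"],
    "loop3": ["alpha_copy", "A_copy"],
}

# Hypothetical non-structured model: 'B' and 'loop1' are reused by 'loop2'.
with_reuse = {
    "seq": ["loop1", "B", "loop2"],
    "loop1": ["alpha", "A"],
    "loop2": ["beta", "B", "loop1"],
}

print(is_tree(structured), shared_nodes(structured))
print(is_tree(with_reuse), shared_nodes(with_reuse))
```

The second structure is smaller because it shares elements, but it is a graph with reuse rather than a tree, which mirrors the trade-off between model Φ and its bloated counterpart Ψ.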
5
Structure for Text-Based versus Graphical Specifications
In Sect. 4 we have said that an argument for a structured business process specification is that it is made of strictly nested blocks and that each identifiable block forms a semantic capsule. In that argumentation we have looked at the graphical presentation of the models only; now we will also have a look at the textual representations. This section needs a disclaimer. We are convinced that, in the discussion of the quality of models, it is risky to give arguments in terms of cognitive categories like understandability, readability, cleanness, well-designedness and well-definedness. These categories tend to have an insufficient degree of definedness themselves, so that argumentations based on them easily suffer from a lack of falsifiability. Nevertheless, in this section, for the sake of brevity, we need to speak directly about the reading ease of specifications. The judgements are our very own opinion, an opinion that expresses our perception of certain specifications. The reader may have a different opinion, and this would be interesting in its own right. At least, the expression of our own opinion may encourage the reader to judge the reading ease of certain specifications. As we said in terms of complexity, we think that the model in Fig. 4 is easier to understand than the models in Fig. 5. We think it is easier to grasp. Somewhat paradoxically, we think the opposite about the respective text representations, at least at first sight, i.e., as long as we have not internalized too much of the different graphical models behind the listings. This means, we think that the text representation of the model in Fig. 4, i.e., Listing 3, is definitely harder to understand than the text representations of both models in Fig. 5, i.e., Listings 4 and 5. How come? Maybe the following observation helps: we also think that the graphical model in Fig. 4 is easier to read than its textual representation in Listing 3 and also easier to read than the two other Listings 4 and 5.
Why is Listing 3 so relatively hard to understand? We think it is because there is no explicitly visible connection between the jumping-off point in line ‘04’ and the jump target in line ‘02’. Actually, the first thing we would recommend in order to understand Listing 3 better is to draw its visualization, i.e., the model in Fig. 4, or to concentrate and visualize it in our mind. By the way, we think that drawing some arrows in Listing 3, as we did in Fig. 7, also helps. The two arrows already help despite the fact that they make explicit only a part of the jump structure: one possible jump from line ‘01’ to line ‘03’ in case the α-condition becomes invalid must still be understood from the indentation of the text. All this is said for such a small model consisting of a total of five lines. Imagine you had to deal with a model consisting of several hundred lines with arbitrary goto-statements all over the text. If it is true that the model in Fig. 4 is easier to understand than the models in Fig. 5 and at the same time Listing 3 is harder to understand than Listings 4 and 5, this may lead us to the assumption that the understandability of graphically presented models follows other rules than the understandability of textual representations. Reasons for this may be, on the
01 WHILE alpha DO
02     A;
03 B;
04 IF beta THEN GOTO 02;
05 C;
Fig. 7. Listing enriched with arrows for making jump structure explicit
one hand, the aforementioned lack of explicit visualization of jumps and, on the other hand, the one-dimensional layout of textual representations. We have not given all of these arguments in this section in order to promote visual modelling. The reason is that we see a chance that they might explain why the structured approach has been so easily adopted in the field of programming. The field of programming was and still is dominated by text-based specifications, despite the fact that we have seen many initiatives, from syntax-directed editors through computer-aided software engineering to model-driven architecture. It is fair to remark that the crucial characteristic of mere textual specification in the discussion of this section, i.e., the lack of explicit visualization of jumps or, more generally, of support for the understanding of jumps, is actually addressed in professional coding tools like integrated development environments with their maintenance of links, code analyzers and profiling tools. The mere text-orientation of specification has been partly overcome by today’s integrated development environments. Let us express once more that we are no promoters of visual modelling or even visual programming. In [3] we have deemphasized visual modelling. We firmly believe that visualizations add value, in particular if they are combined with visual meta-modelling [10,11]. But we also believe that mere visual specification is no silver bullet, in particular because it does not scale. We believe in the future of a syntax-directed abstract platform with visualization capabilities that overcomes the gap between modelling and programming from the outset, as proposed by the work on AP1 [8,9] of the Software Engineering research group at the University of Auckland.
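The kind of tool support mentioned above, i.e., making jumps explicit, can be sketched in a few lines. The example reads a five-line listing in the style of Fig. 7 and reports each explicit goto as a pair of source and target line. The layout of the listing (one statement per numbered line) and the function name `explicit_jumps` are assumptions made for illustration; the sketch finds only explicit GOTO statements, not the implicit jumps of the WHILE construct.

```python
import re

# Five-line listing in the style of Fig. 7 (layout assumed for illustration).
LISTING = """\
01 WHILE alpha DO
02     A;
03 B;
04 IF beta THEN GOTO 02;
05 C;
"""

def explicit_jumps(listing):
    """Return (source_line, target_line) pairs for every GOTO statement."""
    jumps = []
    for line in listing.splitlines():
        number, _, statement = line.partition(" ")
        match = re.search(r"GOTO\s+(\d+)", statement)
        if match:
            jumps.append((number, match.group(1)))
    return jumps

print(explicit_jumps(LISTING))  # [('04', '02')]
```

An editor could draw one arrow per reported pair, which is the textual counterpart of the edge that a graphical model shows directly.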
6
Structure and Decomposition
The models in Fig. 5 are unfolded versions of the model in Fig. 4. Some modelling elements of the diagram in Fig. 4 occur redundantly in each model in Fig. 5. Such unfolding violates the reuse principle. Let us concentrate on the comparison of the model in Fig. 4 with model (i) in Fig. 5. The arguments are similar for diagram (ii) in Fig. 5. The loop made of the α-decision point and the activity ‘A’ occurs twice in model (i). In the model in Fig. 4 this loop is reused by the jump from the β-decision point, albeit via an auxiliary entry point. It is important to understand that reuse is not about the cost savings of avoiding the repainting of modelling elements but about increasing maintainability.
Imagine that, in the lifecycle of the business process, a change to the loop consisting of the activity ‘A’ and the α-decision point becomes necessary. Such a change could be the change of the condition to another one, the change of the activity ‘A’ to another one, or the refinement of the loop, e.g., the insertion of a further activity into it. Imagine that you encounter the necessity for changes by reviewing the start of the business process. In analyzing the diagram, you know that the loop structure is not only used at the beginning of the business process but also later by a possible jump from the β-decision point to it. You will now further analyze whether the necessary changes are only appropriate at the beginning of the business process or also later, when the loop is reused from other parts of the business process. In the latter case you are done. This is the point where you can get into trouble with the other version of the business process specification, i.e., diagram (i) in Fig. 5. You can more easily overlook that the loop is used twice in the diagram; this is particularly true for similar examples in larger or even distributed models. So, you should have extra documentation for the several occurrences of the loop in the process. Even in the case that the changes are relevant only at the beginning of the process, you would like to review this fact and investigate whether the changes are relevant for other parts of the process. It is fair to remark that, in the case that the changes to the loop in question are only relevant to the beginning of the process, the diagram in Fig. 4 bears the risk that this leads to an invalid model if the analyst overlooks its reuse from later stages in the process, whereas model (i) in Fig. 5 does not bear that risk. But we think this kind of weird fail-safeness can hardly be sold as an advantage of model (i) in Fig. 5.
Furthermore, it is also fair to remark that the documentation of multiple occurrences of a model part can be replaced by appropriate tool support or methodology, like a pattern search feature or hierarchical decomposition, as we will discuss in due course. All this amounts to saying that the maintainability of a model cannot be reduced to its presentation but depends on a consistent combination of presentational issues, appropriate tool support, and defined maintenance policies and guidelines in the framework of a mature change management process. We now turn the reused loop consisting of the activity ‘A’ and the α-decision point in Fig. 5 into a sub-diagram of its own in the sense of hierarchical decomposition, give it a name, let us say ‘DoA’, and replace the relevant regions in diagram (i) in Fig. 5 by the respective expandable sub-diagram activity. The result is shown in Fig. 8. Now, it is possible to state that this solution combines the advantages of both kinds of models in question, i.e., it consists of structured models at all levels of the hierarchy and offers an explicit means of documenting the places of reuse. But a caution is necessary. First, the solution does not free the analyst from actually having to look at all the places where a diagram is used after he or she has made a change to the model, i.e., an elaborated change policy is still needed. In the small toy example, such checking is provoked, but in a tool you usually do not see all sub-diagrams at once, but rather step through the levels of the hierarchy and the sub-diagrams with links. Remember that the usual motivation to introduce hierarchical decomposition and tool support for hierarchical decomposition is the
Fig. 8. Example business process hierarchy
Fig. 9. Example for a deeper business process hierarchy
desire to deal with the complexity of large and very large models. Second, the tool should not only support the reuse-direction but should also support the inverse use-direction, i.e., it should support the analyst with a report feature that lists all places of reuse for a given sub-diagram. Now let us turn to a comparative analysis of the complexity of the modelling solution in Fig. 8 and the model in Fig. 4. The complexity of the top-level diagram in the model hierarchy in Fig. 8 is no longer significantly higher than that of the model in Fig. 4. However, together with the sub-diagram, the modelling solution in Fig. 8 again shows a certain complexity. It would be
possible to deny any reduction of complexity by the solution in Fig. 8 with the argument that the disappearance of the edge representing the jump from the β-decision point into the loop in Fig. 4 is bought by another complex construct in Fig. 8, to wit the dashed line from the activity ‘DoA’ to the targeted sub-diagram. The jump itself can still be seen in Fig. 8, somehow unchanged, as an edge from the β-decision point to the activity ‘A’. We do not think so. The advantage of the diagram in Fig. 8 is that the semantic capsule made of the loop in question is already made explicit as a named sub-diagram, which means an added documentation value. Also, have a look at Fig. 9. Here the above explanations are even more substantive. The top-level diagram is even less complex than the top-level diagram in Fig. 8, because the activity ‘A’ has now moved to a level of the hierarchy of its own. However, this comes at the price that the jump from the β-decision point to the activity ‘A’ in Fig. 4 now re-appears in Fig. 9 as the concatenation of the ‘yes’-branch in the top-level diagram, the dashed line leading from the activity ‘Ado’ to the corresponding sub-diagram at the next level, and the entry edge of this sub-diagram.
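The report feature for the inverse use-direction that is asked for in this section can be sketched as a simple lookup over a model hierarchy. The hierarchy below is a hypothetical one that merely borrows the names ‘DoA’ and ‘Ado’ from Figs. 8 and 9; its contents, the diagram name `Top` and the function name `places_of_use` are assumptions for illustration and do not reproduce the figures.

```python
# Hypothetical hierarchy: each diagram lists the elements it contains in order.
HIERARCHY = {
    "Top": ["DoA", "B", "beta", "DoA", "C"],
    "DoA": ["alpha", "Ado"],
    "Ado": ["A"],
}

def places_of_use(hierarchy, element):
    """List (diagram, position) for every occurrence of the given element."""
    return [(diagram, index)
            for diagram, content in hierarchy.items()
            for index, item in enumerate(content)
            if item == element]

# Before changing sub-diagram 'DoA', report every place that refers to it.
print(places_of_use(HIERARCHY, "DoA"))  # [('Top', 0), ('Top', 3)]
print(places_of_use(HIERARCHY, "Ado"))  # [('DoA', 1)]
```

Such a report does not replace a change policy; it only makes sure that the analyst is shown all places that a change to a sub-diagram may affect.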
7
On Business Domain-Oriented versus Documentation-Oriented Modeling
In Sects. 3 through 6 we have discussed structured business process modelling for those processes that actually have a structured process specification in terms of a chosen fixed set of activities. In this Section we will learn about processes that do not have a structured process specification in that sense. In the running example of Sects. 3 through 6 the fixed set of activities was given by the activities of the initial model in Fig. 4 and again we will explain the modelling challenge addressed in this Section as a model transformation problem.
[Figure: panels (i) and (ii). Labels appearing in the diagrams: reject workpiece due to defects; quality must be improved; handle workpiece; dispose deficient workpiece; amount exceeds threshold; quality insurance; finish workpiece; prepare purchase order; revision is necessary; approve purchase order; submit purchase order.]
Fig. 10. Two example business processes without structured presentation with respect to no other than their own primitives
Fig. 11. Business process with cycle that is exited via two distinguishable paths
Consider the example business process models in Fig. 10. Each model contains a loop with two exits to paths that lead to the end node without the opportunity to come back to the originating loop before reaching the end state. It is known [1,5,6] that the behaviour of such loops cannot be expressed in a structured manner, i.e., by a D-chart as defined in Fig. 1, solely in terms of the same primitive activities as those occurring in the loop. Extra logic is needed to formulate an alternative, structured specification. Fig. 11 shows this loop pattern abstractly, and we proceed to discuss this issue with respect to this abstract model. Assume that there is a need to model the behaviour of a business process in terms of a certain fixed set of activities, i.e., the activities ‘A’ through ‘D’ in Fig. 11. For example, assume that they are taken from an accepted terminology of a concrete business domain. Other reasons could be that the activities stem from existing contract or service level agreement documents. You can also assume that they are simply the natural choice as primitives for the considered work to be done. We do not delve here into the issue of natural choice and just take for granted that the task is to model the observed or desired behaviour in terms of these activities. For example, we could imagine an appropriate notion of cohesion of more basic activities that the primitives we are restricted to, or let us say self-restricted to, adhere to. Actually, as it will turn out, the conclusiveness of our current argumentation does not require an explanation of how a concrete fixed set of activities arises. What it does require is the demand on the activities that they are only about actions and objects that are relevant in the business process. Fig. 12 shows a structured business process model that is intended to describe the same process as the specification in Fig. 11. In a certain sense it fails.
The extra logic introduced in order to get the specification into a structured shape does not belong to the business process that the specification aims to describe. The model in Fig. 12 introduces some extra state, i.e., the Boolean variable δ, extra activities that set this variable so that it has the desired steering effect, and an extra δ-decision point. Furthermore, the original β-decision point in the model of Fig. 11 has been changed to a new β∧δ-decision point. Actually, the restriction of the business process described by Fig. 12 onto those particles used in the model in Fig. 11 is bisimilar to the process described by Fig. 11. The problem is that the model in Fig. 12 is a hybrid. It is no longer purely a business domain-oriented model; it now also has some merely documentation-related parts. The extra logic and state only serve
Fig. 12. Resolution of business process cycles with multiple distinguishable exits by the usage of auxiliary logic and state
the purpose of getting the diagram into shape. This requires a clarification of the semantics. Obviously, it is not intended to change the business process. If the auxiliary state and logic were also about the business process, this would mean, for example, that a mechanism is introduced in the workshop, e.g., a machine or a human actor, that is henceforth responsible for tracking and monitoring a piece of information δ. So, at the very least, we need to explicitly distinguish those elements in such a hybrid model. The question is whether the extra complexity of a hybrid domain- and documentation-oriented modelling approach is justified by the result of having a structured specification.
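The resolution with auxiliary state can be sketched in code. The two functions below are one plausible reading of the loop pattern with two distinguishable exits and of its structured counterpart with a Boolean variable δ; since the figures are not reproduced here, the exact control flow, the function names and the representation of the α- and β-decisions as sequences of Boolean answers are assumptions for illustration. Each function returns the trace of business activities, so the comparison restricts both processes to the activities ‘A’ through ‘D’.

```python
from itertools import product

def unstructured(alpha, beta):
    """Loop that is left via two distinguishable exits (assumed reading)."""
    a, b = iter(alpha), iter(beta)
    trace = []
    while True:
        trace.append("A")
        if not next(b, False):   # beta answers 'no': first exit
            trace.append("D")
            return trace
        trace.append("B")
        if not next(a, False):   # alpha answers 'no': second exit
            trace.append("C")
            return trace

def structured(alpha, beta):
    """Single-exit loop steered by the auxiliary variable delta."""
    a, b = iter(alpha), iter(beta)
    trace = ["A"]
    delta = True                 # extra state, not part of the business process
    while delta and next(b, False):
        trace.append("B")
        if next(a, False):
            trace.append("A")
        else:
            delta = False        # remember which exit was taken
    trace.append("D" if delta else "C")  # extra delta-decision point
    return trace

# Compare the activity traces for all decision sequences of length 0 to 3.
for n in range(4):
    for alpha in product([False, True], repeat=n):
        for beta in product([False, True], repeat=n):
            assert unstructured(alpha, beta) == structured(alpha, beta)
```

The traces agree, but only the second function needs `delta`, the assignment to it and the final test on it, none of which names anything in the business domain. This is the hybrid character discussed above.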
8
Conclusion
At first impression, structured programs and flowcharts appear neat, and programs and flowcharts with arbitrary jumps appear obfuscated, muddle-headed, spaghetti-like, etc. But the question is not to identify a subset of diagrams and programs that look particularly fine. The question is, given a behaviour that needs description, whether it always makes sense to replace a description of this behaviour by a new structured description. What efforts are needed to search for a good alternative description? Is the resulting alternative structured description as nice as the original non-structured description? Furthermore, we need to gain more systematic insight into which metrics we want to use to judge the quality of a description of a behaviour, because categories like neatness or prettiness are not satisfactory for this purpose if we take seriously that our domain of software development should be oriented towards engineering [14,2] rather than towards the arts and that our domain of business management should be oriented rather towards science [4], though, admittedly, both fields are currently still in the stage of pre-paradigmatic research [7]. All these issues form the topic of investigation of this article. For us, the definitely working theory of quality of business process models would be strictly pecuniary, i.e., it would enable us to define a style guide for business process modelling that eventually saves costs in system analysis and software engineering projects. The greater the cost savings realized by the application of such a style guide, the better the theory. Because our ideal is pecuniary, we deal merely with functionality. There is no cover, no aesthetics, no mystics. This means there is no form in the sense of Louis H. Sullivan [16], just function.
References
1. Böhm, C., Jacopini, G.: Flow Diagrams, Turing Machines and Languages With Only Two Formation Rules. Communications of the ACM 3(5) (1966)
2. Buxton, J.N., Randell, B.: Software Engineering – Report on a Conference Sponsored by the NATO Science Committee, Rome, October 1969. NATO Science Committee (April 1970)
3. Draheim, D., Weber, G.: Form-Oriented Analysis – A New Methodology to Model Form-Based Applications. Springer, Heidelberg (2004)
4. Gulick, L.: Management is a Science. Academy of Management Journal 1, 7–13 (1965)
5. Knuth, D.E., Floyd, R.W.: Notes on Avoiding ‘Go To’ Statements. Information Processing Letters 1(1), 23–31, 177 (1971)
6. Rao Kosaraju, S.: Analysis of Structured Programs. In: Proceedings of the 5th Annual ACM Symposium on Theory of Computing, pp. 240–252 (1973)
7. Kuhn, T.S.: The Structure of Scientific Revolutions. University of Chicago Press (December 1996)
8. Lutteroth, C.: AP1 – A Platform for Model-based Software Engineering. In: Draheim, D., Weber, G. (eds.) TEAA 2006. LNCS, vol. 4473, pp. 270–284. Springer, Heidelberg (2007)
9. Lutteroth, C.: AP1 – A Platform for Model-based Software Engineering. PhD thesis, University of Auckland (March 2008)
10. Himsl, M., Jabornig, D., Leithner, W., Draheim, D., Regner, P., Wiesinger, T., Küng, J.: A Concept of an Adaptive and Iterative Meta- and Instance Modeling Process. In: Proceedings of DEXA 2007 - 18th International Conference on Database and Expert Systems Applications. Springer, Heidelberg (2007)
11. Himsl, M., Jabornig, D., Leithner, W., Draheim, D., Regner, P., Wiesinger, T., Küng, J.: Intuitive Visualization-Oriented Metamodeling. In: Proceedings of DEXA 2009 - 20th International Conference on Database and Expert Systems Applications. Springer, Heidelberg (2009)
12. Milner, R.: A Calculus of Communication Systems. LNCS, vol. 92. Springer, Heidelberg (1980)
13. Milner, R.: Communication and Concurrency. Prentice-Hall, Englewood Cliffs (1989)
14. Naur, P., Randell, B. (eds.): Software Engineering – Report on a Conference Sponsored by the NATO Science Committee, Garmisch, October 1968. NATO Science Committee (January 1969)
15. Park, D.: Concurrency and Automata on Infinite Sequences. In: Deussen, P. (ed.) GI-TCS 1981. LNCS, vol. 104, pp. 167–183. Springer, Heidelberg (1981)
16. Sullivan, L.H.: The Tall Office Building Artistically Considered. Lippincott’s Magazine 57, 403–409 (1896)
Information Systems for Federated Biobanks
Johann Eder1, Claus Dabringer1, Michaela Schicho1, and Konrad Stark2
1 Alps Adria University Klagenfurt, Department of Informatics Systems
{Johann.Eder,Claus.Dabringer,Michaela.Schicho}@uni-klu.ac.at
2 University of Vienna, Department of Knowledge and Business Engineering
[email protected]
Abstract. Biobanks store and manage collections of biological material (tissue, blood, cell cultures, etc.) and manage the medical and biological data associated with this material. Biobanks are invaluable resources for medical research. The diversity, heterogeneity and volatility of the domain make information systems for biobanks a challenging application domain. Information systems for biobanks are foremost integration projects of heterogeneous, fast-evolving sources. The European project BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) has the mission to network European biobanks, to improve resources for biomedical research, and thus to contribute to improving the prevention, diagnosis and treatment of diseases. We present the challenges for interconnecting European biobanks and harmonizing their data. We discuss some solutions for searching for biological resources, for managing provenance and for guaranteeing the anonymity of donors. Furthermore, we show how to support the exploitation of such a resource in medical studies with specialized CSCW tools.
Keywords: biobanks, data quality and provenance, anonymity, heterogeneity, federation, CSCW.
1
Introduction
Biobanks are collections of biological material (tissue, blood, cell cultures, etc.) together with data describing this material and its donors and data derived from this material. Biobanks are of eminent importance for medical research: for discovering the processes in living cells, the causes and effects of diseases, the interaction between genetic inheritance and life style factors, or for the development of therapies and drugs. Information systems are an integral part of any biobank, and efficient and effective IT support is mandatory for the viability of biobanks. For example: a medical researcher wants to find out why a certain liver cancer generates a great number of metastases in some patients and not in others. This knowledge would help to improve the prognosis, the therapy, the selection
The work reported here was partially supported by the European Commission 7th Framework program - project BBMRI and by the Austrian Ministry of Science and Research within the program Gen-Au - project GATIB.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 156–190, 2009. © Springer-Verlag Berlin Heidelberg 2009
of therapies and drugs for a particular patient, and help to develop better drugs. For such a study the researcher needs, besides biological material (cancer tissue), an enormous amount of data: clinical records of the patients donating the tissue, lab analyses, microscopic images of the diseased cells, information about the life style of patients, genotype information (e.g. genetic variations), phenotype information (e.g. gene expression profiles), etc. Gathering all these data in the course of a single study would be highly inefficient and costly. A biobank is supposed to deliver the data needed for this type of research and to share the data and material among researchers. From the example above it is clear that information systems for biobanks are huge applications. The challenge is to integrate data stemming from very different autonomous sources. So biobanks are foremost integration and interoperability projects. Another important issue is the dynamics of the field: new insight leads to more differentiated diagnoses, and new analysis methods allow the assessment of additional measurements or improve the accuracy of measurements. So an information system for biobanks will be continuously evolving. And last but not least, biobanks store very detailed personal information about donors. Protecting the privacy and anonymity of the donors is mandatory, and misuse of the stored information has to be precluded. In recent years biobanks have been set up in various organizations, mainly hospitals and medical and pharmaceutical research centers. Since material and data are a scarce resource for medical research, the sharing of the available material within the research community has increased. This leads to the desire to organize the interoperation of biobanks in a better way.
The European project BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) has the mission to network European biobanks, to improve resources for biomedical research, and thus to contribute to improving the prevention, diagnosis and treatment of diseases. BBMRI is organized in the framework of the European Strategy Forum on Research Infrastructures (ESFRI). In this paper we give a broad overview of the requirements for IT systems for biobanks, present the architecture of information systems supporting biobanks, discuss possible integration strategies for connecting European biobanks and discuss the challenges for this integration. Furthermore, we show how such an infrastructure can be used and present a support system for medical research using data from biobanks. The purpose of this paper is to paint the whole picture of the challenges of data management for federated biobanks rather than to present detailed technical solutions. This paper is an extended version of [20].
2
What Are Biobanks?
A biobank, also known as a biorepository, can be seen as an interdisciplinary research platform that collects, stores, processes and distributes biological materials and the data associated with those materials. In short: biobank = biological material + data. Typically, those biological materials are human biospecimens such as tissue, blood or body fluids - and
the data are the donor-related clinical information of that biological material. Human biological samples in combination with donor-related clinical data are essential resources for the identification and validation of biomarkers and the development of new therapeutic approaches (drug discovery), especially in the development of systems biology approaches to study disease mechanisms. Furthermore, they are used to explore and understand the function and medical relevance of human genes, their interaction with environmental factors and the molecular causes of diseases [16]. Besides human-driven biobanks, a biobank can also include samples from animals, cell and bacterial cultures, or even environmental samples. Biobanks have become a major issue in the fields of genomics and biotechnology in recent years. Depending on the type of stored samples and the medical-scientific domain, biobanks can differ in many ways.
2.1
The Variety of Biobanks and Medical Studies
The development of biobanks has resulted in a very heterogeneous concept. Each biobank pursues its own strategy and makes specific demands on the quality and annotation of the collected samples. According to [39] we distinguish between three major biobank types, considering exclusively human-driven biobanks:
1. Population based biobanks. Population based cohorts are valuable for assessing the natural occurrence and progression of common diseases. They contain a huge number of biological samples from healthy or diseased donors, representative of a concrete region or ethnic cohort (population isolates) or of the general population over a large period of time (longitudinal population). Examples of large population based biobanks are the Icelandic DeCode Biobank and UK Biobank.
2. Disease-oriented biobanks. Their mission is to obtain biomarkers of disease through prospective and/or retrospective collections of e.g. tumour and non-tumour samples with derivatives such as DNA/RNA/proteins. The collected samples are associated with clinical data and clinical trials [39]. This group of biobanks typically consists of pathology archives like the MUG Biobank Graz.
A special kind of biobank is the twin registry, such as the GenomeEUtwin biobank, which contains approximately equal numbers of monozygotic (MZ) and dizygotic (DZ) twins. With such biobanks the parallel dissection of effects of genetic variation in a homogeneous environment (DZ twins) and of environmental effects against an identical genetic background (MZ twins) is possible [52]. These registries are also partially suited to distinguish between the genetic and non-genetic basis of diseases.
In [18] Cambon-Thomsen shows that biobanks can vary in size, access mode, status of institution or scientific sector in which the samples were collected:
1. Medical and academic research. Medical genetic studies of disease usually involve small case- or family-based repositories. Population-based collections, which are also usually small, have been used for academic research for a long period of time. Some large epidemiological studies have also involved the collection of a large number of samples.
2. Clinical case/control studies. The primary use of samples collected in hospitals is for informing diagnosis, for the clinical or therapeutic follow-up, and for the discovery or validation of genetic and non-genetic risk factors. Large numbers of tissue sections have been collected by pathology departments over the years. Transplantations using cells, tissues or even organs from unrelated donors have also led to the development of tissue and cell banks.
3. Biotechnology domain. Within this domain, collections of reference cell lines (e.g. cancer cell lines or antibody-producing cell lines) and stem cell lines of various origins are obtained. They are mainly used in biotechnology research and development.
4. Judiciary domain. Biobanks host large collections of different sources of biological material, data and DNA fingerprints, which have very restricted usage.
2.2
Collected Material and Stored Data
Biobanks are not something new in the world of medicine and biological research. The systematic collection of human samples goes back to the 19th century, including formaldehyde-fixed, paraffin-embedded or frozen material [27]. Most biobanks are developed in order to support a research program on a specific type of disease or to collect samples from a particular group of donors. Due to the large resource requirements, biobanks within an institution usually merge to reduce costs. As a result of this merging, biobanks typically hold many kinds of samples and many different types (also called domains) of data.
Material / Sample Types. Samples can include any kind of tissue, fluid or other material that can be obtained from an individual. Usually, biospecimens in a biobank are blood and blood components (serum), solid tissues such as small biopsies and so on. An important collection in biobanks are the so-called normal samples, i.e. tissue samples which are free of diagnosed disease. For instance, in some kinds of medical research (e.g. case/control studies) it is important that corresponding normal samples exist for the diseased samples, so that they can be used as controls in order to get to the bottom of specific diseases or gene mutations. The biological samples can be collected in various ways. Samples may be taken in the course of diagnostic investigations as well as during the treatment of diseases. For instance, in biopsies small human tissue specimens are obtained in order to determine the type of a cancer. Surgical resections of tumours provide larger tissue samples, which may be used to specify the type of disease treatment. Autopsies are another valuable source of human tissues, where specimens may be taken from various locations which reflect the effects of a disease in different organs of a patient. The obtained biological materials are specially preserved to keep them durable over a long time.
Sample Data. The data stored about a donor, which accompany the collected sample, can be very extensive and varied. According to [1] these data include:
– General information (e.g. race, gender, age, ...)
– Lifestyle and environmental information (e.g. smoker or non-smoker, living in a big city with high environmental pollution or living in a rural area)
– History of present illnesses, treatments and responses (e.g. prescribed drugs and adverse reactions)
– Longitudinal information (e.g. a sequence of blood tests after tissue collection in order to track the progression of diseases)
– Clinical outcomes (e.g. success of the treatment: is the donor still living?)
– Data from gene expression profiles, laboratory data, ...
Technically, the types of data range from typical record keeping, over text and various forms of images, to gene vectors and 3-D chemical structures.
Ethical and Legal Issues. Donors of biological materials must be informed about the purpose and intended use of their samples. Typically, the donor signs an informed consent [10] which allows the use of samples for research and obliges the biobank institution to guarantee the privacy of the donor. The usage of material in studies usually requires approval by ethics boards. Since the identity of donors must be protected, the relationship between a donor and his or her sample must not be revealed. Technical solutions for guaranteeing privacy are discussed in section 6.
2.3
Samples as a Scarce Resource
Biobanks prove to be a key resource for increasing the effectiveness of medical research. Human biological samples are not only costly, they are also available only in a limited amount. For example, once a piece of liver tissue is cut off and used for a study, this piece of tissue is expended. Therefore, it is important to avoid redundant analyses and to achieve the most efficient and effective use of non-renewable biological material [17]. A common and synergetic usage of this resource will enable many research projects, especially in the case of rare diseases with very limited material available. In silico experiments [46] play an important role in the context of biobanks. The aim is to answer as many research questions as possible without access to the samples themselves. Therefore, already acquired data about samples are stored in databases and shared among interested researchers. Modern biobanks thus offer the possibility to decrease the long-term costs of research and development and to make data acquisition and usage more effective.
3
Biobanks as Integration Project
In section 2 we already mentioned that biobanks may contain various types of collections of biological materials. Depending on the type of biobank, its organizational and research environment, human tissue, blood, serum, isolated
RNA/DNA, cell lines or others can be archived. Apart from the organizational challenges, an elaborate information system is required for capturing all relevant information about samples, managing borrow and return activities and supporting complex search enquiries.
3.1
Sample Management in Biobanks
An organizational entity that is commissioned to establish and operate a biobank requires suitable storage facilities (e.g. cryo-tanks, paraffin block storage systems) as well as security measures for protecting samples from damage and for preventing unauthorized access. If a biobank is built on the basis of existing resources (material and data), a detailed evaluation is essential. The collection process, the inventory and the documentation of samples have to be assessed, evaluated and optimized. The increasing number of biobanks all over the world has drawn the attention of international organizations, encouraging the standardization of the processes and of the sample and data management of biobanks.
Standardization of Processes. Managing a biobank is a dynamic process. The biological collections may grow continuously, additional collections may be integrated and samples may be used in medical studies and research projects. Standard operating procedures are required for the most relevant processes, defining competencies, roles, control mechanisms and documentation protocols. For example, comparing two or more gene expression profiles computed by different institutes is only meaningful if all gene expressions were determined by the same standardized process. The Organization for Economic Cooperation and Development (OECD) released the definition of Biological Resource Centers (BRC), which "must meet the high standards of quality and expertise demanded by the international community of scientists and industry for the delivery of biological information and materials" [8]. BRCs are certified institutions providing high quality biological material and information. The model of BRCs may assist the consolidation process of biobanks by defining quality management and quality assurance measures [38]. Guidelines for implementing standards for BRCs may be found in recent works such as [12,35].
Data and Data Sources. Clinical records and diagnoses are frequently available as semi-structured data or even stored as plain text in legacy medical information systems. The data available from diverse biobanks include not only numeric and alphabetic information but also complex images such as microphotographs of pathology sections, pictures generated by medical imaging procedures as well as graphical representations of the results of analytical diagnostic procedures [11]. The volumes of data involved are very large. In some places data or findings are even archived only in printed form or stored in heterogeneous formats.
3.2
Different Kinds of Heterogeneity
Since biobanks may involve many different datasources it is obvious that heterogeneity is ever-present. Biobanks may comprise interfaces to sample management
systems, laboratory information systems, research information systems, etc. The origin of the heterogeneity lies in different data sources (clinical, laboratory systems, etc.), hospitals, research institutes and also in the evolution of the involved disciplines. Heterogeneity appearing in biobanks comes in various forms and can be divided into two different classes. The first class of heterogeneity can be found between different datasources. This kind of heterogeneity is mostly caused by the independent development of the different datasources. Here we have to deal with several different types of mismatches which all lead to heterogeneity between the systems, as shown in [43,32]. Typical mismatches can be found in:
– Attribute namings (e.g. disease vs DiseaseCode)
– Different attribute encodings (e.g. weight in kg vs lb)
– Content of attributes (e.g. homonyms, synonyms, ...)
– Precision of attributes (e.g. sample size: small, medium, large vs mm³, cm³)
– Different attribute granularity
– Different modeling of schemata
– Multilingualism
– Quality of the data stored
– Semi-structured data (incompleteness, plain-text, ...)
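The first two kinds of mismatch in the list above (attribute naming and attribute encoding) can be illustrated with a minimal sketch. It is not taken from any of the systems discussed here: the source names `clinic_a` and `clinic_b`, the global attribute names and the sample values are assumptions chosen for illustration only.

```python
# Hypothetical sketch: map records from two autonomous sources onto one
# global schema. All source names, attribute names and values are invented.
LB_PER_KG = 2.20462262185  # pounds per kilogram

# Per source: local attribute name -> (global attribute name, converter)
SOURCE_MAPPINGS = {
    "clinic_a": {
        "disease": ("disease_code", str.upper),
        "weight_kg": ("weight_kg", float),
    },
    "clinic_b": {
        "DiseaseCode": ("disease_code", str.upper),
        "weight_lb": ("weight_kg", lambda v: float(v) / LB_PER_KG),
    },
}


def to_global(source, record):
    """Rename and re-encode the attributes of one local record."""
    mapping = SOURCE_MAPPINGS[source]
    result = {}
    for local_name, value in record.items():
        if local_name not in mapping:
            continue  # unmapped attributes are simply dropped in this sketch
        global_name, convert = mapping[local_name]
        result[global_name] = convert(value)
    return result


record_a = to_global("clinic_a", {"disease": "c22.0", "weight_kg": "70"})
record_b = to_global("clinic_b", {"DiseaseCode": "C22.0", "weight_lb": 154.3})
```

Both records end up with the same attribute names and the weight in kilograms. The remaining mismatches in the list (synonyms, granularity, multilingualism, data quality) cannot be resolved by such a simple lookup table and need domain knowledge.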
The second class of heterogeneity is the heterogeneity within one datasource. This kind of heterogeneity may not be recognized at a first glance. But as [39] show the scientific value of biobank content increases with the amount of clinical data linked to certain samples (see figure 1). The longer data will be kept in biobanks the greater its scientific value is. On the other hand keeping data in biobanks for a long time leads to heterogeneity because medical progress leads to changes in database structures and the modeled domain. Modern biobanks support time series analysis and use of material harvested over a long period of time. A particular difficulty for these uses of material and data is the correct
Fig. 1. The correlation of scientific value of biobank content and availability of certain content is shown. One can clearly see that the scientific value increases where the availability of data decreases [39].
representation and treatment of changes. Some exemplary changes that arise in this context are:
– Changes in disease codes (e.g. ICD-9 to ICD-10 [6] in the year 2000)
– Progress in biomolecular methods results in higher accuracy of measurements
– Extension of relevant knowledge (e.g. GeneOntology [4] is changed daily)
– Treatments and standard procedures change
– Quality of sample conservation increases, etc.
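A change such as the first item above can be represented by recording, for every coded value, the version of the coding scheme it was captured in, together with a mapping between versions. The following is a minimal sketch under our own assumptions: the class and function names are hypothetical, and the single mapping entry is an illustrative placeholder, not an authoritative ICD-9 to ICD-10 crosswalk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedDiagnosis:
    code: str
    version: str  # coding scheme version in which the code was recorded


# (from_version, to_version) -> {old code: new code}
# Illustrative entry only; a real mapping would come from an official source.
VERSION_MAPPINGS = {
    ("ICD-9", "ICD-10"): {"155.0": "C22.0"},
}


def transform(diagnosis, target_version):
    """Return the diagnosis in the target version, or None if unmappable."""
    if diagnosis.version == target_version:
        return diagnosis
    table = VERSION_MAPPINGS.get((diagnosis.version, target_version))
    if table is None or diagnosis.code not in table:
        return None  # no mapping known; the record needs manual review
    return CodedDiagnosis(table[diagnosis.code], target_version)


records = [
    CodedDiagnosis("155.0", "ICD-9"),
    CodedDiagnosis("C22.0", "ICD-10"),
    CodedDiagnosis("999.9", "ICD-9"),
]
unified = [transform(r, "ICD-10") for r in records]
```

Returning `None` for unmappable codes, instead of guessing, keeps it visible which records cannot safely be used together in a study that spans the change.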
Furthermore, the technical aspects within one biobank are volatile as well: data structures, semantics of data, parameters collected, etc. When starting a biobank project one must be aware of the above-mentioned changes. Biobanks should be augmented to represent these changes. This representation can then be used to reason about the appropriateness of using a certain set of data or material together in specific studies.
3.3
Evolution
Wherever possible, biobanks should provide transformations to map data between different versions. Using ontologies to annotate the content of biobanks can be quite useful. By providing mapping support between different ontologies the longevity problem can be addressed. Furthermore, versioning and transformation approaches can help to support the evolution of biobanks. Techniques from temporal databases and temporal data warehouses can be used for the representation of volatile data together with version mappings to transform all data to a selected version [19,26,51,21,22]. This knowledge can be directly applied to biobanks as well.
3.4
Provenance and Data Quality
From the perspective of medical research, the value of biological material is tightly coupled with the amount, type and quality of associated data. However, medical studies usually require additional data that cannot be directly provided by biobanks or is not available at all. The process of collecting relevant data for biospecimens is denoted as sample annotation and is usually done in the context of prospective studies based on specified patient cohorts. For instance, if the family anamnesis is to be collected for a cohort of liver cancer patients, various preprocessing and filtering steps are necessary. In some cases, different medical institutes or even hospitals have to be contacted. Patients have to be identified correctly in external information systems, and anamnesis data is extracted and collected in predefined data structures or simple spreadsheets. The collected data is combined with the data supplied by the biobank and constitutes the basis for hypotheses, analyses and experiments. Thus, additional data is created in the context of studies: predispositions for diseases, gene expression profiles, survival analyses, publications etc. The collected and created data represents an added value for biospecimens, since it may be used in related or retrospective studies. Therefore, if a biobank is committed to support collaborative research
activities, an appropriate research platform is required. Generally, the aspects of contextualization and provenance have to be considered by such a system. Contextualization is the ability to link and annotate an object (sample) in different contexts. A biospecimen may be used in various studies and projects and may be assigned different annotations in each of them. These annotations have to be processable to allow efficient retrieval. Contextualization also allows contexts to be organized and interrelated. That is, study data is accessible only to selected groups and persons having predefined access rights. Related studies and projects may be linked to each other, so that collaboration and data sharing are supported. The MUG biobank uses a service-oriented CSCW system as an integrated research platform. More details about the system are given in section 5.1. Data provenance is used to document the origin of data and tracks all transformation steps that are necessary to re-access or reproduce the data. It may be defined as the background knowledge that enables a piece of data to be interpreted and used correctly within context [37]. Alternatively, data provenance may be seen from a process-oriented perspective. Data may be transformed by a sequence of processing steps which could be small operations (SQL joins, aggregations), the result of tools (analysis services) or the product of a human annotation. Thus, these transformations form a "construction plan" of a data object, which could be reasonably used in various contexts. Traceable transformation processes are useful for documentation purposes, for instance, for the materials and methods section of publications. Generally, the data quality may be improved due to the transparency of data access and transformation. Moreover, relevant processes may be marked as standard or learning processes. That is, new project participants may be introduced to research methodology or processes by using dedicated processes.
Another important aspect is the repeatability of processes. If a data object is the result of an execution sequence of services, and all input parameters are known, the entire transformation process may be repeated. Further, processes may be restarted with slightly modified parameters, or intermediate results may be preserved to optimize re-executions [14]. Stevens et al. [46] point out an additional type of provenance: organizational provenance. Organizational provenance comprises information about who has transformed which data in which context. This kind of provenance is closely related to collaborative work and is applicable to medical cooperation projects. As data provenance has attracted more and more attention, it has been integrated into well-established scientific workflow systems [44]. For instance, provenance recording was integrated into the Kepler [14], Chimera [24] and Taverna [34] workflow systems. Therefore, scientific research platforms for biobanks could learn from the progress of data provenance research and incorporate suitable provenance recording mechanisms. In the context of biobanks, provenance recording can be applied to sample or medical-case data. If a data object is integrated from external sources (for instance, the follow-up data of a patient), its associated provenance record may include an identification of its source as well as its access method. Additionally, the process of data generation during research activities may be documented.
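Provenance recording of this kind could be sketched as follows. This is a minimal illustration under our own assumptions, not the mechanism of Kepler, Chimera or Taverna: the record fields, the function `lineage` and all object names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    output: str                      # identifier of the produced data object
    process: str                     # service, tool or manual step that produced it
    inputs: list = field(default_factory=list)  # data objects that were consumed
    source: str = ""                 # external origin, if the object was imported


def lineage(records, target):
    """Collect every data object that directly or indirectly fed into target."""
    by_output = {r.output: r for r in records}
    seen = set()
    stack = [target]
    while stack:
        record = by_output.get(stack.pop())
        if record is None:
            continue  # object without a record, e.g. a physical sample
        for parent in record.inputs:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


records = [
    ProvenanceRecord("follow_up_p17", "import", source="external registry"),
    ProvenanceRecord("expression_profile_s42", "microarray analysis",
                     inputs=["sample_s42"]),
    ProvenanceRecord("survival_analysis", "statistics service",
                     inputs=["follow_up_p17", "expression_profile_s42"]),
]
used = lineage(records, "survival_analysis")
```

Here `used` contains the imported follow-up data, the expression profile and the sample it was derived from, which is the kind of information needed to repeat or document an analysis.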
If a data object is the result of a certain analysis, or is based on the processing of several other data objects, the transformation is captured in corresponding provenance records. If all data transformations are collected by recording the input and output data in relations, data dependency graphs may be built. From these graphs, data derivation graphs may be easily computed to answer provenance queries like: which input data was used to produce result X? Data provenance is closely related to the above-mentioned contextualization of objects. While contextualization makes it possible to combine and structure objects from different sources, data provenance provides the inverse operation: it makes it possible to trace the origins of the data objects used. Thus, provenance has an important role regarding data quality assurance as it documents the process of data integration.
3.5
Architecture - Local Integration
Sample Management Systems. The necessity of elaborate sample management systems for biobanks was recognized in several biobank initiatives. However, depending on the type of biobank and its research focus, different systems have been implemented. For instance, the UK Biobank adapted a laboratory information management system (LIMS) supporting high-throughput data capturing and automated quality control of blood and urine samples [23]. Since the UK Biobank is based on prospective collections of biological samples from more than 500,000 participants, the focus is clearly on optimized data capturing of samples and automation techniques such as barcode reading. The UK Biorepository Information system [31] strives to support multicenter studies in the context of lung cancer research. A manageable number of samples is captured and linked to various types of data (lifestyle, anamnesis data). As commercial systems lack flexibility and customization capabilities, a proprietary information system was designed and implemented. Another interesting system was presented in [15], supporting blood sample management in the context of cancer research studies. The system clearly separates donor-related information (informed consents, contact information) from storage information of blood specimens and data extracted from epidemiologic questionnaires. In the context of the European Human Frozen Tissue Bank (TuBaFrost), a central tissue database for storing patient-related, tissue-related and image data was established [30]. Since biobanks typically provide material for medical research projects and studies, they are confronted with very detailed requirements from the medical domain. For instance, the following enquiry was sent to the MUG Biobank: A researcher requires about 10 samples with the following characteristics:
– male patients
– paraffin tissue
– with diagnosis liver cancer
– including follow-up data (e.g. therapy description) from the oncology
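Expressed against a sample inventory, such an enquiry amounts to a conjunction of criteria on the sample, the medical case and the patient. The following minimal sketch uses a flat, hypothetical record layout; the field names and the inventory are assumptions for illustration and do not reflect the actual schema of the MUG biobank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    sample_id: str
    sample_type: str          # e.g. "paraffin", "cryo"
    diagnosis: str            # diagnosis of the medical case
    patient_sex: str
    follow_up_sources: tuple  # departments that supplied follow-up data
    available: bool           # material still in stock


def matches_enquiry(s):
    """All criteria of the example enquiry must hold at the same time."""
    return (s.available
            and s.sample_type == "paraffin"
            and s.diagnosis == "liver cancer"
            and s.patient_sex == "male"
            and "oncology" in s.follow_up_sources)


def answer_enquiry(inventory, wanted=10):
    """Return at most `wanted` identifiers of matching samples."""
    return [s.sample_id for s in inventory if matches_enquiry(s)][:wanted]


inventory = [
    SampleRecord("S1", "paraffin", "liver cancer", "male", ("oncology",), True),
    SampleRecord("S2", "cryo", "liver cancer", "male", ("oncology",), True),
    SampleRecord("S3", "paraffin", "liver cancer", "female", ("oncology",), True),
    SampleRecord("S4", "paraffin", "liver cancer", "male", (), True),
    SampleRecord("S5", "paraffin", "liver cancer", "male", ("oncology", "surgery"), True),
]
selected = answer_enquiry(inventory)
```

In a real system these attributes live in different subsystems (sample storage, pathology, clinical follow-up), so answering the enquiry requires joining data across them rather than filtering a single list.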
The enquiry above illustrates the diversity of search criteria that may be combined in a single enquiry. Criteria are defined on the type of sample (= paraffin tissue), the availability of material (= quantity available), on the medical case
or diagnosis (= liver cancer) and on the patient, as patient sex (= male) and follow-up data (= therapy description) are required. A catalogue of example enquiries should be included in the requirements analysis of the sample management system, as the system specification and design have to be strongly tailored to medical domain requirements. Another challenge lies in an appropriate representation of courses of disease. Patients suffering from cancer may be treated over several years and various tissue samples may be extracted and diagnosed. If a medical study on cancer is based on tissues of primary tumors and corresponding metastases, it is important that the temporal dependency between the diagnosis of primary tumors and metastases is captured correctly. Further, the causal dependency between the primary tumor and the metastasis needs to be represented. Otherwise, a query would also return tissues with metastases of secondary tumors. However, the design of a sample management system is strongly determined by the type, the quality and the structure of available data. As already mentioned, clinical records and diagnoses are frequently available as semi-structured data or even stored as plain text in legacy medical information systems. SampleDB MUG. In the following we give a brief overview of the sample management system (SampleDB) of the MUG biobank. We present this system as an exemplary solution for a biobank integration project. For the sake of simplicity we only present the core elements of the UML database schema in Figure 2. Generally, we may distinguish between three main perspectives: the sample-information perspective, the medical case perspective and the sample-management perspective. The sample-information perspective comprises information immediately related to the stored samples. All samples have a unique identifier, which is a histological number assigned by the instantaneous section of the pathology.
Depending on the type of sample, different quality and quantity-related attributes may be stored. For instance, cryo tissue samples are conserved in vials in liquid nitrogen tanks. By contrast, paraffin blocks may be stored in large-scale robotic storage systems. Further, different attributes specifying the quality of samples exist, such as the ischemic time of tissue or the degree of contamination of cell lines. Therefore, a special table exists for each type of sample (paraffin, cryo tissue, blood samples etc.). Samples may be used as basic material for further processing. For instance, RNA and DNA may be extracted from paraffin-embedded tissues. The different usage types of samples are modelled by the bottom classes in the schema. Operating a biobank requires efficient sample management, including inventory changes, documentation of sample usage and storing of cooperation contracts. The classes Borrow, Project and Coop Partner document which samples have left the biobank in which cooperation project and how many samples were returned. Since samples are a limited resource, they may be used up in the context of a research project. For instance, paraffin-embedded tissues may be used to construct a tissue microarray, a high-throughput analysis platform for epidemiology-based studies or gene expression profiling of tumours [28]. Thus, for ensuring the sustainability of sample collections, appropriate guidelines and policies are required. In this context, samples of rare diseases are of special interest, since
Fig. 2. Outline of the CORE schema from the SampleDB (MUG biobank)
they represent invaluable resources that may be used in multicenter studies [17]. The medical case perspective allows for assessing the relevant diagnostic information of samples. In the case of the MUG biobank, pathological diagnoses are captured and assigned to the corresponding samples. Since the MUG biobank has a clear focus on cancerous diseases, tumour-related attributes such as tumour grading and staging or the International Classification of Diseases for Oncology (ICD-O-3) classification are used [7]. Patient-related data is split into two tables: sensitive data such as personal data is stored in a separate table, while an anonymous patient table contains a unique patient identifier. Personal data of patients are not accessible to staff members of biobanks. However, medical doctors may access sample and person-related data as part of their diagnostic or therapeutic work.
3.6
Data Integration
Data integration in the biomedical area is an emerging topic, as the linkage of heterogeneous research, clinical and biobank information systems becomes more and more important. Generally, several integration architectures may be applied
for incorporating heterogeneous data sources. Data warehouses extract and consolidate data from distributed sources in a global database that is optimized for fast access. In database federations, on the other hand, data is left at the sources, and data access is accomplished by wrappers that map a global schema to distributed local schemas [33]. Although database federations deliver data that is up-to-date, they do not provide the same performance as data warehouses. However, they do not require redundant data storage and expensive periodic data extraction. In the context of the MUG biobank several types of information systems are accessed, as illustrated in figure 3. The different data sources are integrated in a database federation, for which interface wrappers have been created for the relevant data. On the one hand, there are large clinical information systems which are used for routine diagnostic and therapeutic activities of medical doctors. Patient records from various medical institutes are stored in the OpenMedocs system, pathological data in the PACS system and laboratory data in the laboratory information system LIS. On the other hand, research databases from several institutes (e.g. the Archimed system) containing data about medical studies are incorporated, as well as the biological sample management system SampleDB and diverse robot systems. Further, survival data of patients is provided by the external institution Statistics Austria. Clinical and routine information systems (at the bottom of figure 3) are strictly separated from operational information systems of the biobank. That is, sensitive patient-related data is only accessible for medical
Fig. 3. Data Integration in context of the MUG Biobank
staff and anonymized otherwise. The MUG Biobank operates its own documentation system in order to record and coordinate all cooperation projects. The CSCW system at the top of the figure provides a scientific workbench for internal and external project partners, allowing them to share data, documents, analysis results and services. A more detailed description of the system is given in section 5.1. A modified version of the CSCW workbench will be used as the user interface for the European biobank initiative BBMRI, described in section 4.1.
3.7
Related Work
UK Biobank. The aim of UK Biobank is to store health information about 500,000 people from all around the UK who are aged between 40 and 69. UK Biobank has evolved over several years. Many subsystems, processes and even the system architecture have been developed from experience gathered during pilot operations [5]. UK Biobank integrated many different subsystems to work together. To ensure access to a broad range of third-party data sets it was essential that UK Biobank meets the needs of other relevant groups (e.g. the Patient Information Advisory Group). Many external requirements had to be taken into consideration to fulfil those needs [13]. Figure 4 shows a system overview of the UK Biobank and its most important components. The recruitment system is responsible for processing patient invitation data received from the National Health Service. The received data has to be cleaned and passed to the Participant Booking System. The Booking System securely transfers appointment data (name, date of birth, gender, address, ...) to the
Fig. 4. System architecture showing the most important system components of the UK Biobank [13]
J. Eder et al.
Assessment Data Collection System. The Assessment Center also handles the informed consent of each participant. The task of the LIMS is to store identifiers for all received samples without any participant-identifying data such as name, address, etc. The UK Biobank also provides interfaces to clinical and non-clinical external data repositories. The Core Data Repository, which comprises several data repositories, forms the basis for several Data Warehouses. These Data Warehouses provide all parameters needed to generate appropriate datasets for answering validated research requests. Disclosure control, which prevents the re-identification of patients, is also performed on these Data Warehouses. The research community is able to post requests with the help of a User Portal which is positioned on top of the Data Warehouses. Additional Query Tools allow investigating the Data Warehouses as well as the Core Data Repository [13].
caBIG. The cancer Biomedical Informatics Grid (caBIG) was initiated by the National Cancer Institute (NCI) as a national-scale effort to develop a federation of interoperable research information systems. Federated interoperability is achieved through a grid middleware infrastructure called caGrid, which is designed as a service-oriented architecture. Resources are exposed to the environment as grid services with well-defined interfaces. Interaction between services and clients is supported by grid communication and service invocation protocols. The caGrid infrastructure consists of data, analytical and coordination services which are required by clients and services for grid-wide functions. According to [40], coordination services include services for metadata management, advertisement and discovery, query and security. A key characteristic of the framework is its focus on metadata and model-driven service development and deployment. 
This aspect of caGrid is particularly important for the support of syntactic and semantic interoperability across heterogeneous collections of applications. For more information see [40,3].
CRIP. The concept of CRIP (Central Research Infrastructure for molecular Pathology) enables biobanks to annotate projects with additional necessary data and to turn them into valuable research resources. CRIP was started at the beginning of 2006 by the departments of pathology of the Charité and the Medical University of Graz (MUG) [41]. CRIP offers virtual simultaneous access to the tissue collections of the participating pathology archives. Annotated data comes from heterogeneous data sources and is stored in a central CRIP database. Academics and researchers with access rights are allowed to search for material of interest. Workflows and data transfers of CRIP projects are regulated in a special contract between the CRIP partners and Fraunhofer IBMT.
4 Federation of Biobanks
Currently established national biobanks and biomolecular resources are a unique European strength, but valuable collections typically suffer from the fragmentation of the European biobanking-related research community. This hampers the collation of
biological samples and data from different biobanks required to achieve sufficient statistical power. Moreover, it results in duplication of effort and jeopardises sustainability due to the lack of long-term funding. To overcome these issues, a federation of biobanks can provide access to comprehensive data and sample sets and thus achieve results with better statistical power. Furthermore, it becomes possible to investigate rare and highly diverse diseases and to save the high costs caused by duplicate analysis of the same material or data. To benefit European health care, medical research and, ultimately, the health of the citizens of the European Union, the European Commission is funding a biobank integration project called BBMRI (Biobanking and Biomolecular Resources Infrastructure).
4.1 Biobanking and Biomolecular Resources Infrastructure
The aim of BBMRI is to build a coordinated, large-scale European infrastructure of biomedically relevant, quality-assessed, mostly already collected samples as well as different types of biomolecular resources (antibody and affinity binder collections, full ORF clone collections, siRNA libraries). In addition to biological materials and related data, BBMRI will facilitate access to detailed and internationally standardised data sets of sample donors (clinical data, lifestyle and environmental exposure) as well as data generated by analysis of samples using standardised analysis platforms. A large number of platforms (such as high-throughput sequencing, genotyping, gene expression profiling technologies, proteomics and metabolomics platforms, tissue microarray technology etc.) will be available through the BBMRI infrastructure [2].
Benefits. The benefits of BBMRI are manifold. In the short term, BBMRI leads to an increased quality of research as well as to a reduction of costs. The mid-term impact of BBMRI can be seen in an increased efficacy of drug discovery and development. Long-term benefits of BBMRI are improved health care possibilities in the area of personalized medicine/health care [11].
Data Harmonisation and IT-infrastructure. An important part of BBMRI is responsible for designing the IT-infrastructure and database harmonisation, which also includes solutions for data and process standardization. The harmonization of data deals with identifying the scope of the needed information and data structures. Furthermore, it analyses how available nomenclature and coding systems can be used for storing and retrieving (heterogeneous) biobank information. Several controlled terminologies and coding systems may be used for organizing the information about biobanks [11,35]. Since not all medical information is fully available in the local databases of biobanks, the retrieval of data involves big challenges. 
That implies the necessity of flexible data sharing and collaboration between centers.
4.2 Enquiries in a Federation of Biobanks
Within an IT-infrastructure for federated biobanks authorized researchers should have the possibility to search and obtain required material and data from all
participating biobanks, necessary e.g. to perform biomedical studies. Furthermore, it should even be possible to update or link data from already performed studies in the European federation. In the following we distinguish between five different kinds of use cases:
1. Identification of biobanks. Retrieves a list with contact data of participating biobanks which have the desired material for a certain study.
2. Identification of cases. Retrieves the pseudonym identifiers of cases (see footnote 1) stored in biobanks which correspond to a given set of parameters.
3. Retrieval of data. Obtains available information (material, data, etc.) directly from a biobank for a given set of parameters.
4. Upload or linking of data. Connects samples with data generated from these samples internally and externally.
5. Statistical queries. Performs analytical queries on a set of biobanks.
Extracting, categorizing and standardizing mostly semi-structured records is a strenuous task requiring medical domain knowledge and strong quality control. Therefore, an automated retrieval of data involves big challenges. Furthermore, enabling the retrieval, upload and linking of data requires a lot of research, harmonization and integration. For the moment we assume that researchers use the contact information and pseudonym identifiers to retrieve data from a biobank. An important issue in the upload and linking of data is the question of how the data has been generated. The generation of new data must follow a standardized procedure with preferably uniform tools, ontologies etc. In this context data quality as well as data provenance also play an important role. In section 3 we discussed issues within one biobank as an integration project; now we are concerned with a set of heterogeneous biobanks as an integration project. Several different proposals exist for the handling of enquiries within the BBMRI project. Our approach for enquiries is to ascertain where the desired material or data is located. 
Subsequently the researchers can get in contact with that biorepository by themselves. This covers the first two use cases mentioned above.
Footnote 1: In our context a case is a set of jointly harvested samples of one donor.
Workflow for Enquiries. In figure 5 we have modeled a possible workflow for the identification of biobanks and cases, separated into different areas of responsibility. The most important participants within this workflow are the requestor (researcher), the requestor’s BBMRI host, other BBMRI hosts and biobanks. Hosts act as global coordinators within the federation. The registration of biobanks on BBMRI hosts takes place via a hub and spoke structure. In the first step of our workflow an authenticated researcher chooses a service request from a list of available services. Since a request for material or medical data can have different conditions, we suggest providing a list of possible request templates like:
– Biobanks with diseased samples (cancer)
– Biobanks with diseased samples (metabolic)
– Cases with behavioral progression of a specific kind of tumor
– Cases with commonalities of two or more tumors
– ...
After the selection of an appropriate service request the researcher can declare service-specific filter criteria to constrain the result according to the needs. Additionally, the researcher is able to specify a level of importance for each filter
Fig. 5. Workflow for identification of biobanks and cases separated into different responsibility parts
criteria. The level of importance is a value between 1 and 5, with 1 denoting the lowest and 5 the highest relevance. If nothing is specified, a filter criterion is given the default value 3 (relevant). The level of importance has direct effects on the output of the query. It is used for three major purposes:
– Specifying must-have values for the result. If the requestor defines the highest level of importance for an attribute, the query only returns databases that match exactly.
– Specifying nice-to-have values for the result. This feature relaxes query formulations in order to account for semi-structured data.
– Ranking the result to show the best matches at the topmost position. The ranking algorithm takes the resulting data of the query invocation process and sorts the output according to the predefined levels of importance.
Researchers formulate their requests using query by example. BBMRI then appears to act as one single system, performing query processing and disclosure of information from the participating biobanks transparently. Accordingly, the query formulated by the researcher is sent to the requestor’s national host as an XML document (see XML document below). Afterwards the national host (1) distributes the query to the other participating hosts in the federation using disclosure information, (2) queries its own meta data repository as well as the local databases of all registered biobanks, (3) applies a disclosure filter and (4) ranks the result. Each invoked BBMRI-Host in the federation performs the same procedure as the national host, but without distributing the incoming query. All distributed ranked query results are sent back to the requestor’s host and are merged there. Depending on how the policy is specified, the researcher gets a list of biobanks or individual cases of biobanks as the final result of the enquiry. Afterwards the researcher can get in contact with the desired biobanks. 
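The effect of the importance levels can be illustrated with a minimal sketch. It assumes that level 5 acts as a hard filter and that the remaining criteria contribute their importance to a score when they match; this additive scoring rule, the attribute names and the biobank names are assumptions made for illustration, since the text only states that results are sorted according to the levels of importance.

```python
from dataclasses import dataclass

MUST_HAVE = 5      # highest level: only exact matches are returned
DEFAULT_LEVEL = 3  # used when the requestor specifies nothing


@dataclass
class Criterion:
    attribute: str
    value: str
    importance: int = DEFAULT_LEVEL


def rank(candidates, criteria):
    """Drop candidates that miss a must-have criterion, then sort by score.

    The score (sum of the importance of all matching criteria) is an
    assumed ranking rule, not one taken from the text.
    """
    scored = []
    for cand in candidates:
        matches = [(c, cand.get(c.attribute) == c.value) for c in criteria]
        if any(c.importance == MUST_HAVE and not ok for c, ok in matches):
            continue  # a must-have value is missing
        score = sum(c.importance for c, ok in matches if ok)
        scored.append((score, cand["biobank"]))
    scored.sort(key=lambda pair: -pair[0])  # best matches first
    return [name for _, name in scored]


# Hypothetical filter criteria and candidate answers.
criteria = [
    Criterion("diagnosis", "C22", importance=5),     # must-have
    Criterion("sex", "Male"),                        # default level 3
    Criterion("material", "paraffin", importance=2), # nice-to-have
]
candidates = [
    {"biobank": "BB-a", "diagnosis": "C22", "sex": "Male", "material": "cryo"},
    {"biobank": "BB-b", "diagnosis": "C50", "sex": "Male", "material": "paraffin"},
    {"biobank": "BB-c", "diagnosis": "C22", "sex": "Male", "material": "paraffin"},
]
ranked = rank(candidates, criteria)
```

Here BB-b is removed because it misses the must-have diagnosis, and BB-c is ranked above BB-a because it also satisfies the nice-to-have material criterion.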
In case of an insufficient result set the researcher has the opportunity to constrain the result set or even to refine the query. In the following we discuss different scenarios for enquiries in a federation of biobanks. The scenarios differ in the
– kind of data accessed (only accessing the host’s meta-database or additionally accessing the local databases of registered biobanks)
– information contained in the result (list of biobanks and their contact data or individual anonymized cases from biobanks).
We assume that the meta-database stored on each host within the federation only contains the uploaded schema information of the registered biobanks, in order to avoid enormous data redundancies and rigidity in the system.
Scenario 1 - The Identification of Biobanks. For the identification of biobanks, enquiries from authorized researchers are performed only by searching the meta-databases of the federated hosts. Unfortunately this may lead to very coarse requests. One idea is to provide a small set of attributes in the
meta-database with the opportunity to specify a certain content, given that this content is an enumeration. A good candidate, for example, is the attribute "diagnose" standardized as ICD-Code (ICD-9, ICD-10, ICD-O-3) because it may be very useful to know which biobank(s) store information about specific diseases. Another good candidate may be "Sex" with the values "Female", "Male", "Unknown", etc. An important aspect of enquiries for the identification of biobanks is the possibility to specify an order of magnitude for the desired material.
Example enquiry 1: A researcher wants to know which biobanks store 10 samples with the following characteristics:
– male patients
– paraffin tissue
– with diagnose liver cancer
– including follow-up data (e.g. therapy description) from the oncology
XML Output for Example Enquiry 1 after definition of Filter Criteria <service s-id="1" name="Biobanks with diseased samples (cancer)">
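The XML listing above is truncated in this copy. As a rough illustration of what such a request document could carry, the following sketch serializes Example Enquiry 1 with Python's standard library. Only the service element with its s-id and name attributes comes from the text; the element and attribute names quantity, filter, attribute and importance, as well as the chosen importance levels, are assumptions and not the format used in BBMRI.

```python
import xml.etree.ElementTree as ET

# Root element as shown in the (truncated) original listing.
service = ET.Element(
    "service",
    {"s-id": "1", "name": "Biobanks with diseased samples (cancer)"},
)

# Hypothetical element: desired order of magnitude (10 samples).
quantity = ET.SubElement(service, "quantity")
quantity.text = "10"

# Hypothetical filter criteria: (attribute, value, level of importance 1-5).
filter_criteria = [
    ("Sex", "Male", 5),
    ("Material", "paraffin tissue", 3),
    ("Diagnose", "liver cancer", 5),
    ("FollowUpData", "oncology", 2),
]
for attribute, value, importance in filter_criteria:
    element = ET.SubElement(
        service,
        "filter",
        {"attribute": attribute, "importance": str(importance)},
    )
    element.text = value

xml_text = ET.tostring(service, encoding="unicode")
print(xml_text)
```

A receiving host could parse the document again with `ET.fromstring(xml_text)` and read the importance of each criterion from the corresponding attribute.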
Scenario 2 - The Identification of Cases. Since the meta-database located on the hosts does not store any case or donor related information it is necessary to additionally query the local databases. Take note that no information
of the local databases will be sent to the requestor except unique identifiers of the appropriate cases. Querying the local databases allows a more detailed search and therefore yields a more exact result list. A special case within the identification of cases is a slight variation in the result set: depending on a policy, the result set can also contain only a list of biobanks with their contact information, as discussed in scenario 1.
Example enquiry 2: A researcher requires the IDs of about 20 cases and their location (biobank) with the following characteristics:
– paraffin tissue
– with diagnose breast cancer
– staging T1 N2 M0
– from donors of age 40-50 years
– including follow-up data (e.g. therapy) from the oncology
There are two types of relationships between the samples: donor-related and case-related relationships. Donor-related means that two or more samples have been taken from the same donor. The samples may, however, have been taken in various medical contexts (different diseases, surgeries, etc.). In contrast, samples are case-related when the associated diagnoses belong to the same disease.
4.3 Data Sharing and Collaboration between Different Biobanks
A design goal in our discussions was an IT-infrastructure that can easily be adapted to different research needs. To this end we designed an environment for the collaboration between different biobanks within BBMRI as a hybrid of a peer-to-peer and a hub and spoke structure. In our approach a BBMRI-Host (figure 6) represents a domain hub in the IT-infrastructure and uses a meta structure to provide data sharing. Several domain hubs are connected via a peer-to-peer structure and communicate with each other via standardized and shared Communication Adapters. Each participating European biobank is connected with its specific domain hub, i.e. its BBMRI-Host, via a hub and spoke structure. Biobanks provide their obtainable attributes and contents as well as their contact data and biobank-specific information via the BBMRI Upload Service of the associated BBMRI-Host. A Mediator coordinates the interoperability issues between the BBMRI-Host and the associated biobanks. The information about uploaded data from each associated biobank is stored in the BBMRI Content-Meta-Structure. Permissions related to the uploaded data as well as contracts between a BBMRI-Host and a specific biobank are managed by the Disclosure Filter. A researcher can use the BBMRI Query Service, the entry point for requests to the federated system. It can be accessed via the local workbenches of connected biobanks as well as via the BBMRI Scientific Workbenches.
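The hybrid topology and the distribution of an enquiry can be sketched as follows. This is a simplified model under stated assumptions: the class Host, its methods, the host and biobank names and the scoring by number of matching declared attributes are all hypothetical, and the Mediator, Communication Adapters and Disclosure Filter are left out.

```python
class Host:
    """A BBMRI-Host acting as domain hub (hypothetical, simplified model)."""

    def __init__(self, name, biobanks):
        self.name = name
        # Spokes: registered biobanks and the attributes they uploaded.
        self.biobanks = biobanks
        # Peer-to-peer links to other domain hubs.
        self.peers = []

    def query_local(self, wanted):
        """Search only this host's own meta data; never forwards the query."""
        hits = []
        for biobank, attributes in self.biobanks.items():
            score = len(wanted & attributes)
            if score:
                hits.append((score, biobank))
        return hits

    def enquire(self, wanted):
        """Requestor's host: query locally, distribute once, merge and rank."""
        results = self.query_local(wanted)
        for peer in self.peers:
            # Invoked hosts answer without distributing the query further.
            results.extend(peer.query_local(wanted))
        results.sort(key=lambda hit: (-hit[0], hit[1]))
        return [biobank for _, biobank in results]


# Two hypothetical domain hubs with their registered biobanks.
host_1 = Host("host-1", {"BB-a": {"ICD-10", "Sex", "TNM"}})
host_2 = Host("host-2", {"BB-b": {"ICD-10"}, "BB-c": {"BMI"}})
host_1.peers.append(host_2)

merged = host_1.enquire({"ICD-10", "TNM"})
```

Because only the requestor's host forwards the enquiry, each peer is asked exactly once and the merged list is ranked in a single place.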
4.4 Data Model
Our proposed data models for the BBMRI Content-Meta-Structure (in figure 6) are able to hold a sufficient (complete) set of the needed information structures. The idea is that obtainable attributes (schema information) from the local databases of biobanks can be mapped to the BBMRI Content-Meta-Structure in order to provide a federated knowledge base for life-science research. To avoid data overload we designed a kind of lower-bound schema that contains attributes usually occurring in most or even all of the participating biobanks. Our approach was a hybrid solution of a federated system and an additional data warehouse that serves as a kind of index, primarily to reduce the query overhead. This decision led to the design of a class (named ContentInformation,
Fig. 6. Architecture of IT-infrastructure for BBMRI
figure 7) which contains attributes with different meanings, similar to online analytical processing (OLAP), including:
– Content-attributes. A small set of attributes which provide information about their content in the local database (cf. 4.2). All content-attributes must be provided by each participating biobank. The type of the attributes stored in the meta-dataset with content information must be an enumeration like ICD-Code, patient sex or BMI-category.
– Number of Cases (NoC). An order of magnitude for all available cases of a specific disease in combination with all content-attributes.
– Existence-attributes. These come in two different kinds:
1. Value as quantity. This kind of existence-attribute tells how many occurrences of a given attribute are available in a local database for a specific instance tuple of all content-attributes. The knowledge is represented by a numeric value greater than a defined k-value for an aggregated set of cases. Conditions on this kind of existence-attribute are OR-connected because the values are independent of each other and give no information on their inter-relationship.
2. Value as availability. With this kind of existence-attribute the values are not stored in an aggregated form as above, but as a bitmap (0 / 1), with 0 meaning not available and 1 available. As a consequence, each row in the relation contains one specific case, so AND-connected conditions on existence-attributes can be answered.
We compared two different approaches for the data model of the BBMRI Content-Meta-Structure, a static and a dynamic one.
The Static Approach. In comparison to an OLAP data cube, our class ContentInformation (see figure 7) acts as the fact table with the content-attributes as dimensions and the existence-attributes (including the Number of Cases) as measures. 
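The difference between the two kinds of existence-attributes can be made concrete with a small sketch. The rows, the attribute names staging and therapy and the k-value of 5 are invented for illustration and are not the data of figure 7.

```python
K = 5  # hypothetical k-value: counts must exceed K to count as a hit

# Value as quantity: one row per combination of content-attributes,
# existence-attributes hold aggregated counts.
aggregated = [
    {"icd": "C50.8", "sex": "Female", "noc": 120, "staging": 80, "therapy": 40},
    {"icd": "C50.8", "sex": "Male", "noc": 7, "staging": 6, "therapy": 0},
    {"icd": "C22.0", "sex": "Male", "noc": 30, "staging": 0, "therapy": 12},
]

# Value as availability: one row per case, existence-attributes are 0 / 1.
per_case = [
    {"icd": "C50.8", "sex": "Female", "staging": 1, "therapy": 1},
    {"icd": "C50.8", "sex": "Female", "staging": 1, "therapy": 0},
    {"icd": "C50.8", "sex": "Female", "staging": 0, "therapy": 1},
]


def rows_with_any(rows, icd, existence):
    """OR-connected: at least one requested count is greater than K."""
    return [
        row for row in rows
        if row["icd"] == icd and any(row[name] > K for name in existence)
    ]


def count_with_all(rows, icd, existence):
    """AND-connected: count cases for which every attribute is available."""
    return sum(
        1 for row in rows
        if row["icd"] == icd and all(row[name] == 1 for name in existence)
    )


or_hits = rows_with_any(aggregated, "C50.8", ["staging", "therapy"])
and_count = count_with_all(per_case, "C50.8", ["staging", "therapy"])
```

The aggregated row with 80 staging and 40 therapy entries cannot tell how many cases have both, which is why only OR-connected conditions are meaningful there, whereas the per-case bitmap answers the AND-connected question directly.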
We call this approach static because it enforces a data model with a common set of attributes on which all participating biobanks have to agree. Biobanks have the opportunity to register their schema information via an attribute catalogue and provide their content information as shown in figure 7. Within the static approach the splitting of the set of obtainable attributes from the participating biobanks into content-attributes and existence-attributes
Fig. 7. Example data of class ContentInformation as static approach
makes data analysis more efficient because queries do not get too complex. OLAP operations like roll-up and drill-down are also enabled. A serious handicap is the lack of flexibility to store information of biobanks that work in different areas, since all biobanks must use the same content information.
The Dynamic Approach. The data model for the BBMRI Content-Meta-Structure described above works on a statically defined set of attributes coupled as ContentInformation. Thus, all participating biobanks have the same ContentInformation. However, a metabolic-based biobank does not necessarily need or have a TNM classification, and conversely a cancer-based biobank does not necessarily need or have metabolic-specific attributes. For this reason the dynamic data model generates the ContentInformation dynamically (figure 9): each biobank first declares the attributes it stores in its local databases and, in particular, which of them are content-attributes and which are existence-attributes. This declaration affects requests for material and must therefore be made with care.
– Requests on existence of attributes. For a request on the existence of several attributes it does not matter whether the requested attributes are declared as content-attribute or existence-attribute. The only condition for a query-hit is that the requested attributes are declared by a biobank.
– Requests on content of attributes. For a request on the content of several attributes, all requested attributes must be declared as content-attribute by a biobank in order to get a query-hit. E.g. a request on female patients who suffer from C50.8 (breast cancer) would not get a query-hit from BB-y (figure 8) because the attribute PatientSex is declared as existence-attribute and thus has no information about its content.
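The two request types can be sketched with a simple declaration table. The declaration of PatientSex as existence-attribute by BB-y follows the example in the text; all other declarations, the biobank BB-x and the function names are assumptions, since the content of figure 8 is not reproduced here.

```python
# Per biobank: declared attribute -> "content" or "existence".
declarations = {
    "BB-x": {"ICD-Code": "content", "PatientSex": "content", "TNM": "existence"},
    "BB-y": {"ICD-Code": "content", "PatientSex": "existence"},
}


def hits_for_existence_request(decls, attributes):
    """Query-hit if every requested attribute is declared at all."""
    return sorted(
        biobank for biobank, declared in decls.items()
        if all(name in declared for name in attributes)
    )


def hits_for_content_request(decls, attributes):
    """Query-hit only if every requested attribute is a content-attribute."""
    return sorted(
        biobank for biobank, declared in decls.items()
        if all(declared.get(name) == "content" for name in attributes)
    )


# Request on content: female patients with ICD-Code C50.8.
content_hits = hits_for_content_request(declarations, ["PatientSex", "ICD-Code"])
# Request on existence of the same attributes.
existence_hits = hits_for_existence_request(declarations, ["PatientSex", "ICD-Code"])
```

BB-y appears only in the existence request, because its PatientSex declaration carries no content information.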
Fig. 8. Explicit declaration of attributes in the dynamic approach
Fig. 9. Dynamic generation for content information of BBMRI content meta structure
With the dynamic data model it is possible to support different kinds of content information depending on the needs of the biobanks. Moreover, introducing a new attribute does not lead to changes in the database schema. In table 1 we compare the static data model with the dynamic data model.
Table 1. Comparison between static and dynamic approach
Approach   Flexibility in maintenance   Simplicity in query   Anonymity issues
static   + +
dynamic   + +
4.5 Disclosure Filter
The disclosure filter is a software component that helps the BBMRI-Hosts to answer the following question: Who is allowed to receive what from whom under which circumstances? For example, since it is planned to provide information exchange across national borders, the disclosure filter has to ensure that no data (physical or electronic) leaves the country illegally. The disclosure filter takes into account laws, contracts between participants, policies of participants and even rulings (e.g. by courts or ethics boards). The disclosure filter on a BBMRI-Host plays three different roles:
1. Provider host and local biobank remove items from query answers that are not supposed to be seen by requestors.
2. Technical optimization: the query system optimizes query processing using disclosure information.
3. Requestor host removes providers which do not provide sufficient information to the requestor. This role can be switched on / off.
The disclosure filter plays a central role in the workflow (figure 5) for use cases 1 and 2 as well as in the architecture (figure 6). Depending on the role of the disclosure filter, its location within the workflow can change. During requirements analysis the possibility to switch the disclosure filter off on demand turned out to be an important feature. With this feature it is possible to operate a more relaxed system.
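The provider role and the switchable requestor role can be sketched as two small filter functions. The rule model below, which reduces "who, what, from whom" to a triple of provider, requestor country and attribute, is a strong simplification and an assumption; laws, contracts and rulings are not modelled, and all names and values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    provider: str           # from whom
    requestor_country: str  # who (simplified to a country code)
    attribute: str          # what


def provider_filter(answer, provider, requestor_country, allowed):
    """Provider role: keep only the items the requestor is allowed to see."""
    return {
        name: value for name, value in answer.items()
        if Rule(provider, requestor_country, name) in allowed
    }


def requestor_filter(answers, required, enabled=True):
    """Requestor role: drop providers with insufficient information.

    This role can be switched off to operate a more relaxed system.
    """
    if not enabled:
        return answers
    return {
        provider: answer for provider, answer in answers.items()
        if required.issubset(answer)
    }


allowed = {
    Rule("BB-a", "AT", "contact"),
    Rule("BB-a", "AT", "noc"),
    Rule("BB-b", "AT", "contact"),
}
raw_answers = {
    "BB-a": {"contact": "a@example.org", "noc": 42, "donor_id": "x1"},
    "BB-b": {"contact": "b@example.org", "noc": 7},
}

filtered = {
    provider: provider_filter(answer, provider, "AT", allowed)
    for provider, answer in raw_answers.items()
}
strict = requestor_filter(filtered, {"contact", "noc"})
relaxed = requestor_filter(filtered, {"contact", "noc"}, enabled=False)
```

With the requestor role switched on, BB-b is dropped because it may not disclose its number of cases; with the role switched off, it stays in the result with its contact data only.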
5 Working with Biobanks
In this chapter, we point out some application areas of IT infrastructures in the context of biobanks. We mainly focus on the support of medical research, since there is a strong demand for assisting, documenting and interconnecting research activities. Additionally, these activities are tightly coupled with the data management of a biobank, providing research results for samples and thereby enhancing the scientific value of samples.
5.1 Computer Supported Collaborative System (CSCW) for Medical Research
Medical research is a collaborative process in an interdisciplinary environment that may be effectively supported by a CSCW system. Such a system must meet specific requirements in order to allow flexible integration of data, analysis services and communication mechanisms. Persons with different expertise and access rights cooperate in mutually influencing contexts (e.g. clinical studies, research cooperations). Thus, appropriate virtual environments are needed to facilitate context-aware communication, deployment of biomedical tools as well as data and knowledge sharing. In cooperation with the University of Paderborn we were able to build a CSCW system that meets our demands on the flexible service-oriented architecture Wasabi, a reimplementation of Open sTeam (www.open-steam.org), which is widely used in research projects to share data and knowledge and to cooperate in virtual knowledge spaces. We use Wasabi as a middleware integrating distributed data sources and biomedical services. We systematically elaborated the main requirements of a medical CSCW system and designed a conceptual model as well as an architectural proposal satisfying our demands [42,45]. Finally we implemented a virtual workbench to support collaboration activities in medical research and routine work. This workbench had to fulfill several important requirements, in particular:
– R(1) User and Role Management. The CSCW has to be able to cope with the organisational structure of the institutes and research groups of the hospital. Data protection directives have to fit in the access right model of
the system. However, the model has to be flexible enough to allow the creation of new research teams and information sharing across organisational borders.
– R(2) Transparency of physical Storage. Although data may be stored in distributed locations, data retrieval and data storage should depend solely on access rights, irrespective of the physical location. That is, the complexity of data structures is hidden from the end user. The CSCW system has to offer appropriate search, join and transformation mechanisms.
– R(3) Flexible Data Presentation. Since data is accessed by persons with different scientific backgrounds (biological, medical, technical expertise) in order to support a variety of research and routine activities, flexible capabilities to contextualise data are required. Collaborative groups should be able to create on-demand views and perspectives, and to annotate and change data in their contexts without interfering with other contexts.
– R(4) Flexible Integration and Composition of Services. A multitude of data processing and data analysis tools exist in the biomedical context. Some tools act as complementary parts in a chain of processing steps. For example, to detect genes correlated with a disease, gene expression profiles are created by measuring and quantifying gene activities. The resulting gene expression ratios are normalised and candidate genes are preselected. Finally, significance analysis is applied to identify relevant genes [49]. Each function may be provided by a separate tool, for example by Genespring® and Genesis® [9,47]. In some cases tools provide equal functionality and may be chosen as alternatives. Through flexible integration of tools as services with standardised input and output interfaces a dynamic composition of tools may be accomplished. From the system's perspective services are technology neutral, loosely coupled and support location transparency [36]. The execution of services is not limited to proprietary operating systems, and service callers do not know the internal structure of a service. Further, services may be physically distributed over departments and institutes, e.g. image scanning and processing is executed in a separate laboratory where the gene expression slides reside.
– R(5) Support of cooperative Functions. In order to support collaborative work suitable mechanisms have to be supplied. One of the main aspects is common data annotation: data is augmented and shared within a group, and new content is created cooperatively. Web 2.0 technologies like wikis and blogs provide a flexible framework for facilitating intra- and inter-group activities.
– R(6) Data-coupled Communication Mechanisms. Cooperative work is tightly coupled with extensive information exchange. Appropriate communication mechanisms are useful to coordinate project activities, organise meetings and enable topic-related discussions. On the one hand, a seamless integration of email exchange, instant messaging and VoIP tools facilitates communication activities. We propose to reuse the organisational data defined in R(1) within the communication tools. On the other hand, persons should be able to include data objects in their communication acts. For example, images of diseased tissues may be diagnosed cooperatively, with marking and annotating of image sections supporting the decision-making process.
– R(7) Knowledge Creation and Knowledge Processing. Cooperative medical activities frequently comprise the creation of new knowledge. Data sources are linked with each other, similarities and differences are detected, and involved factors are identified. Consider a set of genes that is assumed to be strongly correlated with the genesis of a specific cancer subtype. If the hypothesis is verified the information may be reused in subsequent research. Thus, methods to formalise knowledge, share it in arbitrary contexts and deduce new knowledge are required.
Figure 10 illustrates the concept of virtual knowledge spaces. The main idea is to contextualize documents, data, services and annotations in virtual knowledge spaces and make them accessible for cooperating individuals. Rooms may be linked to each other or nested in each other. Communication is tightly coupled to the shared resources, enabling discussion, quality control and collaborative annotations.
Fig. 10. Wasabi virtual room
5.2 Workflow for Gene Expression Analysis for the Breast Cancer Project
A detailed breast cancer data set was annotated at the Pathology Graz. In this context much emphasis is put on detecting deviations in the behaviour of gene groups. We support the entire analysis workflow by supplying an IT research platform that allows researchers to select and group patients arbitrarily, preprocess and link the related gene expressions and finally run state-of-the-art analysis algorithms. We developed an appropriate database structure with import/export methods for managing arbitrary medical data sets and gene expressions. We also implemented web service interfaces to various gene expression analysis algorithms. Currently, we are able to support the following steps of the research workflow:
(1) Case Selection: In the first step relevant medical cases are selected. The set of available breast cancer cases with associated cryo tissue samples is selected by querying the SampleDB. Since only cases with follow-up data from the oncology
Fig. 11. Workflow for the support of gene expression analysis
are included in the project, those cases are filtered. Further filter criteria are metastasis and therapeutic documentation. Case selection is a composed activity, as two separate databases of two different institutes (pathology and oncology) are accessed, filtered and joined. After selecting the breast cancer cases, gene expression profiles may be created.
Output description: Set of medical cases.
Output type: File or list of appropriate medical cases identified by unique keys like patient ID.
(2) Normalization of Gene Expression Profiles: A set of GPR files is defined as input source and the preferred normalisation method is applied. We use the normalisation methods offered by the Bioconductor library of the R project. The result of the normalisation is stored for further processing.
Output description: Normalised gene expression matrix.
Output type: The result of the normalisation is a matrix where rows correspond to genes and columns to medical cases. The matrix may be stored as file or table.
(3) Gene Annotation: In order to link genes with other resources, unique gene identifiers are required (e.g. Ensembl gene ID, RefSeq). Therefore, we integrated mapping data supplied by the Operon chip producer.
Output description: Annotated gene expression matrix.
Output type: Gene expression matrix with chosen gene identifiers. The matrix may be stored as file or table.
Information Systems for Federated Biobanks
(4) Link Gene Ontologies: We use gene ontologies (www.geneontology.org) in order to map single genes to functional groups. To this end, we imported the most recent gene ontologies into our database. Alternatively, we also plan to integrate pathway data from the KEGG database (www.genome.jp/kegg/) to allow grouping of genes into functional groups.
Output description: Mapping from gene groups (gene ontologies, KEGG functional groups) to single genes.
Output type: List of mappings, where each mapping assigns a group to a list of single genes.
(5) Link Annotated Patient Data: Each biological sample corresponds to a medical case and has to be linkable to the gene expression matrix. A file containing medical parameters is imported; these parameters may be used to define groups of interest for the analysis.
Output description: A table storing all medical parameters for all cases.
Output type: A database table is created that links medical parameters to cases of the annotated gene expression matrix.
(6) Group Samples: A hypothesis is formulated by defining groups of medical cases that are compared in the analysis. The subsequent analysis tries to detect significant differences in gene groups between the medical groups.
Output description: The medical cases are grouped according to the chosen medical parameters.
Output type: A list of mappings, where each mapping assigns a unique case identifier to a group identifier.
(7) Analysis: We implemented web service interfaces to the Bioconductor packages ’Global Test’ [25] and ’Global Ancova’ [29]. We use the selected medical parameters for sample grouping and the GO categories for gene grouping, together with the gene expression matrix, as input parameters for both algorithms. After the analysis is finished, the results are written into an analysis database and exported as Excel files. We also plan to integrate an additional analysis tool called Matisse from Tel Aviv University [50].
Output description: A list of significant gene groups. The number of returned gene groups may be customized; for instance, only the top 10 significant gene groups are returned.
Output type: A list of significant gene groups, together with a textual description of the group and its p-value.
(8) Plotting: Gene plots may be created, visualizing the influence of single genes in a significant gene group. The plots are created using Bioconductor libraries which are encapsulated in a web service.
Output description: Gene plots of significant gene groups.
Output type: PNG image files that may be downloaded and saved into an analysis database.
We are able to show that a service-oriented CSCW system provides the functionality to build a workbench for medical research supporting the collaboration
of researchers, allowing the definition of workflows and gathering all necessary data for maintaining provenance information.
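The data flow of steps (1), (4), (6) and (7) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layouts, identifiers, group names and function names are hypothetical and do not reflect the actual SampleDB schema or the authors' web services, and the p-values passed to the last function would in practice come from the Bioconductor analysis.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the pathology and oncology databases.
pathology_cases = [
    {"patient_id": "P1", "cryo_sample": True},
    {"patient_id": "P2", "cryo_sample": False},
    {"patient_id": "P3", "cryo_sample": True},
    {"patient_id": "P5", "cryo_sample": True},
]
oncology_followup = {
    "P1": {"metastasis": True},
    "P2": {"metastasis": False},
    "P3": {"metastasis": False},
}

# Hypothetical gene groups (step 4) and expression matrix (steps 2-3):
# rows are genes, columns are medical cases.
gene_groups = {"group_A": ["gene1", "gene2"], "group_B": ["gene3"]}
expression = {
    "gene1": {"P1": 2.1, "P3": 0.4},
    "gene2": {"P1": 1.7, "P3": 0.9},
    "gene3": {"P1": 0.2, "P3": 0.3},
}


def select_cases(pathology: List[dict], oncology: Dict[str, dict]) -> List[str]:
    """Step (1): join both sources on patient ID and apply the filters."""
    return [
        case["patient_id"]
        for case in pathology
        if case["cryo_sample"] and case["patient_id"] in oncology
    ]


def group_samples(case_ids: List[str], oncology: Dict[str, dict],
                  parameter: str) -> Dict[str, str]:
    """Step (6): map each case identifier to a group identifier."""
    return {pid: f"{parameter}={oncology[pid][parameter]}" for pid in case_ids}


def submatrix(group: str, case_ids: List[str]) -> Dict[str, List[float]]:
    """Rows of the expression matrix for one gene group, selected cases only."""
    return {g: [expression[g][c] for c in case_ids] for g in gene_groups[group]}


def top_gene_groups(p_values: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Step (7) output: the n gene groups with the smallest p-values."""
    return sorted(p_values.items(), key=lambda item: item[1])[:n]


cases = select_cases(pathology_cases, oncology_followup)
groups = group_samples(cases, oncology_followup, "metastasis")
print(cases, groups, submatrix("group_A", cases))
```

Keeping every intermediate result keyed by the case or gene identifier mirrors the output types listed for each step, so that later steps can be linked back to earlier ones.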
6 Data Privacy and Anonymization
When releasing patient-specific data (e.g. in medical research cooperations), privacy protection has to be guaranteed for ethical and legal reasons. Even when immediately identifying attributes like name, address or date of birth are eliminated, other attributes (quasi-identifying attributes) may be used to link the released data with external data to re-identify individuals. In recent research, much effort has been put into privacy-preserving and anonymization methods. In this context, k-anonymity [48] was introduced, which protects sensitive data by ensuring that every record has a sufficient number of data twins: each combination of quasi-identifying attribute values is shared by at least k records. These data twins prevent sensitive data from being linked to individuals. K-anonymity may be accomplished by:
– transforming attribute values to more general values: nominal and categorical attributes may be transformed by taxonomy trees or user-defined generalization hierarchies
– mapping numerical attributes to intervals (for instance, age 45 may be transformed to age interval 40-50)
– replacing a value with a less specific but semantically consistent value (e.g. replace numeric (continuous) data for blood pressure with categorical data like ’high’ blood pressure)
– combining several attributes, making them more coarse-grained (e.g. replace height and weight with BMI)
– fragmentation of the attribute vector
– data blocking (i.e. replacing certain attributes of some data items with a null value)
– dropping a sample from the selection set
– dropping an attribute
For a given data set, several k-anonymous anonymizations may be created, depending on how attributes are generalized. Transformations of attribute values are always accompanied by an information loss, which may be used as a quality criterion for an anonymization. That is, an optimal anonymization may be defined as the k-anonymous anonymization with the minimal information loss. Information, the value of information and the significance of information loss are in the eye of the beholder, i.e. they depend on the requirements of the intended analysis. Only the purpose can tell which of the generalizations is better suited and gives more accurate results. Therefore, we developed a tool called Open anonymizer (see https://sourceforge.net/projects/openanonymizer), which is based on individual attribution of information loss. We implemented the anonymization algorithm as a Java web application that may be deployed on a web application server and accessed by a web browser. Open anonymizer is a highly customizable anonymization tool providing the best anonymization for a certain
context. The anonymization process is strongly influenced by the data quality requirements of users. We allow users to specify the importance of attributes as well as transformation limits for attributes. These parameters are considered in the anonymization process, which delivers a solution that is guaranteed to fulfil the user requirements and has a minimal information loss. Open anonymizer provides a wizard-based, intuitive user interface which guides the user through the anonymization process. Instead of anonymizing the entire data set of a data repository, a simple query interface allows users to extract relevant subsets of data to be anonymized. For instance, in a biomedical context, diagnoses of a certain carcinoma type may be selected, anonymized and released without considering the rest of the diagnoses.
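To make the k-anonymity criterion and the idea of user-weighted information loss concrete, the following minimal Python sketch generalizes ages to intervals, checks k-anonymity over a set of quasi-identifiers, and computes a simple weighted loss. It is not the algorithm implemented in Open anonymizer: the function names, the sample records, the per-attribute loss levels (0 for unchanged, 1 for fully suppressed) and the weighting scheme are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def generalize_age(age: int, width: int = 10) -> str:
    """Map a numerical age to an interval, e.g. 45 -> '40-50'."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def is_k_anonymous(records: List[dict], quasi_identifiers: Sequence[str],
                   k: int) -> bool:
    """True if every quasi-identifier combination occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


def weighted_loss(levels: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical loss measure: per-attribute loss in [0, 1], weighted by
    the importance the user assigns to each attribute, then normalized."""
    total = sum(weights[a] for a in levels)
    return sum(weights[a] * levels[a] for a in levels) / total


# Hypothetical sample data; 'diagnosis' is the sensitive attribute.
raw = [
    {"age": 45, "zip": "8010", "diagnosis": "A"},
    {"age": 47, "zip": "8010", "diagnosis": "B"},
    {"age": 52, "zip": "8020", "diagnosis": "A"},
    {"age": 58, "zip": "8020", "diagnosis": "C"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]

print(is_k_anonymous(raw, ["age", "zip"], k=2))          # exact ages are unique
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # intervals create twins
print(weighted_loss({"age": 0.5, "zip": 0.0}, {"age": 1.0, "zip": 3.0}))
```

In this sketch a user who cares more about the postal code than about age assigns it a larger weight, so an anonymization that generalizes age instead of the postal code receives a lower loss. This is one simple way to express that the preferred generalization depends on the intended analysis.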
7 Conclusion
Biobanks are challenging application areas for advanced information technology. The foremost challenges for the information system support in a network of biobanks as envisioned in the BBMRI project are the following:
– Partiality: A biobank is intended to be one node in a federation (cooperative network) of biobanks. It needs the descriptive capabilities to be useful for other nodes in the network and it needs the capability to make use of other biobanks. This needs careful design of metadata about the contents of the biobank, the acceptance and interoperability of heterogeneous partner resources. On the other hand, a biobank will rely on data generated and maintained in other systems (other centres, hospital information systems, etc.).
– Auditability: Managing the provenance of data will be essential for advanced biomedical studies. Documenting the origins and the quality of data and specimens, documenting the sources used for studies and the methods and tools and results of studies is essential for the reproducibility of results.
– Longevity: A biobank is intended to be a long-lasting research infrastructure and thus many changes will occur during its lifetime: new diagnostic codes, new therapies, new analytical methods, new legal regulations, and new IT standards. The biobank needs to be ready to incorporate such changes and to be able to make best use of already collected data in spite of such changes.
– Confidentiality: A biobank stores or links to patient-related data. Personal data and genomic data are considered highly sensitive in many countries. The IT infrastructure must on the one hand provide means to protect the confidentiality of protected data and on the other enable the best possible use of data for studies respecting confidentiality constraints.
We presented biobanks and discussed the requirements for biobank information systems.
We have shown that many different research areas within the Databases and Information Systems field contribute to this endeavor. We were able to show only some examples: advanced information modeling, (semantic) interoperability, federated databases, approximate query answering, result ranking, computer supported cooperative work (CSCW), and security and privacy. Some well
known solutions from different application areas have to be revisited given the size, heterogeneity, diversity, dynamics, and complexity of data to be organized in biobanks.
References
1. Biobankcentral, http://www.biobankcentral.org
2. Biobanking and biomolecular resources research infrastructure (bbmri), http://www.bbmri.eu
3. Cabig - cancer biomedical informatics grid, https://cabig.nci.nih.gov
4. Geneontology, http://www.geneontology.org
5. Uk-biobank, http://www.ukbiobank.ac.uk
6. Who: International statistical classification of diseases and related health problems. 10th revision version for 2007 (2007)
7. Who: International classification of diseases for oncology, 3rd edn., icd-o-3 (2000)
8. Organisation for economic cooperation and development: Biological resource centres: Underpinning the future of life sciences and biotechnology (2001)
9. Genespring: Cutting-edge tools for expression analysis (2005), http://www.silicongenetics.com
10. Nih guide: Informed consent in research involving human participants (2006)
11. Bbmri: Construction of new infrastructures - preparatory phase. INFRA–2007–2.2.1.16: European Bio-Banking and Biomolecular Resources (April 2007)
12. Organisation for economic cooperation and development. best practice guidelines for biological resource centres (2007)
13. Uk biobank: Protocol for a large-scale prospective epidemiological resource. Protocol No: UKBB-PROT-09-06 (March 2007)
14. Altintas, I., Barney, O., Jaeger-Frank, E.: Provenance collection support in the kepler scientific workflow system, pp. 118–132 (2006)
15. Ambrosone, C.B., Nesline, M.K., Davis, W.: Establishing a cancer center data bank and biorepository for multidisciplinary research. Cancer epidemiology, biomarkers & prevention: a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology 15(9), 1575–1577 (2006)
16. Asslaber, M., Abuja, P., Stark, K., Eder, J., Gottweis, H., Trauner, M., Samonigg, H., Mischinger, H., Schippinger, W., Berghold, A., Denk, H., Zatloukal, K.: The genome austria tissue bank (gatib). Pathobiology 2007 74, 251–258 (2007)
17. Asslaber, M., Zatloukal, K.: Biobanks: transnational, european and global networks. Briefings in functional genomics & proteomics 6(3), 193–201 (2007)
18. Cambon-Thomsen, A.: The social and ethical issues of post-genomic human biobanks. Nat. Rev. Genet. 5(11), 866–873 (2004)
19. Chamoni, P., Stock, S.: Temporal structures in data warehousing. In: Mohania, M., Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp. 353–358. Springer, Heidelberg (1999)
20. Eder, J., Dabringer, C., Schicho, M., Stark, K.: Data management for federated biobanks. In: Proc. DEXA 2009 (2009)
21. Eder, J., Koncilia, C.: Changes of dimension data in temporal data warehouses. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 284–293. Springer, Heidelberg (2001)
22. Eder, J., Koncilia, C., Morzy, T.: The comet metamodel for temporal data warehouses. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (eds.) CAiSE 2002. LNCS, vol. 2348, pp. 83–99. Springer, Heidelberg (2002)
23. Elliott, P., Peakman, T.C.: The uk biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. International Journal of Epidemiology 37(2), 234–244 (2008)
24. Foster, I., Vöckler, J., Wilde, M., Zhao, Y.: Chimera: A virtual data system for representing, querying, and automating data derivation. In: Proceedings of the 14th Conference on Scientific and Statistical Database Management, pp. 37–46 (2002)
25. Goeman, J.J., van de Geer, S.A., de Kort, F., van Houwelingen, H.C.: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20(1), 93–99 (2004)
26. Goos, G., Hartmanis, J., Sripada, S., Leeuwen, J.V., Jajodia, S.: Temporal Databases: Research and Practice. Springer, New York (1998)
27. Gottweis, H., Zatloukal, K.: Biobank governance: Trends and perspectives. Pathobiology 2007 74, 206–211 (2007)
28. Hewitt, S.: Design, construction, and use of tissue microarrays. Protein Arrays: Methods and Protocols 264, 61–72 (2004)
29. Hummel, M., Meister, R., Mansmann, U.: Globalancova: exploration and assessment of gene group effects. Bioinformatics 24(1), 78–85 (2008)
30. Isabelle, M., Teodorovic, I., Morente, M., Jaminé, D., Passioukov, A., Lejeune, S., Therasse, P., Dinjens, W., Oosterhuis, J., Lam, K., Oomen, M., Spatz, A., Ratcliffe, C., Knox, K., Mager, R., Kerr, D., Pezzella, F.: Tubafrost 5: multifunctional central database application for a european tumor bank. Eur. J. Cancer 42(18), 3103–3109 (2006)
31. Kim, S.: Development of a human biorepository information system at the university of kentucky markey cancer center. In: International Conference on BioMedical Engineering and Informatics, vol. 1, pp. 621–625 (2008)
32. Litwin, W., Mark, L., Roussopoulos, N.: Interoperability of multiple autonomous databases. ACM Comput. Surv. 22(3), 267–293 (1990)
33. Louie, B., Mork, P., Martin-Sanchez, F., Halevy, A., Tarczy-Hornoch, P.: Methodological review: Data integration and genomic medicine. J. of Biomedical Informatics 40(1), 5–16 (2007)
34. Missier, P., Belhajjame, K., Zhao, J., Roos, M., Goble, C.: Data lineage model for taverna workflows with lightweight annotation requirements. In: Freire, J., Koop, D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 17–30. Springer, Heidelberg (2008)
35. Muilu, J., Peltonen, L., Litton, J.: The federated database - a basis for biobank-based post-genome studies, integrating phenome and genome data from 600 000 twin pairs in europe. European Journal of Human Genetics 15, 718–723 (2007)
36. Papazoglou, M.P.: Service-oriented computing: concepts, characteristics and directions. In: Proceedings of the Fourth International Conference on Web Information Systems Engineering, WISE 2003, pp. 3–12 (2003)
37. Ram, S., Liu, J.: A semiotics framework for analyzing data provenance research. Journal of computing Science and Engineering 2(3), 221–248 (2008)
38. Rebulla, P., Lecchi, L., Giovanelli, S., Butti, B., Salvaterra, E.: Biobanking in the year 2007. Transfusion Medicine and Hemotherapy 34, 286–292 (2007)
39. Riegman, P., Morente, M., Betsou, F., de Blasio, P., Geary, P.: Biobanking for better healthcare. In: The Marble Arch International Working Group on Biobanking for Biomedical Research (2008)
40. Saltz, J., Oster, S., Hastings, S., Langella, S., Kurc, T., Sanchez, W., Kher, M., Manisundaram, A., Shanbhag, K., Covitz, P.: Cagrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 22(15), 1910–1916 (2006)
41. Schroeder, C.: Vernetzte gewebesammlungen fuer die forschung crip. Laborwelt 5, 26–27 (2007)
42. Schulte, J., Hampel, T., Stark, K., Eder, J., Schikuta, E.: Towards the next generation of service-oriented flexible collaborative systems – a basic framework applied to medical research. In: Cordeiro, J., Filipe, J. (eds.) ICEIS 2008 - Proceedings of the Tenth International Conference on Enterprise Information Systems, number 978-989-8111-36-4, Barcelona, Spain, June 2008, pp. 232–239 (2008)
43. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22(3), 183–236 (1990)
44. Simmhan, Y.L., Plale, B., Gannon, D.: A framework for collecting provenance in data-centric scientific workflows. In: ICWS 2006: Proceedings of the IEEE International Conference on Web Services, Washington, DC, USA, pp. 427–436. IEEE Computer Society, Los Alamitos (2006)
45. Stark, K., Schulte, J., Hampel, T., Schikuta, E., Zatloukal, K., Eder, J.: GATiB-CSCW, medical research supported by a service-oriented collaborative system. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 148–162. Springer, Heidelberg (2008)
46. Stevens, R., Zhao, J., Goble, C.: Using provenance to manage knowledge of in silico experiments. Briefings in bioinformatics 8(3), 183–194 (2007)
47. Sturn, A., Quackenbush, J., Trajanoski, Z.: Genesis: cluster analysis of microarray data. Bioinformatics 18(1), 207–208 (2002)
48. Sweeney, L., Samarati, P.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of the IEEE Symposium on Research in Security and Privacy (1998)
49. Tusher, V.G., Tibshirani, R., Chu, G.: Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U S A 98(9), 5116–5121 (2001)
50. Ulitsky, I., Shamir, R.: Identification of functional modules using network topology and high-throughput data. BMC Systems Biology 1(1) (2007)
51. Yang, J.: Temporal Data Warehousing. Stanford University (2001)
52. Zatloukal, K., Yuille, M.: Information on the proposal for european research infrastructure. In: European Bio-Banking and Biomolecular Resources (2007)
Exploring Trust, Security and Privacy in Digital Business
Simone Fischer-Hübner1, Steven Furnell2, and Costas Lambrinoudakis3,*
1 Department of Computer Science, Karlstad University, Karlstad, Sweden [email protected]
2 School of Computing & Mathematics, University of Plymouth, Plymouth, United Kingdom [email protected]
3 Department of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece [email protected]
Abstract. Security and privacy are widely held to be fundamental requirements for establishing trust in digital business. This paper examines the relationship between these factors, and the different strategies that may be needed in order to provide an adequate foundation for users’ trust. The discussion begins by recognising that users often lack confidence that sufficient security and privacy safeguards can be delivered from a technology perspective, and therefore require more than a simple assurance that they are protected. One contribution in this respect is the provision of a Trust Evaluation Function, which supports the user in reaching more informed decisions about the safeguards provided in different contexts. Even then, however, some users will not be satisfied with technology-based assurances, and the paper consequently considers the extent to which risk mitigation can be offered via routes such as insurance. The discussion concludes by highlighting a series of further open issues that also require attention in order for trust to be more firmly and widely established.
Keywords: Trust, Security, Privacy, Digital Business.
1 Introduction
The evolution in the way information and communication systems are currently utilised and the widespread use of web-based digital services drive the transformation of modern communities into modern information societies. Nowadays, personal data are available and/or can be collected at different sites around the world. Even though the utilisation of personal information leads to several advantages, including improved customer services, increased revenues and lower business costs, it can be misused in several ways and may lead to violation of privacy. For instance, in the framework of
* Authors are listed in alphabetical order.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 191–210, 2009. © Springer-Verlag Berlin Heidelberg 2009
S. Fischer-Hübner, S. Furnell, and C. Lambrinoudakis
e-commerce, several organisations develop new methods for collecting and processing personal data in order to identify the preferences of their customers and adapt their products accordingly. Modern data mining techniques can then be utilised to further process the collected data, generating databases of consumer profiles through which each person’s preferences can be uniquely identified. Such information can therefore be used to invade users’ privacy, thereby violating European Union Directive 95/46/EC on the protection of individuals with regard to the processing of personal and sensitive data. In order to avoid confusion, it is important to stress the difference between privacy and security: a piece of information is secure when its content is protected, whereas it is private when the identity of its owner is protected. Irrespective of the application domain (e.g. e-commerce, e-health), the major reservation of users about using the Internet is due to the lack of privacy rather than to cost, difficulties in using the service or undesirable marketing messages. Considering that conventional security mechanisms, like encryption, cannot ensure privacy protection (encryption, for instance, can only protect the confidentiality of a message), new Privacy-Enhancing Technologies (PETs) have been developed. However, the sole use of technological countermeasures is not enough. For instance, even if a company that collects personal data stores them in an ultra-secure facility, the company may at any point in time decide to sell or otherwise disseminate the data, thus violating the privacy of the individuals involved. Therefore, security and privacy are intricately related. Privacy, as an expression of human dignity, is considered a core value in democratic societies and is recognized either explicitly or implicitly as a fundamental human right by most constitutions of democratic societies.
Today, in many legal systems, privacy is in fact defined as the right to informational self-determination, i.e. the right of individuals to determine for themselves when, how, to what extent and for what purposes information about them is communicated to others. To reinforce their right to informational self-determination, users need technical tools that allow them to manage their (partial) identities and to control what personal data about them is revealed to others and under which conditions. Identity Management (IDM) can be defined to subsume all functionality that supports the use of multiple identities, by the identity owners (user-side IDM) and by those parties with whom the owners interact (services-side IDM). According to Pfitzmann and Hansen, identity management means managing various partial identities (i.e. sets of attributes, usually denoted by pseudonyms) of a person, i.e. administration of identity attributes including the development and choice of the partial identity and pseudonym to be (re-)used in a specific context or role (Pfitzmann and Hansen 2008). Privacy-enhancing identity management technology enforcing the legal privacy principles of data minimisation, purpose binding and transparency has been developed within the EU FP6 project PRIME (Privacy and Identity Management for Europe, https://www.prime-project.eu/) and the EU FP7 project PrimeLife (Privacy and Identity Management for Life, http://www.primelife.eu/). Trust has been playing an important role in PRIME and PrimeLife, because users not only need to trust their own platforms (i.e. the user-side IDM) to manage their data accordingly, but also need to trust that the services sides process their data in a privacy-friendly and secure manner and according to the business agreements with the users.
In considering the forms of protection that are needed, it is important to recognise that user actions will often be based upon their perceptions of risk, which may not always align very precisely with the reality of the situation. For example, they may under- or over-estimate the extent of the threats facing them, or be under- or over-assured by the presence of technical safeguards. Some people simply need to be told that a service is secure in order to use it with confidence, while others will only be reassured by seeing an abundance of explicit safeguards in use. As such, if trust is to be established, the security and privacy measures need to be provided in accordance with what users expect to see and are comfortable to use in a given context. Furthermore, how much each person values her privacy is a subjective issue. When a bank uses the credit history of a client without her consent in order to issue a pre-signed credit card, it is subjective whether the client will feel upset about it and press charges for breach of the personal data protection Act or not. Providing a way to model this subjective nature of privacy would be extremely useful for organisations, in the sense that they would be able to estimate the financial losses that they may experience after a potential privacy violation incident. This would allow them to reach cost-effective decisions about the money that they invest in security and privacy protection. This paper examines the relationship between these factors, and the different strategies that may be needed in order to provide an adequate foundation for users’ trust. It has been recognised that users often lack confidence that sufficient security and privacy safeguards can be delivered from a technology perspective, and therefore require more than a simple assurance that they are protected.
In this respect, Section 2 first investigates social trust factors for establishing reliable end user trust and then presents a Trust Evaluation Function, which utilises these trust factors and supports the user in reaching more informed decisions about the trustworthiness of online services. Even then, however, some users will not be satisfied with technology-based assurances. Consequently, Section 3 considers the extent to which risk mitigation can be offered via routes such as insurance. The discussion concludes with Section 4, which highlights a series of further open issues that also require attention in order for trust to be more firmly and widely established.
2 Trust in Online Services
2.1 Users’ Perception of Security and Privacy and Lack of Trust
“Trust is important because if a person is to use a system to its full potential, be it an e-commerce site or a computer program, it is essential for her to trust the system” (Johnston et al. 2004). For establishing trust, a significant issue will be the user’s perception of security and privacy within a given context. Indeed, the way users feel about a given site or service is very likely to influence their ultimate decision about whether or not to use it. While some may be fully reassured by the presence of security technology, others may be more interested in other facts, such as the mitigation and restitution available to them in the event of breaches.
Research conducted in the UK as part of the Trustguide project has investigated citizens’ trust in online services and their resultant views on the risks present in this context (Lacohee et al. 2006). Trustguide was part-funded by the UK Government (through what was then the Department for Trade & Industry), and involved collaboration between British Telecom, HP Labs, and the University of Plymouth. The aim of the project was to better understand attitudes towards online security and thus enable the development of more effective ICT-based services. In order to investigate the issue, a series of ten focus groups was run with different types of UK citizen across six geographic locations. The categories of participant were: undergraduate students, postgraduate students, SMEs (three groups), farmers, ICT novices, ICT experts, and citizens (two groups). All of the groups followed the same discussion guide, and were professionally facilitated. The topic areas addressed included a significant focus upon trust (from the perspective of citizens’ own use of online services, as well as their trust in those organisations that might gather their private data), as well as surrounding issues such as identity management and authentication, and the collection, storage and protection of data. One of the most significant findings was that the degree to which trust and confidence could be built into systems from the users’ perspective was likely to be limited. The research actually revealed a high degree of distrust in ICT-based services, with the focus group participants repeatedly voicing a belief that it is impossible to guarantee that electronic transactions or stored data are secure against attack. Indicative examples of the comments that emerged on this theme are presented below:
“The real issue is that we know from our experience of the Internet and everything else that nobody has ever yet made anything secure.
Whatever kind of encryption you’ve got, it can be broken.”
“We know that no one has ever built a secure system, nothing held electronically can be secure. Banks (etc) should be more honest and say that data is never secure, and they should be open about the risks.”
“Given that it’s actually impossible to make a secure system, perhaps banks and all the rest of them should stop telling us that it is secure, and rather, they should be taking measures to try and make it as secure as possible but assume that sooner or later it will be hacked, it will be broken into and with that assumption in mind, then what are the procedures?”
Usability tests of privacy-enhancing identity management prototypes performed within the EU FP6 project PRIME have also shown that it is difficult to make people trust the claims about the privacy-enhancing features of the systems (see Fischer-Hübner and Pettersson 2004, Andersson et al. 2005). Although test users were first introduced to the aims and scope of privacy-enhancing identity management, the tests revealed that many of them did not trust the claim that the tested system would really protect their data and their privacy. Some participants voiced doubts over the whole idea of attempting to stay private on the Net: “Internet is insecure anyway” because people must get information even if it is not traceable by the identity management application, explained one test participant in a post-test interview. Another test subject stated: “It did not agree with my mental picture that I could buy a book anonymously”.
Exploring Trust, Security and Privacy in Digital Business
Another factor contributing to the lack of trust that was revealed by our usability tests was that test subjects generally had difficulty mentally differentiating between user side and services side identity management. In post-test interviews the test subjects sometimes referred to functionalities from both the website and the user side identity management system as if these were one. Consequently, they also had difficulty understanding that the user side identity management console, where the user can manage her electronic identities, can be trusted by the user because it is within the user’s control, whereas the website is under the service provider’s control. Similar findings of a lack of trust in privacy-enhancing technologies were also reported by others, e.g. by Günther and Spiekermann in a study on the perception of user control with privacy-enhancing identity management solutions for RFID environments, even though the test users considered the PETs in this study fairly easy to use (Günther and Spiekermann 2005). To help users evaluate the trustworthiness of a services side, the focus has to be on mediating factors to the users that measure the services side’s actual trustworthiness and that support trustworthy behavior of a side (Riegelsberger et al. 2005).
2.2 Social Trust Factors
In this section, we investigate suitable parameters corresponding to social trust factors for measuring the actual trustworthiness of a communication partner, in terms of privacy practices and of reliability as a business partner, and for establishing reliable trust. Social trust factors in the context of e-Commerce have already been researched by others. For instance, Turner (2001) showed that for ordinary users to feel secure when transacting with a website the following factors play a role: 1. the company’s reputation, 2. their experiences with the website, and 3. recommendations from independent third parties. Riegelsberger et al. (2005) present a trust framework which is based on contextual properties (based on temporal, social and institutional embeddedness) and the services side’s intrinsic properties (ability, motivation based on internalized norms, such as privacy policies, and benevolence) that form the basis of trustworthy behavior. Temporal embeddedness can be signalled by visible investment in the business and the side, as e.g. visualised by professional website design, which can also be seen as a symptom of the vendor’s intrinsic property of competence or ability to fulfill a contract. Taking into consideration the phenomenon that many users have problems differentiating between user and services side, these factors of professional design should in general be taken into account for the UI design of the PrimeLife trust evaluation function, even though it is not part of the vendor’s website but part of the user side identity management system. Social embeddedness, i.e. the exchange of information about a side’s performance among users, can be addressed by reputation systems. Institutional embeddedness refers to the assurance of trustworthiness by institutions, as done with trust seal programs. A model of social trust factors, which was developed by social science researchers in the PRIME project (Leenes et al. 2005), (Andersson et al. 2005), has identified 5 layers on which trust plays a role in online services: socio-cultural, institutional, service area, application, and media. Service area-related trust aspects which concern the trust put in a particular branch or sector of economic activity, as well as socio-cultural trust aspects
S. Fischer-Hübner, S. Furnell, and C. Lambrinoudakis
can however not be directly influenced by system designers. More suitable factors for establishing reliable trust can be achieved on the institutional and application layers of the model, which also refer to trust properties (contextual property based on institutional embeddedness as well as certain intrinsic properties of a web application) of the framework by Riegelsberger et al. (2005). As discussed by Leenes et al. (2005), on the institutional layer, trust in a service provider can be established by monitoring and enforcing institutions, such as data protection commissioners, consumer organisations and certification bodies. Besides, on the application layer, trust in an application can be enhanced if procedures are clear, transparent and reversible, so that users feel in control. This latter finding also corresponds to the results of the aforementioned Trustguide project, which provides guidelines on how cybertrust can be enhanced and concludes that increased transparency brings increased user confidence. Moreover, rather than receiving assurances of technological security, which many would perceive to be unrealistic, the research of the Trustguide project suggests that three other elements would help people feel more secure in their use of e-commerce transactions:
• confidence that restitution can be made by a third party. Hence, measures that are in place in the event of something going wrong should be clearly stated;
• assurances about what can and cannot be guaranteed;
• the presence of fallback procedures if something goes wrong.
2.3 A Trust Evaluation Function
In this section, we present a trust evaluation function that has been developed within the PrimeLife EU project. This function has the purpose of communicating reliable information about the trustworthiness and assurance (that the stated privacy functionality is provided) of services sides. For the design of this trust evaluation function, we have followed an interdisciplinary approach by investigating social factors for establishing reliable trust, technical and organizational means, as well as HCI concepts for mediating evaluation results to the end users.
Trust Parameters Used: Taking results of the studies on social trust factors presented in section 2.2 as well as available technical and organisational means into consideration, we have chosen the following parameters for evaluating the trustworthiness of communication partners that mainly refer to the institutional and application layers of the social trust factor model. Information provided by trustworthy independent monitoring and enforcing institutions, which we are utilising for our trust evaluation function, comprises:
• Privacy and trust seals certified by data protection commissioners or independent certifiers (e.g., the EuroPrise seal³, the TRUSTe seal⁴ or the ULD Gütesiegel⁵).
• Blacklists maintained by consumer organisations (such blacklists exist, for example, in Sweden and Denmark).
• Privacy and security alert lists, such as lists of alerts raised by data protection commissioners or Google’s anti-phishing blacklist.
³ https://www.european-privacy-seal.eu/
⁴ http://www.truste.org/
⁵ https://www.datenschutzzentrum.de/guetesiegel/index.htm
The European Consumer Centres have launched a web-based solution, Howard the owl, for checking trust marks and other signs of trustworthiness that could be used as well when evaluating a web shop⁶. Static seals can be complemented by dynamic seals (generated in real time) conveying assurance information about the current security state of the services side’s system and its implemented privacy and security functions. Such dynamic seals can be generated in real time by an “Assurance Evaluation” component that has been implemented within the PRIME framework (Pearson 2006). Dynamic seals that are generated by tamper-resistant hardware can be regarded as third-party endorsed assurances, as the tamper-resistant hardware device can be modeled as a third party that is not under full control of the services side. Such dynamic assurance seals can measure the intrinsic property of a side’s benevolence to implement privacy-enhancing functionality. Such functionality can also comprise transparency-enhancing tools that allow users to access, and to request to rectify or delete, their personal data online (as implemented within the PrimeLife project), which will allow users to “undo” personal data releases and to feel in control. As discussed above, this is an important prerequisite for establishing trust. For our trust evaluation function, we therefore used dynamic assurance seals informing about the PrimeLife privacy-enhancing functions that the services side’s system has implemented.
Reputation metrics based on other users' ratings can also influence user trust, as discussed above. Reputation systems, such as the one in eBay, can however often be manipulated by reputation forging or poisoning. Besides, the calculated reputation values are often based on subjective ratings by non-experts, for whom it might, for instance, be difficult to judge the privacy-friendliness of communication partners. So far, we have therefore not considered reputation metrics for the PrimeLife trust evaluation function, even though we plan to address them in future research and versions of trust evaluations within the PrimeLife project.
Following the process of trust and policy negotiation of the PRIME technical architecture (on which PrimeLife systems are also based), privacy seals, which are digitally signed by the issuing institution, as well as dynamic assurance seals can be requested from a services side directly (see steps 4-5 in Figure 1), whereas information about blacklisting and alerts needs to be retrieved from the third party list providers (see steps 6-7 in Figure 1). After the user requests a service (step 1), the services side replies with a request for personal data and a proposal of a privacy policy (step 2). For evaluating the side’s trustworthiness, the user can then in turn request trust and assurance data and evidence from the services side, such as privacy seals and dynamic assurance seals (steps 4-5), and information about blacklisting or alerts concerning this side from alert list or blacklist providers (steps 6-7). Information about the requested trust parameters is then evaluated at the user side and displayed via the trust evaluation user interfaces along with the privacy policy information of the services side within the “Send Personal Data?” dialogue window (see below), with which the user’s informed consent for releasing the requested data for the stated policy is also solicited. The user can then, based on the trust evaluation results and policy information, decide on releasing the requested personal data items and possibly adopt the proposed policy, which is then sent back to the service provider (step 8).
⁶ ready21.dev.visionteam.dk
Fig. 1. Privacy and trust policy negotiation in PRIME and PrimeLife
Design Principles and Test Results: For the design of our trust evaluation function mock-ups, we applied the following design principles, comprising general HCI principles as well as design principles that should in particular address challenges and usability problems that we have encountered in previous usability tests:
• Use a multi-layered structure for displaying evaluation results, i.e. trust evaluation results should be displayed in increasing detail on multiple layers in order to prevent an information overload for users not interested in the details of the evaluation. Our mockups have been structured into three layers: a short status view with the overall evaluations for inclusion in status bars and in the “Send Personal Data?” window (1st layer, see Figure 2), which also displays the services side’s short privacy policy and data request; a compressed view displaying the overall results within the categories privacy seals, privacy & security alert lists, support of PRIME functions and blacklisting (2nd layer); and a complete view showing the results of sub categories (3rd layer, see Figure 3).
Fig. 2. “Send Personal Data?” window displaying the overall trust evaluation result (1st Layer)
• Use a selection of meaningful overall evaluation results. For example, in our mockups, we use a trust meter with a range of three possible overall evaluation results that convey their meaning through their names (which should be more meaningful than, for instance, percentages as used by some reputation metrics). The three overall results that we are using are (see trust meter in Figure 2):
  o “Poor” symbolised with a sad-looking emoticon and red background colour (if there are negative evaluation results, i.e. the side is blacklisted or appears on alert lists);
  o “Good” symbolised with a happy-looking smiley and green background colour (if there are no negative, but some positive results, i.e. the side has a seal or supports PrimeLife functions and is not appearing on black/alert lists);
  o “Fair” symbolised with a white background colour (for all other cases, i.e. the side has no seal, is not supporting PrimeLife functions, and is not appearing on black/alert lists).
• Make clear who is evaluated - this is especially important because, as we mentioned above, our previous usability tests have revealed that users often have difficulty differentiating between user and services side (Pettersson et al. 2005). Hence, the user interface should make clear by its structure (e.g., by surrounding all information referring to a requesting services side, as illustrated in the “Send Personal Data?” window in Figure 2), and by wording, that the services side and not the user side is evaluated. If this is not made clear, a bad trust evaluation result for a services side might also lead to reduced trust in the user side IDM system.
Fig. 3. Complete view of the Trust Evaluation Function (3rd layer) displaying the overall results
• Structure the trust parameters visible on the second and third layers into the categories “Business reliability” (comprising the parameter “blacklisted”) and “Privacy” (comprising the parameters of security & privacy alert lists, privacy seals and PrimeLife function support). This structure should illustrate that the trust parameters used have different semantics and that scenarios with companies that are “blacklisted” for bad business practices, even though they have a privacy seal and/or support PrimeLife functions, do not have to be contradictory, as they refer to different aspects of trustworthiness.
• Inform the user without unnecessary warnings - our previous usability tests showed that extensive warnings can be misleading and can even result in users losing their trust in the PrimeLife system. It is a very difficult task for the systems designer to find a good way of showing an appropriate level of alerting: for instance, if a web vendor lacks any kind of privacy seal, this in itself is not a cause for alarm, as most sites at present do not have any kind of trust sealing. We also did not choose the colour “yellow” for our trust meter for symbolizing such an evaluation result that we called “fair” (i.e. we did not use the traffic light metaphor), as yellow already symbolises a state before an alarming “red” state.
First usability tests for three iterations of our PrimeLife trust evaluation function mockups were performed in the Ozlab testing environment of Karlstad University in two rounds with ten test persons each and one round with 12 test persons. The tests clearly showed that such a function is much appreciated by end users. The presentation of overall evaluation results on the top level, especially the green and red emoticons, as well as the fact that the services side was evaluated, were well understood. Some users had problems though in understanding the “neutral” evaluation result (in case a side has no seal, is not supporting PrimeLife functions, is not blacklisted and does not appear on alert lists), which we first phrased as “ok”, and then as “fair”. However, in the post-test interviews, there were no clear preferences for other names (such as “Not bad”, “No alert”). Hence, the illustration of “neutral” results is one of the most difficult issues and still needs to be investigated further (see also (Fischer-Hübner et al. 2009)).
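The mapping from trust parameters to the three overall trust meter results described in the design principles above can be sketched in a few lines of Python. This is only an illustrative sketch, not the PrimeLife implementation: the names `TrustParameters`, `TrustResult` and `overall_result` are hypothetical, and the four boolean inputs are assumed to have already been obtained from the seal, blacklist and alert list checks.

```python
from dataclasses import dataclass
from enum import Enum


class TrustResult(Enum):
    """The three overall results shown by the trust meter."""
    POOR = "Poor"
    FAIR = "Fair"
    GOOD = "Good"


@dataclass(frozen=True)
class TrustParameters:
    """Evidence about a services side (hypothetical field names)."""
    blacklisted: bool                   # category "Business reliability"
    on_alert_list: bool                 # category "Privacy"
    has_privacy_seal: bool              # category "Privacy"
    supports_primelife_functions: bool  # category "Privacy"


def overall_result(p: TrustParameters) -> TrustResult:
    # Any negative result dominates: a blacklisted side or a side that
    # appears on an alert list is rated "Poor", even if it has a seal.
    if p.blacklisted or p.on_alert_list:
        return TrustResult.POOR
    # No negative results and at least one positive result: "Good".
    if p.has_privacy_seal or p.supports_primelife_functions:
        return TrustResult.GOOD
    # Neither negative nor positive evidence: the neutral "Fair" result.
    return TrustResult.FAIR


# Example: a side with a privacy seal that is nevertheless blacklisted.
sealed_but_blacklisted = TrustParameters(
    blacklisted=True,
    on_alert_list=False,
    has_privacy_seal=True,
    supports_primelife_functions=False,
)
print(overall_result(sealed_but_blacklisted).value)
```

Checking the negative parameters first reflects the rule that a seal or PrimeLife support does not offset blacklisting; the more detailed second and third layers would still show the individual parameters separately.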
3 Mitigating and Transferring Security and Privacy Risks
The fact that many participants in the Trustguide project were nonetheless using services that they did not ultimately trust (from a technological perspective) was frequently linked to their beliefs that risk was mitigated in some other way. This was most clearly evident in relation to financial transactions involving credit cards, as illustrated by the following quotes:
“If I’m buying something online with a credit card in the back of my mind, even if I don’t trust the site, is that my credit card is protected against that, they say in their blurb that if there is an unfortunate incident like that then they will pay.”
“I’m not worried about giving out my card details, I’ve got mitigation insurance, if a card gets cloned, kill the account, it’s the bank’s problem.”
Thus, from the users’ perspective there will be a demonstrable reduction in perceived risk in cases where the responsibility for handling any negative outcome is thought to rest with a third party. In reality, however, there will be limits to the extent to which such beliefs can realistically hold true. For example, both of the viewpoints quoted above are overlooking the potential for wider impacts, which the bank or credit card company may not be able to rectify. For example, even if the bank can prevent the victim from suffering the direct financial impact of an incident, it could still take months to clear up issues such as consequent damage to credit ratings. As an example of this, past estimates on the cost of identity theft have suggested that such incidents cost victims an average of $808 and require 175 hours of effort to put things right (Benner et al. 2000).
The Trustguide research as a whole revealed that assumptions regarding naivety of users are becoming less valid, and although many may be novices from the perspective of ICT knowledge, this does not mean that they are innocent in their use of online services. Indeed, many are well informed regarding the potential risks, and do not enter with a belief that they are secure. Their engagement is based upon a personal risk assessment (albeit conducted to varying levels of detail and effectiveness), in which they are weighing the perceived risk against the potential benefits, and engagement is more likely to occur in cases where they can clearly see the potential problems and understand how they could be rectified. There are, of course, two challenges posed by these findings. Firstly, the effects of some security breaches cannot be easily mitigated (e.g. overcoming a case of identity theft may be a non-trivial proposition, and beyond the capabilities that a single online service provider could guarantee to provide), and thus the user really needs to fall back upon a primary reliance upon the security technology to prevent certain categories of incident from occurring in the first place. Secondly, if the user’s default position is to consider that the technology cannot provide them with sufficient protection, it places them at a disadvantage in terms of placing trust in technologies that may be genuinely fit for purpose (i.e. they will not be getting as much reassurance from the presence of the technology as they should do). From the organisations’ side, the typical reaction of IT officials to the rapidly increasing number of threats and the highly sophisticated methods utilised for realising new attacks is to protect their systems through a series of technical security measures. 
However, in the absence of a scientifically sound methodology for evaluating the cost-effectiveness of the security measures employed, the problem is that they are unable to quantify the security level of their system and thus to determine the appropriate amount that they should invest for its protection. Cavusoglu et al. (2004) have calculated that, on average, compromised organisations lost approximately 2.1% of their market value within two days from the day of the incident. Moitra and Konda (2003) have demonstrated that as organisations start investing in information system security their protection increases rapidly, while it increases at a much slower rate as the investments reach a much higher level. It is therefore essential to facilitate ways for identifying how far organisations should go in investing for security, as well as for evaluating the effectiveness of the security measures that they implement.
An additional issue for organizations that collect, store and process personal and/or sensitive data of their customers is the protection of their privacy. It is true that some people are really concerned about privacy protection issues, while others are not. This kind of diversity results in different estimations about the consequences that may occur in case of a privacy violation incident. It would therefore be really useful to provide organizations with appropriate models for estimating the expected impact level, in terms of the compensation that an individual may claim after a privacy breach. However, privacy valuation is by no means a trivial issue. A usual security incident may be valued in an objective way. For example, if due to some security incident the internet site of a company is unavailable for 1 hour, then it is quite straightforward to estimate the financial value of this incident in an objective manner by statistical estimation of the possible number of clients and their potential purchases within this period. The situation is more difficult when it comes to privacy. For instance, when somebody’s telephone number is disclosed, it is rather subjective whether one should care or not, or whether she will decide to press charges and ask for compensation. And given that this has happened, what is the likely amount of the compensation? Again this is a very personal thing (no matter whether the court grants the compensation or not) and should somehow be related to how much the particular client values her privacy.
Given that the beliefs of users (and indeed organisations) are turning towards risk mitigation rather than a reliance upon technological safeguards, it is relevant to consider alternative options that exist and how they can be used. One such option for organizations is to insure their information systems against potential security and privacy violation incidents, aiming to balance the consequences that they will experience, in terms of financial losses, through the compensation that they will get from the insurance company. It should be emphasized that such an approach cannot and will not “replace” the technical security and privacy-enhancing measures; it will complement them. Even in that case, though, the difficulty for the insurance company is the calculation of the appropriate premium.
3.1 Insuring an Information System: Premium Calculation
Recently, there has been considerable interest from the Economics community in addressing the issue of insurance contracts for information systems. Indicatively, Anderson (2001) applies economic analysis and employs the language of microeconomics (network externalities, asymmetric information, moral hazard, adverse selection, liability dumping etc) for explaining a number of phenomena that security researchers had previously found to be pervasive but perplexing. Also, Gordon and Loeb (2002) present an economic model for determining the optimal amount to invest for protecting a given set of information. Finally, Varian (2004) constructs a model based on economic agents’ decision-making on effort spent, to study systems reliability.
An insurance company, in order to calculate a premium that covers a car against theft or fire, must at least have an accurate estimate of the car’s current value. If the client provides additional information, for instance that a car alarm is installed, this is evaluated by the insurance company and may result in a reduced premium. By analogy, an insurance company calculating the premium for an information system will seek the following information:
• What is the financial loss that the organisation will experience as a result of every possible security incident?
• How secure (i.e. how well protected against potential risks) is the information system?
However, none of the above questions can be answered in a straightforward and accurate way, mainly because of the following facts:
a) Every day new threats are appearing. How can someone quantify the consequences of a potential security incident if she doesn’t even know which are the major threats that the information system is facing?
b) The effectiveness of a security measure cannot be presented in quantitative terms. It can only be evaluated during real attacks against the system, after it has been installed and integrated into the system’s operation. However, even in this case the evaluation cannot be accurate since there is no way to know if a specific security measure has really prevented a security incident or not. This is analogous to a home alarm system. If there is no record of a theft attempt, we don’t really know if this is because the home alarm has prevented it or because it simply didn’t happen irrespective of the home alarm.
c) Finally, the environment of the information system has a significant impact on both the number and severity of potential threats and on the effectiveness of the security measures. For instance, the security requirements identified for an internet-based system are not the same as those that apply when a wireless network is utilised instead. Also, an authentication mechanism may be extremely effective for the internet-based system but not for the wireless environment.
In (Lambrinoudakis et al. 2005) a probabilistic structure in the form of a Markov model is presented for facilitating the estimation of the insurance premium as well as the valuation of the security investment. Let us assume that the information system may end up in one of N different states after possible security incidents that affect a single asset $A_k$, where $k = 1, \dots, M$. We will denote these states by $i$, where $i = 1, \dots, N$. By $i = 0$ we will denote the state where no successful attack has been made on the information system and thus it is fully operational. We assume that at time $t = 0$ the information system is in the fully operational state $i = 0$ and that as time passes it will end up in one of the non-fully operational states $i = 1, \dots, N$. We assume that the transition rates from state 0 to any other state $i$, as a result of a security incident compromising asset $A_k$, are known. Furthermore, the impact (financial loss) $L_i$ of every possible security incident has been computed.
Assuming that the transitions allowed are from the fully operational state to some other non-fully operational state and that the non-operational states are absorbing states, the use of the Markov model allows us to find the probability of the system being in different states and thus find the probability of different financial losses (Lambrinoudakis et al. 2005). Let us assume that the organisation has a utility function for its data, let us say $U$. Since in this simple model we assume that all consequences resulting from a security incident can be translated to financial losses, it is reasonable to assume that this utility function expresses the views of the organisation towards financial losses, i.e. it provides a way to evaluate how important a financial loss caused by a security incident is for the organisation. Then the optimal security investment can be calculated by maximizing the expected utility for the organisation:
\[ \max_{I} \; E\left[ U\big(W - L(I) - I\big) \right] \]
where:
I is the maximum amount available for security measures,
W is the initial wealth of the company, and
L is the expected loss, which of course depends on the amount I.
Similarly, the optimal insurance contract should satisfy the following equation, which can thus be utilised for calculating the insurance premium:
\[ U(W - \pi) = E\left[ U(W - L + C - \pi) \right] \]
where:
W is the initial wealth of the company,
π is the premium that the company has to pay to the insurer,
L is the expected loss, and
C is the compensation that the insurer will pay in case of a security incident.
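To make the Markov model above more concrete, the following Python sketch computes the state probabilities and the resulting expected financial loss. It rests on assumptions that are not stated in the text: the transition rates are taken to be constant over time (a time-homogeneous continuous-time Markov chain), and the function names, rates and loss figures are purely illustrative. Under these assumptions, with rates $\lambda_i$ from state 0 to the absorbing states $i$ and $\Lambda = \sum_i \lambda_i$, the probabilities are $P_0(t) = e^{-\Lambda t}$ and $P_i(t) = (\lambda_i / \Lambda)(1 - e^{-\Lambda t})$.

```python
import math
from typing import List, Sequence


def state_probabilities(rates: Sequence[float], t: float) -> List[float]:
    """Return [P_0(t), P_1(t), ..., P_N(t)].

    rates[i-1] is the (assumed constant) transition rate from the fully
    operational state 0 to the absorbing state i.
    """
    total_rate = sum(rates)
    if total_rate == 0.0:
        # No incident can occur, so the system stays fully operational.
        return [1.0] + [0.0] * len(rates)
    p0 = math.exp(-total_rate * t)  # probability that no incident happened
    # The probability mass that left state 0 is split in proportion
    # to the individual transition rates.
    return [p0] + [(r / total_rate) * (1.0 - p0) for r in rates]


def expected_loss(rates: Sequence[float],
                  losses: Sequence[float],
                  t: float) -> float:
    """Expected financial loss at time t, i.e. the sum of P_i(t) * L_i."""
    probs = state_probabilities(rates, t)
    return sum(p * loss for p, loss in zip(probs[1:], losses))


# Illustrative figures only: two incident types, rates per year.
example_rates = [0.10, 0.05]
example_losses = [20_000.0, 100_000.0]
print(state_probabilities(example_rates, t=1.0))
print(expected_loss(example_rates, example_losses, t=1.0))
```

The loss distribution obtained this way is the input that the expected utility expressions above require; the choice of the utility function itself is left open here.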
This approach is useful only in cases where the transition rates and the loss (impact value) figures are accurate (objective).
3.2 Insuring an Information System: Taking into Account Privacy Violation Incidents
If we then want to study the risk that a firm is exposed to due to potential privacy incidents that it may have caused, we definitely need a subjective theory that will allow us to find out how much individuals value their privacy. In an attempt to describe this complicated task, keeping it as free as possible from technicalities, Yannakopoulos et al. (2008) introduce a simple model that incorporates the personalized view of how individuals perceive a possible privacy violation and, if that happens, how much they value it. Our basic working framework is the random utility model (RUM), which has been extensively used in the past for modelling personalized decisions regarding financial issues, for instance: How much is someone prepared to pay for this type of car? This model takes into account cases where the same individual may at one time consider a privacy violation annoying whereas at another time she may not be bothered about it at all. Therefore it allows for the subjective nature of the problem. We assume that the individual j may be in two different states: State 0 refers to the state where no personal data is disclosed, while State 1 refers to the state where personal data has been disclosed. The level of satisfaction of the individual j in state i = 0, 1 is given by the random utility function $u_{i,j}(y_j, z_j) + \varepsilon_{i,j}$, where:
$y_j$ is the income (wealth) of the individual,
$z_j$ is a vector related to the characteristics of the individual, e.g. age, occupation, technology aversion etc., and
$\varepsilon_{i,j}$ is a term that will be considered as a random variable and models the personalized features of the individual j at state i.
State 1, the state of privacy loss, will be disturbing to individual j as long as $u_{1,j}(y_j, z_j) + \varepsilon_{1,j} < u_{0,j}(y_j, z_j) + \varepsilon_{0,j}$, and that may happen with probability $P(\varepsilon_{1,j} - \varepsilon_{0,j} < u_{0,j}(y_j, z_j) - u_{1,j}(y_j, z_j))$. This is the probability that an individual will be bothered by a privacy violation; it may be calculated as long as we know the distribution of the error term, and it depends on the general characteristics of the individual. Given that an individual j is bothered by a privacy violation, how much would she value this violation, that is, how much would she like to be compensated for it? If the compensation is $C_j$, then it satisfies the random equation $u_{1,j}(y_j + C_j, z_j) + \varepsilon_{1,j} = u_{0,j}(y_j, z_j) + \varepsilon_{0,j}$, the solution of which yields a random variable $C_j$, the (random) compensation that an individual may ask for a privacy violation. The distribution of the compensation depends on the distribution of the error terms as well as on the functional form of the deterministic part of the utility function.

We now assume that a series of claims $C_j$ may arrive at certain random times $N_j$. We need a satisfactory model for $N_j$ and for $N(t)$, the total number of claims up to time t. Assuming that the arrivals are modelled by a Poisson distribution Pois(λ), the total claim up to time t is given by the random sum:

$$L(t) = \sum_{i=1}^{N(t)} C_i$$
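Stepping back to the individual level, the two quantities derived from the random utility model (the probability of being bothered and the compensation asked) can be made concrete with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from (Yannakopoulos et al. 2008): the error difference $\varepsilon_{1,j} - \varepsilon_{0,j}$ is taken to be standard logistic, and the deterministic utility is taken to be linear in income, $u_i(y) = a_i + b\,y$, with hypothetical parameters `a0`, `a1` and `b`.

```python
import math


def prob_bothered(u0: float, u1: float) -> float:
    """P(eps1 - eps0 < u0 - u1), ASSUMING eps1 - eps0 is standard logistic."""
    return 1.0 / (1.0 + math.exp(-(u0 - u1)))


def compensation(a0: float, a1: float, b: float, eps0: float, eps1: float) -> float:
    """Solve a1 + b*(y + C) + eps1 = a0 + b*y + eps0 for C.

    ASSUMES a utility linear in income, u_i(y) = a_i + b*y, so that the
    income y cancels out of the equation.
    """
    if b <= 0:
        raise ValueError("marginal utility of income b must be positive")
    return (a0 - a1 + eps0 - eps1) / b


# Hypothetical individual: state 0 is worth one utility unit more than state 1.
p = prob_bothered(u0=2.0, u1=1.0)
c = compensation(a0=2.0, a1=1.0, b=0.5, eps0=0.0, eps1=0.0)
print(p, c)
```

Under these assumptions the compensation is positive exactly when the individual is bothered, since both conditions reduce to $a_0 - a_1 + \varepsilon_0 - \varepsilon_1 > 0$.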
The distribution of $L(t)$ depends on the distribution of the $C_i$ and on the distribution of the counting process $N(t)$, assuming that our population is homogeneous, i.e. the $C_i$ are independent and identically distributed. Assuming independence between $N(t)$ and the size of the arriving claims $C_i$, we may calculate the expected total claim and its variance:

$$E[L(t)] = E[N(t)]\,E[C]$$

$$\mathrm{Var}(L(t)) = \mathrm{Var}(N(t))\,(E[C])^2 + E[N(t)]\,\mathrm{Var}(C)$$
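The two moment formulas can be evaluated directly once the moments of the claim count and of the claim size are known. The following is a minimal sketch; the function names and the numerical values are hypothetical, and the Poisson helper relies only on the standard fact that a Poisson count has equal mean and variance.

```python
def total_claim_moments(mean_n: float, var_n: float,
                        mean_c: float, var_c: float) -> tuple[float, float]:
    """Return (E[L(t)], Var(L(t))) for a random sum of N(t) claims.

    Assumes, as in the text, that claim sizes are independent of N(t).
    """
    expected = mean_n * mean_c
    variance = var_n * mean_c ** 2 + mean_n * var_c
    return expected, variance


def poisson_total_claim_moments(lam: float, mean_c: float,
                                var_c: float) -> tuple[float, float]:
    """Special case where N(t) is Poisson with mean lam: E[N] = Var(N) = lam."""
    return total_claim_moments(lam, lam, mean_c, var_c)


# Hypothetical figures: on average 10 claims, each with mean 200 and variance 2500.
expected, variance = poisson_total_claim_moments(10.0, 200.0, 2500.0)
print(expected, variance)
```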
Consider that an organization enters into an insurance contract with an insurance firm that undertakes the total claim $X = L(t)$ that its clients may ask for, as a consequence of privacy breaches, over the time t of validity of the contract. This of course is done in exchange for a premium paid by the IT business to the insurer, but how much should this premium be? Examples of premium calculations are $\pi(X) = (1 + \alpha)E[X]$ or $\pi(X) = E[X] + f(\mathrm{Var}(X))$, where $\alpha$ is called the safety loading factor, and usual choices for $f$ are $f(x) = \alpha x$ or $f(x) = \alpha\sqrt{x}$. Since the expectation and the variance of $X = L(t)$ can be calculated within the context of the random utility model, the premium calculation is essentially done.
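The premium rules above translate into a few lines of code. This sketch assumes that the second choice for f is the square-root form (the standard deviation principle); the function names and the input figures are hypothetical.

```python
import math


def expected_value_premium(mean_x: float, alpha: float) -> float:
    """pi(X) = (1 + alpha) * E[X]."""
    return (1.0 + alpha) * mean_x


def variance_premium(mean_x: float, var_x: float, alpha: float) -> float:
    """pi(X) = E[X] + alpha * Var(X), i.e. f(x) = alpha * x."""
    return mean_x + alpha * var_x


def std_dev_premium(mean_x: float, var_x: float, alpha: float) -> float:
    """pi(X) = E[X] + alpha * sqrt(Var(X)), i.e. f(x) = alpha * sqrt(x)."""
    return mean_x + alpha * math.sqrt(var_x)


# Hypothetical moments of the total claim X = L(t).
mean_x, var_x = 2000.0, 10000.0
print(expected_value_premium(mean_x, 0.25))
print(variance_premium(mean_x, var_x, 0.01))
print(std_dev_premium(mean_x, var_x, 0.5))
```

Note that the loading factor has different units in each rule, so the same numerical value of alpha is not comparable across them.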
4 Conclusions and Further Issues

The paper has highlighted the importance of ensuring trust in digital business and identified the important contributions made by security and privacy in this context. It has also demonstrated that while technological safeguards have a part to play in providing the trust basis, they cannot be considered to be a complete solution from the user perspective. It should also be recognised that although the technological aspects presented in this paper represent key issues, they are far from the only ones that need to be considered. Indeed, several research challenges remain open, and some further areas of current activity are listed below:
•
Multi-Lateral Secure Reputation Systems: As discussed above, reputation schemes based on other users' ratings can influence user trust. For this, meaningful reputation metrics and fair ways of aggregating reputation values are needed, which protect against reputation poisoning and forging attacks. Moreover, reputation schemes need to be transferable and interoperable (allowing reputations to be transferred between communities/groups), and as reputation information also comprises personal data, reputation schemes need to be developed that are privacy-respecting. Reputation systems addressing the latter requirements are currently being researched within the scope of the FIDIS⁷ and PrimeLife EU projects (see (Steinbrecher 2009)).
•
Transparency Tools for Enhancing Privacy for End Users: Transparency of personal data processing is a basic privacy requirement that can be derived from European data protection legislation (cf. the rights of data subjects to be informed/notified about the processing of their data (Art. 10-11 EU Directive 95/46/EC) and the right to access their data (Art. 12 EU Directive 95/46/EC)). Moreover, as discussed in section 2.2, increased transparency also brings increased user confidence and trust. Transparency tools for enhancing privacy include privacy policy evaluation tools (e.g., based on the P3P standard (P3P 2006)) or systems that keep track for users of which personal data they have released to which services side under which privacy policy, and that allow users to access and correct or delete online their data stored at the services sides (such as the Data Track developed in PRIME (Pettersson et al. 2006)). Further research is currently being conducted within the PrimeLife EU project on tools for informing users about the privacy implications of future actions based on computing the linkability of their transactions (Hansen 2008b). In addition, within the PrimeLife project, transparency tools are currently being developed that allow users to check whether their data have been processed in a legally compliant manner; for this, secure logs have to be stored at the services sides that can only be accessed by the users or their proxies (e.g. by data protection commissioners). Furthermore, research and development on tools for informing users about passive (hidden) data collection will be needed for preserving transparency in future ubiquitous computing environments. (See (Hansen 2008a), (Hedbom 2009), (Hildebrandt 2009) for further surveys of transparency tools for enhancing privacy.)
•
Usable security and privacy: Given that many of the reservations about performing online transactions relate to a lack of confidence in technology, users need to be in a position to accept and trust the safeguards that are provided to compensate. However, it is widely recognised that much of the reassurance that ought to be provided by security technologies can quickly be undermined if users cannot actually understand or use them appropriately (Furnell et al. 2006). Usability challenges can arise from a variety of directions, including over-reliance upon users’ technical knowledge and capability, presenting solutions that are over-complex or cumbersome to use, and technologies that simply get in the way of what the user is actually trying to do. These aspects consequently introduce the possibility of mistakes and mis-configuration, as well as the potential for safeguards to be turned off altogether if they are perceived to be too inconvenient. Usable solutions therefore demand attention in terms of clarity at the user interface level, as well as in terms of limiting the overheads that may be introduced in terms of system performance and interruptions to user activity.
⁷ www.fidis.net
•
Multilateral security addressing privacy & trust in social communities: Social communities, such as online auctions and marketplaces, are increasingly used for conducting Digital Business. In addition, companies increasingly perform social network analyses for profiling business partners or job applicants, or for conducting direct marketing with the help of social network profiles. Whereas today’s privacy-enhancing technologies usually enforce the privacy principle of data minimisation (i.e. allowing users to release as little data as possible), novel kinds of privacy-enhancing technologies will be needed for protecting social community users, who often prefer to present themselves in their social communities (i.e. their primary goal is usually not data minimisation). Furthermore, tools for evaluating the trustworthiness of social community partners will be needed. In the context of social communities, further research on lifelong privacy needs to be conducted. Due to the low costs and technical advances of media storage, masses of data can easily be stored and processed, and are hardly ever deleted or forgotten. Therefore the question remains how a “right to start over” can be enforced in future. Finally, legal and technical questions of digital heritage after one’s death (Who inherits my data/social community account, and how can the ownership be transferred?) still remain.
Successful attention to these issues, in conjunction with those discussed in the preceding sections, will enable significantly more flexible and friendly approaches to achieving security and privacy in digital business. Combined with the additional reassurance that users can obtain from risk mitigation and transfer options, the result will be a far more comprehensive foundation for trust in the related online services.

Acknowledgements. Thanks are due to the PrimeLife Activity 4 (HCI) project partners, particularly John Sören Pettersson, Erik Wästlund, Jenny Nilsson, Maria Lindström, Christina Köffel and Peter Wolkerstorfer, who contributed to the discussion of the design of the Trust Evaluation Function. Jenny Nilsson and Maria Lindström also conducted most of the usability tests for this function. We would also like to express our thanks to our colleagues Athanassios Yannakopoulos, Stefanos Gritzalis and Sokratis Katsikas, who have contributed to the work on privacy insurance contract modelling. Parts of the research leading to these results have received funding from the EU 7th Framework programme (FP7/2007-2013) for the project PrimeLife. The information in this document is provided "as is", and no guarantee or warranty is given that the information is fit for any particular purpose. The PrimeLife consortium members shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials subject to any liability which is mandatory due to applicable law.
References

1. Anderson, R.: Why Information Security is Hard – An Economic Perspective. In: 17th Annual Computer Security Applications Conference, New Orleans, Louisiana (2001)
2. Andersson, C., Camenisch, J., Crane, S., Fischer-Hübner, S., Leenes, R., Pearson, S., Pettersson, J.S., Sommer, D.: Trust in PRIME. In: Proceedings of the 5th IEEE Int. Symposium on Signal Processing and IT, Athens, Greece, December 18-21 (2005)
3. Benner, J., Givens, B., Mierzwinski, E.: Nowhere to Turn: Victims Speak Out on Identity Theft. CALPIRG/Privacy Rights Clearinghouse Report (May 2000)
4. Cavusoglu, H., Mishra, B., Raghunathan, S.: The effect of internet security breach announcements on shareholder wealth: Capital market reactions for breached firms and internet security developers. To appear in International Journal of Electronic Commerce (2004)
5. Fischer-Hübner, S., Pettersson, J.S., Bergmann, M., Hansen, M., Pearson, S., Casassa-Mont, M.: In: Acquisti, et al. (eds.) Digital Privacy – Theory, Technologies, and Practices. Auerbach Publications (2008)
6. Fischer-Hübner, S., Köffel, C., Wästlund, E., Wolkerstorfer, P.: PrimeLife HCI Research Report, Version V1, PrimeLife EU FP7 Project Deliverable D4.1.1 (February 26, 2009)
7. Furnell, S.M., Jusoh, A., Katsabas, D.: The challenges of understanding and using security: A survey of end-users. Computers & Security 25(1), 27–35 (2006)
8. Gordon, L., Loeb, M.: The Economics of Information Security Investment. ACM Transactions on Information and System Security 5(4), 438–457 (2002)
9. Günther, O., Spiekermann, S.: RFID and the perception of control: The consumer’s view. Communications of the ACM 48(9), 73–76 (2005)
10. Hansen, M.: Marrying transparency tools with user-controlled identity management. In: Proc. of Third International Summer School organized by IFIP WG 9.2, 9.6/11.7, 11.6 in cooperation with FIDIS Network of Excellence and HumanIT, Karlstad, Sweden, 2007. Springer, Heidelberg (2008)
11. Hansen, M.: Linkage Control – Integrating the Essence of Privacy Protection into Identity Management Systems. In: Cunningham, P., Cunningham, M. (eds.) Collaboration and the Knowledge Economy: Issues, Applications, Case Studies; Proceedings of eChallenges 2008, pp. 1585–1592. IOS Press, Amsterdam (2008)
12. Hedbom, H.: A survey on transparency tools for privacy purposes. In: Fourth FIDIS International Summer School 2008, in cooperation with IFIP WG 9.2, 9.6/11.7, 11.6. Springer, Heidelberg (2009)
13. Hildebrandt, M.: FIDIS EU Project Deliverable D 7.12: Behavioural Biometric Profiling and Transparency Enhancing Tools (March 2009), http://www.fidis.net
14. Johnston, J., Eloff, J.H.P., Labuschagne, L.: Security and human computer interfaces. Computers & Security 22(8), 675–684 (2003)
15. Köffel, C., Wästlund, E., Wolkerstorfer, P.: PRIME IPv3 Usability Test Report V1.2 (July 25, 2008)
16. Lacohee, H., Phippen, A.D., Furnell, S.M.: Risk and Restitution: Assessing how users establish online trust. Computers & Security 25(7), 486–493 (2006)
17. Lambrinoudakis, C., Gritzalis, S., Hatzopoulos, P., Yannacopoulos, A., Katsikas, S.: A formal model for pricing information systems insurance contracts. Computer Standards and Interfaces (indexed in ISI/SCI-E) 7(5), 521–532 (2005)
18. Leenes, R., Lips, M., Poels, R., Hoogwout, M.: User aspects of Privacy and Identity Management in Online Environments: towards a theoretical model of social factors. In: Fischer-Hübner, S., Andersson, C., Holleboom, T. (eds.) PRIME Framework V1 (ch. 9), June 2005, PRIME project Deliverable D14.1.a (2005)
19. Moitra, S., Konda, S.: The survivability of network systems: An empirical analysis, Carnegie Mellon Software Engineering Institute, Technical Report, CMU/SEI-200-TR-021 (2003)
20. Pearson, S.: Towards Automated Evaluation of Trust Constraints. In: Stølen, K., Winsborough, W.H., Martinelli, F., Massacci, F. (eds.) iTrust 2006. LNCS, vol. 3986, pp. 252–266. Springer, Heidelberg (2006)
21. Pettersson, J.S., Fischer-Hübner, S., Danielsson, N., Nilsson, J., Bergmann, M., Clauß, S., Kriegelstein, T., Krasemann, H.: Making PRIME Usable. In: SOUPS 2005 Symposium on Usable Privacy and Security, Carnegie Mellon University, Pittsburgh, July 6-8. ACM Digital Library (2005)
22. Pettersson, J.S., Fischer-Hübner, S., Bergmann, M.: Outlining Data Track: Privacy-friendly Data Maintenance for End-users. In: Proceedings of the 15th International Information Systems Development Conference (ISD 2006), Budapest, 31 August - 2 September 2006. Springer Scientific Publishers, Heidelberg (2006)
23. Pfitzmann, A., Hansen, M.: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management – A Consolidated Proposal for Terminology, Version v0.31 (February 15), http://dud.inf.tu-dresden.de/literatur/Anon_Terminology_v0.31.doc#_Toc64643839
24. The Platform for Privacy Preferences 1.1 (P3P1.1) Specification, W3C Working Group Note (November 13, 2006)
25. Riegelsberger, J., Sasse, M.A., McCarthy, J.D.: The Mechanics of Trust: A Framework for Research and Design. International Journal of Human-Computer Studies 62(3), 381–422 (2005)
26. Steinbrecher, S.: Enhancing multilateral security in and by reputation systems. In: Fourth FIDIS International Summer School 2008, in cooperation with IFIP WG 9.2, 9.6/11.7, 11.6. Springer, Heidelberg (2009)
27. Turner, C.W., Zavod, M., Yurcik, W.: Factors that Affect the Perception of Security and Privacy of E-commerce Web Sites. In: Proceedings of the Fourth International Conference on Electronic Commerce Research, Dallas, TX (November 2001)
28. Varian, H.R.: Systems reliability and free riding. Working Paper (2004)
29. Yannakopoulos, A., Lambrinoudakis, C., Gritzalis, S., Xanthopoulos, S., Katsikas, S.: Modeling Privacy Insurance Contracts and Their Utilization in Risk Management for ICT Firms. In: Jajodia, S., Lopez, J. (eds.) ESORICS 2008. LNCS, vol. 5283, pp. 207–222. Springer, Heidelberg (2008)
Evolution of Query Optimization Methods

Abdelkader Hameurlain and Franck Morvan

Institut de Recherche en Informatique de Toulouse IRIT, Paul Sabatier University, 118, Route de Narbonne, 31062 Toulouse Cedex, France
Ph.: 33 (0) 5 61 55 82 48/74 43, Fax: 33 (0) 5 61 55 62 58
[email protected], [email protected]
Abstract. Query optimization is the most critical phase in query processing. In this paper, we try to describe synthetically the evolution of query optimization methods from uniprocessor relational database systems to data Grid systems through parallel, distributed and data integration systems. We point out a set of parameters to characterize and compare query optimization methods, mainly: (i) size of the search space, (ii) type of method (static or dynamic), (iii) modification types of execution plans (re-optimization or re-scheduling), (iv) level of modification (intra-operator and/or inter-operator), (v) type of event (estimation errors, delay, user preferences), and (vi) nature of decision-making (centralized or decentralized control). The major contributions of this paper are: (i) understanding the mechanisms of query optimization methods with respect to the considered environments and their constraints (e.g. parallelism, distribution, heterogeneity, large scale, dynamicity of nodes), (ii) pointing out their main characteristics which allow comparing them, and (iii) explaining the reasons for which the proposed methods become very sophisticated.

Keywords: Relational Databases, Query Optimization, Parallel and Distributed Databases, Data Integration, Large Scale, Data Grid Systems.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 211–242, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

At present, most relational database application programs are written in high-level languages integrating a relational language. Relational languages generally offer a declarative interface (or a declarative language like SQL) to access the data stored in a database. Query processing involves three steps: decomposition, optimization and execution. The first step decomposes a relational query (a SQL query), using the logical schema, into an algebraic query. During this step, syntactic, semantic and authorization checks are performed. The second step is responsible for generating an efficient execution plan for the given SQL query from the considered search space. The third step consists in implementing the efficient execution plan (or operator tree) [51]. In this paper, we focus only on query optimization methods. We consider multi-join queries without “group” and “order by” clauses.

Work related to relational query optimization goes back to the 70s, and began mainly with the publications of Wong et al. [138] and Selinger et al. [112]. These papers motivated a large part of the database scientific community to focus their efforts on this subject. The optimizer’s role is to generate, for a given SQL query, an optimal (or close to optimal) execution plan from the considered search space. The optimization goal is to minimize response time and maximize throughput while minimizing optimization costs. The general problem of query optimization can be expressed as follows [41]: given a query q, a space of execution plans E, and a cost function cost(p) associated with the execution of a plan p ∈ E, find the execution plan p computing q such that cost(p) is minimum. An optimizer can be decomposed into three elements [41]: a search space [85] corresponding to the virtual set of all possible execution plans for a given query, a search strategy generating an optimal (or close to optimal) execution plan, and a cost model used to annotate the operator trees in the considered search space.

Because of the importance and the complexity of the query optimization problem [21, 75, 82, 103], the database community has made a considerable effort to develop approaches, methods and techniques of query optimization for various Database Management Systems (DBMS) (i.e. relational, deductive, distributed, object, parallel) [7, 9, 21, 26, 47, 52, 61, 62, 79, 82, 103, 125]. The quality of query optimization methods depends strongly on the accuracy and the efficiency of cost models [1, 42, 43, 66, 99, 141]. There are two types of query optimization approaches [27]: static and dynamic. For more than twenty years, most DBMSs have used the static optimization approach, which consists of generating an optimal (or close to optimal) execution plan and then executing it until termination. All the methods using this approach suppose that the values of the parameters used to generate the execution plan (e.g. sizes of temporary relations, selectivity factors, availability of resources) remain valid during its execution. However, this hypothesis is often unwarranted. Indeed, the values of these parameters can become invalid during the execution due to several causes [98]:

1. Estimation errors: the estimates of the sizes of the temporary relations and of the relational operator costs of an execution plan can be erroneous because of the absence, the obsolescence, or the inaccuracy of the statistics describing the data, or because of errors in the hypotheses made by the cost model. For instance, the cost model may wrongly assume dependence or independence between the attributes of a selection clause (e.g. town = ’Paris’ and country = ‘France’). These estimation errors are propagated through the rest of the execution plan. Moreover, [70] showed that the propagation of these errors is exponential in the number of joins.
2. Unavailability of resources: at compile-time, the optimizer does not have any information about the system state when the query will run, in particular about the availability of the resources to allocate (e.g. available memory, CPU load).
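The first cause can be illustrated numerically. The sketch below uses made-up selectivities for the two predicates of the example and compares the estimate obtained under an independence assumption with the value obtained when one predicate implies the other. The second function is only a simple multiplicative model showing why per-join errors can compound exponentially; it is an illustrative assumption, not the analysis of [70].

```python
def independent_selectivity(*selectivities: float) -> float:
    """Combined selectivity of a conjunction ASSUMING independent predicates."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def compounded_error_factor(per_join_factor: float, joins: int) -> float:
    """Error factor after `joins` joins if each join multiplies the
    estimation error by `per_join_factor` (simplified, hypothetical model)."""
    return per_join_factor ** joins


sel_town = 0.01     # hypothetical: 1% of the rows have town = 'Paris'
sel_country = 0.05  # hypothetical: 5% of the rows have country = 'France'

estimated = independent_selectivity(sel_town, sel_country)
actual = sel_town   # in this made-up data every 'Paris' row is a 'France' row
underestimate = actual / estimated

print(estimated, actual, underestimate)
print(compounded_error_factor(1.5, 4))
```

With these figures the optimizer would underestimate the result size of the selection by a factor of about 20 before any join is even considered.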
For the reasons given above, the execution plans generated by a static optimizer can be sub-optimal. To correct this sub-optimality, some recent research suggests improving the accuracy of the parameter values used when choosing the execution plan. A first solution consists in improving the quality of the statistics on the data by using previous executions [1]. This solution was used by [20] to improve the estimation accuracy of the operator selectivity factors and by [117] to estimate the correlation between predicates. The second solution, proposed by [80], concentrates on distributed queries. The optimizer generates an optimal (or close to optimal) execution plan, having deduced the data transfer costs and the cardinalities of temporary relations. In this solution, the query operators are executed on a subset of the tuples of the operands to estimate the data transfer costs and the cardinalities of temporary relations. In both solutions, the selected execution plan is executed until termination, whatever the changes in the execution environment.

As for the dynamic optimization approach, it consists in modifying sub-optimal execution plans at run-time. The main motivations for introducing ‘dynamicity’ into query optimization [27], particularly during the resource allocation process, are: (i) the wish to use information concerning the availability of resources, (ii) the exploitation of the relative quasi-exactness of parameter values, and (iii) the relaxation of certain hypotheses that are too drastic and unrealistic in a dynamic context (e.g. infinite memory). In this approach, several methods were proposed in different environments: uni-processor, distributed, parallel, and large scale [3, 4, 6, 7, 8, 12, 14, 15, 17, 18, 27, 30, 37, 48, 50, 56, 59, 72, 73, 74, 76, 87, 95, 98, 101, 102, 105, 106, 107, 140]. All these methods are able to detect the sub-optimality of execution plans and to modify these plans to improve their performance. They make the query optimization process more robust with respect to estimation errors and to changes in the execution environment. The rest of this paper is devoted to providing a state of the art concerning the evolution of query optimization methods in different environments (e.g. uni-processor, parallel, distributed, large scale).
For each environment, we try to describe synthetically some methods, and to point out their main characteristics [67, 98], especially the nature of decision-making (centralized or decentralized), the type of modification (re-optimization or re-scheduling), the level of modification (intra-operator and/or inter-operator), and the type of event (estimation errors, delay, user preferences). The major contributions of this paper are: (i) understanding the mechanisms of query optimization methods with respect to the considered environments and their constraints, (ii) pointing out their main characteristics which allow comparing them, and (iii) explaining the reasons for which the proposed methods become very sophisticated. This paper is organized as follows: firstly, in section 2, we introduce two main search strategies (enumerative strategies, random strategies) for uni-processor relational query optimization. Then, in section 3, we present a synthesis of some methods in a parallel relational environment by distinguishing the two-phase and one-phase approaches. Section 4 provides global optimization methods for distributed queries. Section 5 describes both types of dynamic optimization methods in data integration (mediation) systems: centralized and decentralized. Section 6 gives an overview of query optimization in large scale environments, particularly in data grid environments. Lastly, before presenting our conclusion, in section 7 we provide a qualitative analysis of the described optimization methods and point out their main characteristics which allow comparing them.
2 Uni-processor Relational Query Optimization

In uniprocessor relational systems, the query optimization process consists of two steps: (i) logical optimization, which consists in applying the classic transformation rules of algebraic trees to reduce the manipulated data volume, and (ii) physical optimization, whose roles are [90]: (a) determining an appropriate join method for each join operator by taking into account the size of the relations, the physical organization of the data, and access paths, and (b) generating the order in which the joins are performed [69, 84] with respect to a cost model. In this section, we focus on physical optimization methods. We begin by defining, characterizing, and estimating the size of the search space. Then, we present search strategies, which are based either on enumerative approaches or on random approaches. Finally, we synthesize some analyses and comparisons stemming from performance evaluations of the proposed strategies.

2.1 Search Space

2.1.1 Characteristics

In relational database systems [31, 120, 130], each query execution plan can be represented by a processing tree where the leaf nodes are the base relations and the internal nodes represent operations. Different tree shapes have been considered: left-deep tree, right-deep tree, and bushy tree. Fig. 1 illustrates tree structures of relational operators associated with the multi-join query R1 ⋈ R2 ⋈ R3 ⋈ R4.

[Figure: three join trees producing the result. Left-deep tree: (((R1 ⋈ R2) ⋈ R3) ⋈ R4). Right-deep tree: (R1 ⋈ (R2 ⋈ (R3 ⋈ R4))). Bushy tree: ((R1 ⋈ R2) ⋈ (R3 ⋈ R4)).]

Fig. 1. Tree shape
A search space can be restricted according to the nature of the execution plans and the applied search strategy. The nature of execution plans is determined according to two criteria: the shape of the tree structures (i.e. left-deep tree, right-deep tree and bushy tree) and the consideration of plans with Cartesian products. For queries with a large number of join predicates, the associated search space becomes too large and is difficult to manage. That is the reason why some authors [122, 123] chose to eliminate bushy trees. This reduced space is called the valid space. This choice rests on the claim that the valid space represents a significant portion of the search space and contains the optimal solution. However, this assertion was never validated. Others, such as [100], think that these methods decrease the chances of obtaining optimal solutions. Several examples [100] show the importance of the impact of this restrictive choice.
2.1.2 Search Space Size

The importance of the query shape¹ (i.e. linear, star or clique) and of the nature of the execution plans is due to their impact on the size of the search space. If we have N relations in a multi-join query, the question is how many execution plans can be built, taking into account the nature of the search space. The size of this space also varies according to the shape of the query. In this respect, [85, 124] proposed a table giving the lower and upper bounds of the search space size, taking into account its nature and the characteristics of the queries, which are: the type of query (i.e. repetitive, ad-hoc), the query shape, and the size of the query (i.e. simple, medium, complex). The results presented in [85, 124] point out the exponential growth of the number of execution plans with the number of relations. This shows the difficulty of managing a solution set which is sometimes very large. Therefore, the search strategy has to be adapted to the query characteristics.

2.2 Search Strategies

In the literature, we generally distinguish two classes of strategies for solving the join scheduling problem in query optimization:

- Enumerative strategies
- Random strategies.
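To give a feeling for the growth of the search space discussed in section 2.1.2, the sketch below evaluates two standard counting formulas. They are not the bounds tabulated in [85, 124]; they assume that Cartesian products are allowed (so the query shape plays no role) and that the left and right operands of a join are distinguished. Under these assumptions there are N! left-deep trees and (2(N-1))!/(N-1)! bushy trees over N relations.

```python
from math import factorial


def left_deep_count(n: int) -> int:
    """Number of left-deep join trees over n relations (one per permutation)."""
    return factorial(n)


def bushy_count(n: int) -> int:
    """Number of join trees of any shape over n relations:
    Catalan(n-1) tree shapes times n! leaf labellings = (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_count(n), bushy_count(n))
```

Already for ten relations the unrestricted space holds more than 17 billion trees, which is why exhaustive enumeration quickly becomes impractical.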
The description of the principles of search strategies relies on the generic search algorithms described in [84] and on the comparative study of the random algorithms proposed by [69, 70, 86, 122, 123].

2.2.1 Enumerative Strategies

These strategies are based on the generative approach. They use the principle of dynamic programming (e.g. the optimizer of System R). For a given query, the set of all possible execution plans is enumerated, which can lead to a search space that is too large to manage in the case of complex queries. They build execution plans from sub-plans already optimized, starting with all or part of the base relations of a query. Among the generated solutions, only the optimal execution plan is returned for execution. However, the exponential complexity of such strategies has led many authors to propose more efficient strategies. Thus, some enumerative strategies discard bad states by introducing heuristics (e.g. depth-first search with different heuristics [123]). Several strategies are described in [84].

2.2.2 Random Strategies

The enumerative strategies are inadequate for optimizing complex queries because the number of execution plans quickly becomes too large [85, 124]. To resolve this problem, random strategies are used. The transformational approach characterizes this kind of strategy. Several transformation rules (e.g. Swap, 3Cycle, join commutativity/associativity) were proposed [69, 70, 122], whose validity depends on the nature of the considered search space [86].

¹ The query shape indicates the way in which the relations are joined by means of predicates, as well as the number of referenced relations.
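The enumerative principle of section 2.2.1 (building plans from already optimized sub-plans) can be sketched with a small dynamic programming routine over subsets of relations. This is a simplified illustration and not the System R algorithm itself: it is restricted to left-deep orders, it allows Cartesian products, and it uses a hypothetical cost model (the sum of the estimated sizes of the intermediate results) with made-up cardinalities and join selectivities.

```python
from itertools import combinations


def best_left_deep_plan(cards, sel):
    """Return (cost, join order) of the cheapest left-deep plan.

    cards: dict relation name -> cardinality (hypothetical statistics)
    sel:   dict frozenset({a, b}) -> join selectivity; a missing pair
           means no predicate, i.e. a Cartesian product (selectivity 1.0)
    """
    names = sorted(cards)

    def size(subset):
        # Estimated cardinality of the join of all relations in `subset`.
        result = 1.0
        for r in subset:
            result *= cards[r]
        for a, b in combinations(sorted(subset), 2):
            result *= sel.get(frozenset((a, b)), 1.0)
        return result

    # best[subset] = (cost, order) of the cheapest plan joining `subset`.
    best = {frozenset([r]): (0.0, (r,)) for r in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            out = size(subset)
            candidates = []
            for r in combo:  # r is the relation joined last
                sub_cost, order = best[subset - {r}]
                candidates.append((sub_cost + out, order + (r,)))
            best[subset] = min(candidates)
    return best[frozenset(names)]


cards = {"R1": 1000, "R2": 100, "R3": 10}
sel = {frozenset(("R1", "R2")): 0.01, frozenset(("R2", "R3")): 0.1}
cost, order = best_left_deep_plan(cards, sel)
print(cost, order)
```

Each subset is optimized once and reused by all larger subsets, which is what distinguishes this approach from a naive enumeration of all N! orders, although the number of subsets still grows exponentially with N.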
Random strategies generally start with an initial execution plan which is iteratively improved by the application of a set of transformation rules. The start plan(s) can be obtained through an enumerative strategy like Augmented Heuristics. Two optimization techniques have been studied and compared extensively: Iterative Improvement and Simulated Annealing [68, 69, 70, 84, 122, 123]. The performance evaluation of these strategies is very hard because of the strong simultaneous influence of random parameters and factors. The main difficulty lies in the choice of these parameters (e.g. local/global minimum detection, algorithm termination criterion, initial temperature, termination criterion of the inner iteration). Indeed, the quality of execution and the optimization cost depend on the quality of this choice. After the tuning of the parameters, the comparison of the algorithms should determine the most efficient random algorithm for the optimization problem of complex queries. However, the results obtained by [122] and by [69] differ radically: for [122], the Iterative Improvement algorithm is better than Simulated Annealing, while for [69] the opposite holds (even if the conclusion of the latter remains more moderate). The parameters were determined through experiments with various alternatives in the case of [69], or by applying the methodology of factorial experiments [122]. An example of the use of these factorial experiments is given in [121].

2.3 Discussion

In [69, 70, 122, 123], the authors concentrated their efforts on the performance evaluation of the random algorithms Iterative Improvement and Simulated Annealing. However, the difference in their results underlines the difficulty of such an evaluation.
Indeed, for Swami and Gupta [122, 123], the Simulated Annealing algorithm is never superior to Iterative Improvement, whatever the time dedicated to optimization, while for Ioannidis and Kang [69, 70] it is better than the Iterative Improvement algorithm after some optimization time. In [69, 70], the authors try to explain this difference. First, the considered search space is restricted to left-deep trees in the case of Swami and Gupta [122, 123], while Ioannidis and Kang [69, 70] study the search space in its totality. In [70], the authors extend their work to the study of the shape of the cost function, with emphasis on the analysis of the linear and bushy spaces, and take into account only the results expected in this restricted portion of the search space in order to keep the comparison coherent. The second difference concerns the join method. Swami and Gupta choose the hash join method, while Ioannidis and Kang [69, 70] use two join methods: nested loop and sort merge join. They even integrate the hash join method to show that their results do not depend on the chosen method. Another difference, in the cost evaluation of the execution plan (CPU time for the former and I/O time for the latter), has no significant impact on the difference in results either. On the other hand, they intuitively think that the number of neighboring plans, the determination of the local minimum in the case of the Iterative Improvement algorithm, and the definition of the transformation rules to be applied are important elements in the explanation of this difference. For example, if the number of neighboring plans considered is not large enough, potential local minima can be missed, and plans can even be reported as local minima while they are not in reality. In that case, the results are skewed. The transformation rules applied by Swami
Evolution of Query Optimization Methods
and Gupta [122, 123] generate neighboring execution plans with a significant difference in cost [69]. Hence, the Simulated Annealing algorithm can no longer spend a long time in this zone of low-cost plans and then offers insufficient improvement. The Iterative Improvement algorithm, however, can easily reach a local minimum. The termination criterion of Simulated Annealing defined in [122] does not leave enough time for the acceptance probability to decrease sufficiently. Indeed, when the time limit is reached, the probability of accepting execution plans with a high cost is still too high, and the plan produced as optimal still has too high a cost.
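To make the two random techniques compared above more concrete, the following Python sketch applies the Swap transformation rule to left-deep join orders. It is only an illustration under our own assumptions: the cardinalities, the uniform selectivity, the toy cost function and all parameter values (number of restarts, initial temperature, cooling factor) are hypothetical and do not reproduce the settings of [69, 70, 122, 123].

```python
import math
import random

# Hypothetical cardinalities and uniform join selectivity (illustrative values only).
CARD = {"R1": 1000, "R2": 50, "R3": 2000, "R4": 10, "R5": 500}
SEL = 0.01

def cost(order):
    """Toy cost model: sum of the intermediate result sizes of a left-deep plan."""
    size = CARD[order[0]]
    total = 0.0
    for rel in order[1:]:
        size = size * CARD[rel] * SEL
        total += size
    return total

def swap_neighbor(order, rng):
    """Swap rule: exchange two relations chosen at random."""
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return new

def iterative_improvement(start, rng, restarts=5, tries=50):
    """Repeated descents from random start plans, accepting only cheaper neighbors."""
    best = list(start)
    for _ in range(restarts):
        current = list(start)
        rng.shuffle(current)
        fails = 0
        # A local minimum is only approximated: `tries` consecutive
        # non-improving neighbors end the descent.
        while fails < tries:
            cand = swap_neighbor(current, rng)
            if cost(cand) < cost(current):
                current, fails = cand, 0
            else:
                fails += 1
        if cost(current) < cost(best):
            best = current
    return best

def simulated_annealing(start, rng, temp=1000.0, cooling=0.9, inner=30, min_temp=0.01):
    """Accept costlier neighbors with probability exp(-delta / temp)."""
    current = list(start)
    best = list(start)
    while temp > min_temp:
        for _ in range(inner):
            cand = swap_neighbor(current, rng)
            delta = cost(cand) - cost(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = cand
                if cost(current) < cost(best):
                    best = list(current)
        temp *= cooling
    return best

rng = random.Random(0)
start = list(CARD)
print(cost(start), cost(iterative_improvement(start, rng)), cost(simulated_annealing(start, rng)))
```

The sketch also shows where the tuning difficulties discussed above arise: the `tries` bound decides what is reported as a local minimum, and stopping the annealing loop while `temp` is still high leaves a large probability of accepting expensive plans.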
3 Parallel Relational Query Optimization
Parallel relational query optimization methods [57] can be seen as an extension of the relational query optimization methods developed for centralized systems, integrating the parallelism dimension. Indeed, the generation of an optimal (or close to optimal) parallel execution plan is based either on a two-phase approach [60, 63] or on a one-phase approach [24, 86, 111, 142]. A two-phase approach consists of two sequential steps: (i) generation of an optimal sequential execution plan (i.e. logical optimization followed by physical optimization), and (ii) resource allocation to this plan. The latter step consists first in extracting the various sources of parallelism, and then in assigning resources to the operations of the execution plan while trying to meet the allocation constraints (i.e. data locality and the various sources of parallelism). In the one-phase approach, steps (i) and (ii) are packed into one integrated component [90]. The fundamental distinction between the two approaches is based on the query characteristics and the shape of the search space [57]. Among the proposals concerning parallel relational query optimization, few authors [55, 61, 79] have proposed a synthesis dedicated to parallel relational query optimization methods. Hasan et al. [61] briefly introduced what they consider the major issues to be addressed in parallel query optimization. The issues tackled in [79] mainly include the placement of data in memory, concurrent access to data, and some algorithms for parallel query processing, which are restricted to parallel joins. As for [55], the authors describe, in a very synthetic way, data placement, static and dynamic query optimization methods, and the accuracy of the cost model. Nevertheless, they do not show how the two optimization approaches can be compared, nor how the appropriate optimization approach can be chosen. Last year, Taniar et al.
[126] presented the latest principles, methods and techniques of parallel query processing in their book. The rest of this section provides an overview of some static and dynamic query optimization methods in a parallel relational environment, distinguishing the two-phase and one-phase approaches [57].
3.1 Static Parallel Query Optimization Methods
In this sub-section, we describe some one-phase and two-phase optimization strategies for parallel queries in a static context.
3.1.1 One-Phase Optimization
In a one-phase approach, Schneider et al. [111] propose a parallel algorithm to process a query composed of N joins for each search space shape (i.e. left-deep tree, right-deep tree and bushy tree, cf. Fig. 1). The authors consider two hash join methods: the simple hash join and the hybrid hash join. For each search space shape, [111] reports the memory requirements, the potential scheduling, and the capacity to exploit the different forms of parallelism. The study covers the case where memory is unlimited and the more realistic case where memory is limited. In the first case, the right-deep tree is the best suited to exploiting parallelism. However, this structure is no longer the best when memory is limited. Nevertheless, several strategies make it possible to exploit the capabilities of right-deep trees when memory is limited. The strategy named "Static Right Deep Scheduling" [111] consists in cutting the right-deep tree into several separate sub-trees such that the sum of the sizes of all the hash tables of a sub-tree fits in memory. The temporary results of the execution of sub-trees T1, T2, …, Tn are stored on disk. The drawback of this strategy is that the number of sub-trees increases with the number of base relations that cannot be held in memory. Hence, this method shortens the pipeline chain and increases the response time. Two methods were proposed: one based on segmented right-deep trees [24], and the other on zigzag trees [142]. The objective of both methods is to avoid exploring the bushy tree search space and thus to simplify the optimization process.
3.1.2 Two-Phase Optimization
In the two-phase approach, Hasan et al. [23, 60] propose several scheduling strategies for pipelined operators. To improve the response time, they develop an execution model ensuring the best trade-off between parallel execution and communication overhead.
Several scheduling algorithms (i.e. processor allocation) are then proposed, inspired by the LPT (Largest Processing Time) heuristic. These algorithms exploit pipeline and intra-operation (partitioned) parallelism. The authors first propose scheduling algorithms exploiting only pipeline parallelism (POT, Pipelined Operator Tree scheduling), then show how to extend these algorithms to take into account intra-operation parallelism and communication costs. The scheduling principle of the POT is decomposed into several steps [23]: (i) generation of a monotone operator tree [60] from the operator tree, (ii) fragmentation, which consists in cutting the monotone tree into a set of fragments, and (iii) scheduling, which consists in assigning processors to fragments. The main difficulty lies in determining the number of fragments and the size of each fragment while ensuring the best trade-off between parallel execution and communication overhead. The works of Garofalakis et al. [44, 45] can be seen as an elegant extension of the proposals of [23, 41, 60]. Indeed, [44, 45] take into account the fact that parallel query execution requires the allocation of several resource types. They also introduce an original way to solve this resource allocation problem through a simultaneous scheduling (e.g. parallelism extraction) and mapping method. First, [44] presents a static scheduling and mapping strategy on a shared-nothing parallel architecture, considering the allocation of several “preemptive” resources (e.g. processors). Next, the authors extend their work in [45] to hybrid multi-processor architectures. This
extension consists mainly in taking into account "non-preemptive" resources (e.g. memory) in their scheduling and mapping method.
3.2 Dynamic Parallel Query Optimization Methods
The main motivations for introducing ‘dynamicity’ into query optimization [27], in particular into resource allocation strategies, are: (i) the will to use information concerning the availability of the resources to allocate, (ii) the exploitation of the relative quasi-exactness of the metrics, and (iii) the relaxation of certain hypotheses that are too drastic and unrealistic in a dynamic context. This sub-section describes in a synthetic way some one-phase and two-phase parallel query optimization strategies. It should be pointed out that the proposed resource allocation methods become very complex and sophisticated in such a dynamic context.
3.2.1 One-Phase Optimization
In this approach, most of the work points out the importance of determining the parallelism degree of the join operation and the resource allocation method (e.g. processors and memory). It is therefore interesting to synthesize some methods proposed in the literature, mainly [19, 81, 96, 107]. In their most recent work, Brunie et al. [18, 19, 81] are not only interested in multi-join processing in a multi-user context, but also consider the current system state in terms of multi-resource contention. [18] studied, more generally, relational query optimization on a shared-nothing architecture. The optimizer MPO (Modular Parallel query Optimizer) [81] dynamically determines the intra-operation parallelism degree of the join operators of a bushy tree. The authors suggest a dynamic resource allocation heuristic in four steps applied in the following order: (i) preservation of data locality (or “data localization”), (ii) memory size, (iii) I/O reduction, and (iv) serialization of the operations of a bushy tree. The proposals of Mehta et al. [96] and Rahm et al.
[107] were developed independently of the one-phase and two-phase approaches. However, since their proposals are very representative and describe relevant and original solutions to the problems identified above (i.e. determination of the parallelism degree and resource allocation methods), we chose to include them in the one-phase approach. Mehta et al. [96] propose four algorithms (Maximum, MinDp, MaxDp, and RateMatch) to determine the join parallelism degree independently of the initial data placement. The originality of the RateMatch algorithm is that it tries to match the rate at which an operator produces result tuples with the rate at which the next operator consumes them. The authors then describe six alternative methods for allocating processors to the clones of a single join operator. They are based on heuristics such as random or round-robin strategies, and on a model taking into account the effect of resource contention. Rahm et al. [107], who extend the work of [95], tackle the problem of dynamic workload balancing for several queries, each composed of a single hash join, on a shared-nothing architecture. The intra-operation parallelism degree of a join and the choice of the processors executing the join are determined in an “integrated” way (i.e. in a single step) by considering the current system state. This state is characterized by the “bottleneck” resources: CPU, memory, and disk.
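The rate-matching idea mentioned above can be illustrated by a very small Python sketch. It is not the RateMatch algorithm of [96]: we simply assume, for illustration, that the consumption rate of an operator grows linearly with the number of processors allocated to it, and the function name and parameters are hypothetical.

```python
import math

def rate_match_degree(producer_rate, consumer_rate_per_processor, max_processors):
    """Smallest parallelism degree whose combined consumption rate reaches
    the producer's output rate, capped by the available processors.

    Assumption (ours): consumption scales linearly with the degree.
    Rates are expressed in tuples per second.
    """
    if producer_rate <= 0:
        return 1
    needed = math.ceil(producer_rate / consumer_rate_per_processor)
    return max(1, min(needed, max_processors))

# A producer delivering 1000 tuples/s, each consumer clone absorbing 300 tuples/s.
print(rate_match_degree(1000, 300, 8))
```

Under this linear assumption, allocating fewer processors makes the consumer the bottleneck, while allocating more leaves processors idle waiting for input.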
3.2.2 Two-Phase Optimization
XPRS adaptive scheduling method
In the XPRS system (eXtended Postgres on Raid and Sprite) [118], implemented on a shared-memory parallel architecture, Hong [63] proposes an adaptive scheduling method for the fragments stemming from the best sequential execution plan, represented by a bushy tree. Fragments are used as the unit of parallel execution and are also called tasks in this sub-section. The adaptive scheduling algorithm is based on the following three elements: (i) classification of tasks as “IO-bound” or “CPU-bound”, (ii) a method for computing the IO-CPU balance point of two tasks, and (iii) a mechanism for dynamically adapting the parallelism degree of a task. The strategy proposed by [63] consists in finding a task scheduling which maximizes the use of the resources (i.e. processors and disks) and thus minimizes the response time. For that purpose, [63] defines two types of tasks: IO-bound tasks (limited by input/output) and CPU-bound tasks (limited by the number of processors). To maximize resource utilization (e.g. when one of the two tasks ends, part of the resources remains unused), [63] proposes a method for dynamically adapting the parallelism degree of a task according to the implemented distribution methods (i.e. round-robin, interval). This method is used in the adaptive scheduling so that the system always works at the IO-CPU balance point.
Dynamic re-optimization methods of sub-optimal execution plans
In Kabra et al. [77], whose idea is close to that of Brunie et al. [18], the authors propose a dynamic re-optimization algorithm which detects and corrects the sub-optimality of the execution plan produced by the optimizer at compile time. This algorithm is implemented in the Paradise system [33], which is based on the static optimizer OPT++ [78].
The authors show that the sub-optimality of an execution plan can result in: (i) a poor join scheduling, (ii) an inappropriate choice of join algorithms, or (iii) a poor resource allocation (CPU and memory). These three problems can be caused by erroneous or obsolete cost estimations, or by a lack of information about the system state that is necessary for static optimization. The basic idea of this algorithm is to collect statistics at some key points during query execution. The collected statistics correspond to the real values (observed during execution) of quantities whose estimation is subject to error at compile time (e.g. the size of a temporary relation). These statistics are used to improve the resource allocation or to change the execution plan of the remainder of the query (i.e. the part of the query which has not been executed yet). The re-optimization process is engaged only when estimation errors actually make the execution plan sub-optimal. Indeed, if the new, improved estimations differ significantly from those supplied by the static optimizer, a new execution plan for the remainder of the query is generated, provided that it brings a minimum benefit.
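The two conditions stated above (a significant estimation error, then a minimum benefit) can be sketched as a simple decision function. This is only an illustration of the principle: the function, its parameters and the two threshold values are hypothetical and are not taken from [77].

```python
def should_reoptimize(estimated_size, observed_size,
                      remaining_cost_current, remaining_cost_new,
                      reopt_overhead,
                      error_threshold=0.5, min_benefit=0.05):
    """Decide whether the remainder of the query should get a new plan.

    error_threshold and min_benefit are illustrative values (our assumption).
    """
    # Condition 1: the observed statistic must differ significantly
    # from the compile-time estimate.
    error = abs(observed_size - estimated_size) / max(estimated_size, 1)
    if error < error_threshold:
        return False
    # Condition 2: the new plan, including the cost of re-optimizing,
    # must bring a minimum relative benefit over the current plan.
    gain = remaining_cost_current - (remaining_cost_new + reopt_overhead)
    return gain > min_benefit * remaining_cost_current

# A temporary relation estimated at 1000 tuples turns out to hold 5000.
print(should_reoptimize(1000, 5000, 500.0, 300.0, 10.0))
```

Keeping the overhead of re-optimization inside the benefit test reflects the point made above: a plan change is engaged only when it pays off for the part of the query that remains to be executed.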
4 Distributed Query Optimization
The main motivation of distributed databases is to present data distributed over a LAN (Local Area Network) or a WAN (Wide Area
Network) in an integrated way to a user. One of the objectives is to make data distribution transparent to the user. In this environment, the main steps of the evaluation process of a distributed query are data localization and optimization. The optimization process [82, 103] takes into account the particularities of the network. Indeed, contrary to the interconnection network of a multi-processor, such networks have a lower bandwidth and a higher latency. For example, with a satellite connection the latency exceeds half a second. These particularities weigh significantly on the cost of a distributed execution plan, on which the authors of [10, 103] focus. They assume that the communication cost is far higher than the I/O and CPU costs. Hence, many works focus on the communication cost to the detriment of the CPU and I/O costs. At present, with the improvement of network performance, the cost functions used by the optimization process take into account both processing time (i.e. CPU and I/O) and communication time. The optimization process of a distributed query is composed of two steps [103]: a global optimization step and a local optimization step. Global optimization consists of: (i) determining the best execution site for each local sub-query considering data replication, (ii) finding the best inter-site operator scheduling, and (iii) placing these operators. Local optimization optimizes the local sub-queries on each site involved in the query evaluation. Inter-site operator scheduling and placement are very important in a distributed environment because they make it possible to reduce the data volumes exchanged on the network and consequently the communication costs. Hence, the accuracy of the estimated sizes of the temporary relations that must be transferred from one site to another is important. In the rest of this section, we present global optimization methods for distributed queries.
They differ in the objective function used by the optimization process and in the type of approach: static or dynamic.
4.1 Static Distributed Query Optimization
In distributed environments, research on static query optimization focuses mainly on the optimization of inter-site communication costs. The idea is to minimize the data volume transferred between sites. From this perspective, there are two methods to process inter-site joins [103]: (i) the direct join, which moves one or both relations, and (ii) the join based on semi-join. The latter alternative consists in replacing a join, whatever the class of algorithm implementing it, by the combination of a projection and a semi-join, followed by a join [25]. The cost of the projection can be minimized by encoding the result [133]. The benefit of a join based on semi-join with respect to a direct join is proportional to the join operator selectivity [134]. According to the relation profiles (e.g. relation size), the optimizer chooses the approach which minimizes the data volume transferred between sites. For example, the SDD-1 system [10] often uses the join based on semi-join, whereas System R* [113] avoids using it. Indeed, the use of a join based on semi-join can increase the query processing time. Mackert and Lohman [91] showed the importance of the local processing cost in the performance of a distributed query. Furthermore, considering the join based on semi-join in the optimizer significantly increases the size of the search space. Indeed, in a query, there are several possible joins based on semi-join for a given relation. The number of joins based on semi-join is an exponential function
of the number of temporary relations resulting from local sub-queries [103]. This explains why many optimizers do not use this alternative. The quality of a distributed execution plan generated by the global optimization process depends on the accuracy of the estimations used. However, it is difficult to estimate the parameters (e.g. relation profile, resource availability) used by the optimizer. Generally, the cost models used assume processor and network uniformity: all processors and network connections are assumed to have the same speed and bandwidth, as in a parallel environment. Furthermore, they take into account neither the workload of the processors nor that of the network. Based on these observations, several works [80, 119] try to improve the accuracy of these parameters. To this end, the Mariposa distributed DBMS [119] relies on an economic model in which querying servers buy data from data servers. Each query Q, which is decomposed into several sub-queries Q1, Q2, …, QN, is administered by a broker. The broker obtains bids for a sub-query Qi from various sites. After choosing the best bid, the broker notifies the winning site. The advantage of this method is that it relies on the local cost models of every DBMS which can participate in the query evaluation. It thus handles processor heterogeneity and takes the workload of the processors into account. In [80], the optimizer generates an optimal (or close to optimal) execution plan after having deduced the data transfer costs and the cardinalities of temporary relations. In this solution, the operators of a query are executed on a subset of the tuples of the operands to estimate the data transfer costs and the cardinalities of temporary relations. Once these parameters have been deduced, an optimal execution plan is generated and executed until termination, whatever the changes in the execution environment.
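The trade-off between the direct join and the join based on semi-join discussed in this sub-section can be illustrated by comparing the data volumes that each method transfers. The Python sketch below is a simplified illustration under our own assumptions: relation R is stored on one site and S on another, the join is computed at the site of S, volumes are counted in attribute values, and local processing costs are ignored. The function name and the sample relations are hypothetical.

```python
def transfer_volumes(r, s, r_key, s_key):
    """Return (direct, semijoin) transfer volumes, counted in attribute values.

    r, s: lists of tuples; r_key, s_key: positions of the join attribute.
    """
    # Direct join: every tuple of R is shipped to the site of S.
    direct = sum(len(t) for t in r)
    # Join based on semi-join:
    # 1. project S on the join attribute and ship the projection to the site of R,
    s_keys = {t[s_key] for t in s}
    # 2. reduce R with the semi-join, and ship only the matching tuples back,
    reduced_r = [t for t in r if t[r_key] in s_keys]
    # 3. the final join is then computed at the site of S (no further transfer).
    semijoin = len(s_keys) + sum(len(t) for t in reduced_r)
    return direct, semijoin

r = [(i, "x", "y") for i in range(1000)]
selective_s = [(i, "z") for i in range(10)]       # few tuples of R match
unselective_s = [(i, "z") for i in range(1000)]   # every tuple of R matches

print(transfer_volumes(r, selective_s, 0, 0))
print(transfer_volumes(r, unselective_s, 0, 0))
```

With the selective relation, the semi-join program ships far less data than the direct join. When every tuple of R matches, it ships the projection in addition to the whole of R, which is one reason why an optimizer may prefer the direct join.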
4.2 Dynamic Distributed Query Optimization
One solution to correct the sub-optimality of an execution plan consists in changing the operation scheduling at run-time. In the multi-database MIND system, Ozcan et al. [102] proposed strategies for the dynamic re-scheduling of inter-site operators (e.g. join, union) to react to the inaccuracy of estimations. An inter-site operator can be executed as soon as two sub-queries executed on different sites have produced their results. These strategies use the partial results available at run-time to define the scheduling of the inter-site operators. The query processing is done in two steps [37]:
1. Compilation. During this step, a global query is decomposed into local sub-queries. The sub-queries are sent to different sites to be executed in parallel.
2. Dynamic scheduling. This step defines a dynamic scheduling of the operations consuming the results of the sub-queries sent to the sites. When a sub-query produces its result, a threshold is associated with the result. This threshold is used to determine whether the result must be consumed immediately to execute a join with another result already available, or whether its consumption will be delayed while waiting for another result which is not yet available. The threshold associated with a result is calculated according to the costs and selectivity factors of all the joins connected to this result.
This scheduling strategy reduces the uncertainty of estimations since it is based on the execution times of local sub-queries. Moreover, it avoids the need to know the cost models of the various databases.
5 Query Optimization in Data Integration Systems
Data integration systems [22, 53, 88, 127, 128, 136] extend the distributed database approach to multiple, autonomous, and heterogeneous data sources by providing uniform access (the same read-only query interface to all sources). We use the term data source to refer to any data collection which its owner wishes to share with other users. The main differences from the distributed database approach are the number of data sources and their heterogeneity. The distributed database approach addresses on the order of tens of distributed databases, while the data integration approach can scale up to hundreds of data sources [104]. In addition to the hardware heterogeneity (i.e. CPU, I/O, network) due to the environment, the data sources are heterogeneous in their data structure (e.g. relational or object). Moreover, the software infrastructures allowing access to the data sources have different query processing capabilities. For example, a phone book service which requires the name of a person to return a phone number is a data source with restricted access. In this context, we need new operators in order to access data sources and, for instance, to join two relations. Consider an execution plan that needs a relational join between the Employee (empId, name) and Phone (name, phoneNumber) tables on their name attribute. With a standard join, both of the following fragments are valid, since join is a commutative operator: Join (Employee, Phone) and Join (Phone, Employee). However, with restricted sources, the second fragment Join (Phone, Employee) on the name attribute is not valid, since Phone requires the value of the name attribute in order to return the value of phoneNumber. Consequently, we need a new join operator which is asymmetric in nature, known as the dependent join, Djoin [46].
The asymmetry of this operator restricts the search space and raises the issue of capturing valid (feasible) execution plans [92, 93, 139]. In an environment with hundreds of data sources connected through the Internet, it is even more difficult to estimate, at compile time, the availability of resources such as network, CPU or memory. Hence, many authors propose dynamic optimization strategies to correct the sub-optimality of execution plans at runtime. The methods proposed initially are centralized [3, 4, 7, 14, 15, 32, 74, 109]. A dynamic optimization method is said to be centralized if there is a unique process, generally the optimizer, which is in charge of supervising, controlling and modifying the execution plans. This process can rely on other modules producing the information necessary for modifying and controlling an execution plan. On the other hand, two phenomena occur frequently in this environment and are significant: initial delays before data start arriving, and bursty data arrivals thereafter [72]. In order to react to these unpredictable data arrival rates, several authors propose to decentralize the control inside the operators [72, 131, 132]. The idea is to produce part of the result as quickly as possible with the tuples that have already arrived, while waiting for the remaining operand tuples.
In the rest of this section, we first present the operators specific to data integration, then we describe both types of dynamic optimization methods: centralized and decentralized.
5.1 Operators for Restricted Source Access
Consider the execution plan presented previously that needs a relational join between the Employee (empId, name) and Phone (name, phoneNumber) tables. The tables can be modeled with the concept of ‘binding patterns’ introduced in [108]. Binding patterns can be attached to a relational table to describe its access restrictions, which are due to confidentiality or performance reasons. A binding pattern for a table R(X1, X2, . . . , Xn) is a partial mapping from {X1, X2, . . . , Xn} to the alphabet {b, f} [93]. For the attributes mapped to ‘b’, values must be supplied in order to get information from R, while the attributes mapped to ‘f’ do not require any input in order to return tuples from R. If all the attributes of R are mapped to ‘f’, then it is possible to get all the tuples of R without any restriction (e.g. with a relational scan operator). The binding patterns of the tables of our example are as follows: Employee (empId^f, name^f) and Phone (name^b, phoneNumber^f). This means that the Employee table is ready to return the values of empId and name, while the Phone table can give the phoneNumber only if the value of the name attribute is known. The regular set of relational operators is insufficient to answer queries in the presence of restricted sources. Although we can model the restricted sources with the formalism of ‘binding patterns’, the access restrictions of the sources prevent us from using the usual query processing operators, such as relational scan and relational join. In the example, in order to get the phoneNumber we have to give the values of the name attribute. So we need a new scan operator which is able to deal with restricted sources.
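The binding patterns of the Employee and Phone tables, and their effect on the validity of an execution plan, can be sketched as follows. This Python fragment is only an illustration under our own assumptions: attributes are matched by name, the function names and the sample rows are hypothetical, and the feasibility test is a simplified left-to-right check rather than the method of [92, 93, 139].

```python
# Binding patterns of the running example: 'b' = bound, 'f' = free.
PATTERNS = {
    "Employee": {"empId": "f", "name": "f"},
    "Phone": {"name": "b", "phoneNumber": "f"},
}

# Hypothetical content of the Phone source.
PHONE_ROWS = [
    {"name": "Ann", "phoneNumber": "555-0100"},
    {"name": "Bob", "phoneNumber": "555-0101"},
]

def access(rows, pattern, **bindings):
    """Return the matching rows, refusing the call when a bound attribute has no value."""
    missing = [a for a, mode in pattern.items() if mode == "b" and a not in bindings]
    if missing:
        raise ValueError(f"values required for bound attribute(s): {missing}")
    return [r for r in rows if all(r[a] == v for a, v in bindings.items())]

def is_feasible(order, patterns):
    """Check a left-to-right join order: each table may only be accessed once
    all its bound attributes are produced by the tables placed before it."""
    available = set()
    for table in order:
        bound = {a for a, mode in patterns[table].items() if mode == "b"}
        if not bound <= available:
            return False
        available |= set(patterns[table])
    return True

print(is_feasible(["Employee", "Phone"], PATTERNS))   # Employee supplies name to Phone
print(is_feasible(["Phone", "Employee"], PATTERNS))   # Phone has no value for name
print(access(PHONE_ROWS, PATTERNS["Phone"], name="Ann"))
```

The check makes the asymmetry explicit: the order (Employee, Phone) is accepted because Employee provides the values of name, whereas (Phone, Employee) is rejected.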
We call this new scan operator DAccess, where D indicates its dependency on the values of the input attribute(s). While the relational scan operator always returns the same result set, the DAccess operator returns different sets depending on its input set. The formal semantics of DAccess is as follows: consider a table R(X^b, Y^f) and let χ be a set of values for X. Then DAccess(R(X^b, Y^f))_χ = σ_{X∈χ}(R(X, Y)) [93]. We noted that, to perform the join between Employee (empId^f, name^f) and Phone (name^b, phoneNumber^f), we need a new join operator known as the dependent join [46], written Djoin here. The dependent join is represented as T ← Scan(R1(U^f, V^f)) Djoin_{V=X} DAccess(R2(X^b, Y^f)). The hash dependent join consists in building a hash table from R1 while, at the same time, retrieving the distinct values of the attribute(s) V and storing them in a table P. P is given to the DAccess operator to compute R2’ = σ_{X∈P}(R2(X, Y)). Then the hash table is probed with R2’ to compute the result.
5.2 Centralized Dynamic Optimization Methods in Data Integration Systems
In this sub-section, we present some dynamic optimization methods and techniques where the decision-making is centralized. We classify these methods according
to the modification level of execution plans. This modification can take place either at the intra-operator level or at the inter-operator level.
5.2.1 Modification of Execution Plans on the Intra-operator Level
The sub-optimality of execution plans can be corrected during the execution of an operator (intra-operator). With this objective, two approaches were proposed: the first one, named Eddy [7], is based on the routing of tuples, and the second one is based on the dynamic partitioning of data [74]. Avnur and Hellerstein [7] proposed a query processing mechanism named Eddy which continuously updates the execution order of operators in order to adapt to changes in the execution environment. An Eddy can be considered as a router of tuples positioned between a number of data sources and a set of operators. Each operator must have one or two input queues to receive the tuples sent by the Eddy and an output queue to return the result tuples to the Eddy. The tuples received by an Eddy are redirected towards the operators in different orders. Thus, the scheduling of operators is encapsulated in the dynamic routing of tuples. The key point in Eddy is the routing of tuples: the tuple routing policy must be efficient and intelligent in order to minimize the query response time. For that purpose, several authors [32, 109] suggest extending the Eddy mechanism to improve the quality of the routing. Dynamic data partitioning was proposed by Ives et al. [74] to correct the sub-optimality of execution plans. In this method, a set of execution plans is associated with each query; these plans are executed either in parallel or in sequence on separate data partitions. The execution plan of a query is constantly supervised at runtime, and it can be replaced by a new plan when the current plan is considered to be sub-optimal. The tuples processed by each plan used represent a data partition.
When an execution plan is replaced, a new data partition is produced. During query execution, each plan used produces a part of the total result from its associated data partition. The union of the tuples produced by the various plans used provides only part of the total result. Thus, to compute the final result of the query, the results of all the combinations of the various data partitions must also be computed. This method is similar to Eddy [7]. But contrary to Eddy, which uses local routing decisions, this method is based on more global information to generate the new plans. The main difference is that the decision to suspend an execution plan or to replace it by another one is made by the optimizer.
5.2.2 Modification of Execution Plans on the Inter-operator Level
One solution to correct the sub-optimality of execution plans consists in changing the operation scheduling at runtime. The works of Amsaleg et al. [3] take into account the delays in data arrival rates. They identify three types of delays: (i) initial delay, which occurs before the arrival of the first tuple, (ii) bursty arrival, where data arrive in bursts, each followed by a long period without arrival, and (iii) slow delivery, where data arrive regularly but slower than normal. To deal with these delays, two methods were proposed by Amsaleg et al. [4] and by Bouganim et al. [14, 15].
The technique of query scrambling [3, 4] was proposed to handle the blocking caused by delays in data arrival rates. It tries to mask these delays by executing other portions of the execution plan until the delays end. Query scrambling handles the initial delay and the bursty arrival in two phases [3]:
1. Re-scheduling: this phase is invoked as soon as a delay is detected. It begins by separating the relational operators of an execution plan into two disjoint sets: (i) the set of blocked operators, which contains all the ancestors of unavailable operands, and (ii) the set of executable operators, which contains the remaining operators. Then, a maximum executable sub-tree is extracted from the set of executable operators. This maximum sub-tree is executed and its intermediate result is materialized.
2. Synthesis: this phase is invoked if the set of executable operators is empty and the set of blocked operators is not empty. Contrary to the re-scheduling phase, the synthesis phase can significantly change the execution plan by adding new operators and/or by removing existing operators. The synthesis phase starts with the construction of a graph of the joins which are ready to be executed. Then, a join is processed and its result is materialized. The synthesis phase ends when all delays have ended, or when the graph is reduced to a single node or to several nodes without join predicates.
Query scrambling assumes that an execution plan is executed without taking into account the delays in data arrival rates during plan execution. To address this, Bouganim et al. [14, 15] proposed a strategy in which the available memory and the data arrival rates are constantly monitored. This information is used to produce a new scheduling between the various fragments of the execution plan or to re-optimize the remainder of the query.
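The re-scheduling phase of query scrambling described above can be illustrated by a minimal sketch. It is not the implementation of [3, 4]: the `Node` class, the function names, and the choice of taking the sub-tree with the most operators as the "maximum" one are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical plan node: a leaf is a data source, an inner node an operator."""
    name: str
    children: list = field(default_factory=list)

    @property
    def is_source(self):
        return not self.children


def partition_operators(root, delayed_sources):
    """Split operator names into (blocked, executable).

    An operator is blocked when it is an ancestor of a delayed source.
    """
    blocked, executable = set(), set()

    def visit(node):
        # Returns True when the sub-tree under `node` contains a delayed source.
        if node.is_source:
            return node.name in delayed_sources
        hits = [visit(child) for child in node.children]  # visit every child
        (blocked if any(hits) else executable).add(node.name)
        return any(hits)

    visit(root)
    return blocked, executable


def subtree_size(node):
    """Number of operators in the sub-tree rooted at `node`."""
    if node.is_source:
        return 0
    return 1 + sum(subtree_size(child) for child in node.children)


def maximal_executable_subtree(root, delayed_sources):
    """Return the largest sub-tree containing only executable operators, or None."""
    _, executable = partition_operators(root, delayed_sources)
    candidates = []

    def collect(node):
        if node.is_source:
            return
        if node.name in executable:
            candidates.append(node)  # the whole sub-tree is free of delayed sources
            return
        for child in node.children:
            collect(child)

    collect(root)
    return max(candidates, key=subtree_size, default=None)


# Plan J3(J1(A, B), J2(C, D)) where source A is delayed.
plan = Node("J3", [
    Node("J1", [Node("A"), Node("B")]),
    Node("J2", [Node("C"), Node("D")]),
])
blocked, executable = partition_operators(plan, {"A"})
chosen = maximal_executable_subtree(plan, {"A"})
```

In this example J1 and J3 are blocked because they are ancestors of A, so J2 is the sub-tree that would be executed and materialized. When no executable operator remains, the function returns `None`, which corresponds to the situation in which the synthesis phase is invoked.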
The paper of Ives et al. [72] described a dynamic optimization method which is able to deal with the majority of the changes in the execution environment (delays, errors and unavailable memory). This method interleaves the phases of optimization and execution and uses specific dynamic operators. In this method, the optimizer transforms a query into an annotated execution plan [77] and generates the associated rules of Event-Condition-Action type. These rules determine the behavior of the execution plan according to the changes at runtime. When events occur (e.g. delay, memory unavailable), they check certain conditions (e.g. comparison of the sizes of the current temporary relations with those estimated during compilation) and start actions (e.g. memory re-allocation, re-scheduling or re-optimization).
5.3 Decentralized Dynamic Optimization Methods in Data Integration Systems
The decentralized dynamic optimization methods correct the sub-optimality of execution plans by decentralizing the control. The conventional hash join [16] algorithm requires the reception of all tuples of the first operand to build the hash table before beginning the probe step. Thus, the time to produce the first tuple can be long if: (i) the operands are large, or (ii) the data arrival rate is irregular. Contrary to the conventional hash join, the double hash join (DHJ) introduced by Ives et al. [72] builds a hash table for each operand. When a tuple arrives, it is first inserted into the associated hash table. Then, it is used to probe the other hash table. If the probe step produces result tuples, then these tuples are immediately delivered. DHJ was proposed in the TUKWILA project [72] to deal with the problems of the conventional hash join in the context of data integration: (i) the production time of the first tuple is minimized, (ii) the optimizer does not need to know the sizes of the operands in order to choose the operand used to build the hash table, and (iii) it masks the slow arrival rate of tuples from one operand by processing the tuples of the other operand. However, DHJ requires both hash tables to be maintained in memory. This can limit the use of DHJ with large operands or with queries consisting of several joins. To solve this problem, parts of the hash tables residing in memory are moved to secondary storage: when the memory becomes saturated, a partition of one of the two tables is chosen to be moved to secondary storage.
DHJ reduces the time necessary to produce the first result tuple. Moreover, it makes it possible to continue producing result tuples in spite of the unavailability of either one of the two operands. However, it can lead to poor performance if the tuple production of both operands is blocked. For that reason, the Xjoin operator was proposed by Urhan and Franklin [131]. When Xjoin detects that the tuples of both operands are unavailable, the tuples of a partition residing in secondary storage are joined with the tuples of the same partition of the second operand residing in memory. To accelerate the production of result tuples, it is useful to define scheduling mechanisms between the various phases of the Xjoin operator. For that purpose, Urhan and Franklin [132] proposed a scheduling technique using the notion of Stream. Stream is the execution unit which consumes and produces tuples.
The execution schedule of Stream is determined at runtime and is changed according to the variations of the system behaviour (productions of tuples, terminated streams).
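The insert-then-probe behaviour of the double hash join described in this sub-section can be illustrated by a minimal in-memory sketch. The class name `DoubleHashJoin`, its method `insert` and the key-extraction functions are hypothetical and are not taken from [72]; the overflow of hash table partitions to secondary storage and the Xjoin phases are deliberately left out.

```python
from collections import defaultdict


class DoubleHashJoin:
    """Equi-join that keeps one in-memory hash table per operand (no spilling)."""

    def __init__(self, left_key, right_key):
        self.keys = {"left": left_key, "right": right_key}
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, tup):
        """Consume one tuple from `side`; return the result tuples it produces."""
        other = "right" if side == "left" else "left"
        key = self.keys[side](tup)
        self.tables[side][key].append(tup)          # 1. insert into own hash table
        matches = self.tables[other].get(key, [])   # 2. probe the other hash table
        if side == "left":
            return [(tup, match) for match in matches]
        return [(match, tup) for match in matches]


# Tuples arrive interleaved from both operands; results are delivered immediately.
join = DoubleHashJoin(left_key=lambda t: t[0], right_key=lambda t: t[0])
arrivals = [
    ("left", (1, "a")),
    ("right", (1, "x")),
    ("right", (2, "y")),
    ("left", (2, "b")),
    ("right", (1, "z")),
]
results = []
for side, tup in arrivals:
    results.extend(join.insert(side, tup))
```

Because every arriving tuple is both stored and used as a probe, the first result tuple is produced as soon as one matching pair has arrived, whichever operand delivers first. The price is that both hash tables grow in memory, which is the limitation addressed by moving partitions to secondary storage.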
6 Query Optimization in Large Scale Environments
6.1 Query Optimization in Large Scale Data Integration Systems
A large scale environment means [58]: (i) high numbers of data sources (e.g. databases, XML files), users, and computing resources (i.e. CPU, memory, network and I/O bandwidth) which are heterogeneous and autonomous, (ii) a network which offers, on average, low bandwidth and high latency, and (iii) huge volumes of data. In a large scale distributed environment, the performance of the previous optimization methods decreases because of: (i) the relatively large number of messages exchanged on a network with low bandwidth and high latency, and (ii) the bottleneck formed by the optimizer. It thus becomes appropriate to make the query execution autonomous and self-adaptable. In this perspective, two closely related approaches have been investigated: the broker approach [28], and the mobile agent approach [6, 76, 101, 110]. The second approach consists in using a programming model based on mobile agents [40], knowing that at present the mobile agent platforms supply only migration mechanisms and do not offer a proactive migration decision policy. The rest of this sub-section is devoted to describing the execution models associated with the broker and mobile agent approaches [6, 28, 66, 76, 98, 101, 110].
Broker Approach
In a large scale mediation system context, Collet and Vu [28] proposed an execution model based on brokers. The broker, which is the basic unit of query execution, supervises the execution of a sub-query. It detects estimation inaccuracies and adapts itself accordingly. Moreover, it communicates with the other brokers to take into account the updates of the execution environment. The principal components of a broker are: (i) a context including the annotations and constraints necessary for the execution of a sub-query, (ii) the operator of the sub-query, (iii) a buffer which synchronizes the data exchange between brokers, and (iv) rules which define the behavior of the broker according to changes in the execution environment.
Mobile Agent Approaches
A mobile agent [40] is an autonomous software entity which can move (code, data, and execution state) from one site to another in order to carry out a task. In a traditional operating system, the decision to migrate an activity is controlled by another process. In a mobile agent, however, the migration decision is made by the agent itself. The double hash join and Xjoin operators improve the local processing cost by adapting the use of CPU, I/O and memory resources to the changes in the execution environment (e.g. estimation errors, delays in data arrival rates), but they do not take the network resource into account. In order to take the network resource into account, the work of Arcangeli et al. [6], Hussein et al. [66] and Ozakar et al. [101], based on mobile agents, extends the algorithms of direct join, semi-join based join and dependent join (in the presence of binding patterns). This extension allows them to change their execution sites proactively. Each mobile agent executing a join chooses its own execution site by adapting to the execution environment (e.g. CPU load, bandwidth) and to the estimation accuracy of temporary relation sizes. Hence, the control which decides on a change of execution site is carried out in a decentralized and autonomous way. Furthermore, for dynamic query optimization, Morvan et al. [97] proposed three cooperation methods between mobile join agents. These methods allow a mobile agent to decide whether or not to migrate according to the decisions of the other agents communicating with it. These methods minimize the number of messages exchanged between agents.
Jones and Brown [76] propose, for large scale distributed queries, an execution model based on mobile agents which react to estimation inaccuracies. The mobile agents are in charge of executing the local sub-queries of an execution plan. These agents compare the partial results (e.g. size, execution costs) with the estimations used during compilation in order to detect sub-optimality. By taking into account the possibility of migrating mobile agents, two strategies were proposed:
1. Decentralized execution without migration: the agents executing sub-queries communicate with each other, by broadcasting their partial execution states, in order to produce an execution plan for the remainder of the query.
2. Decentralized execution with migration: this strategy extends the previous one by allowing the agents to migrate from one site to another before beginning their executions. The migration decision can be made in a distributed, individual or centralized way.
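The detection step on which both strategies rely, in which an agent compares its partial results with the compile-time estimations, can be illustrated by a small sketch. The relative-error test, the `tolerance` threshold and the function names are assumptions chosen for illustration; [76] is not claimed to use this particular criterion.

```python
def relative_error(estimated, observed):
    """Relative deviation of an observed value from its compile-time estimate."""
    if estimated == 0:
        return 0.0 if observed == 0 else float("inf")
    return abs(observed - estimated) / abs(estimated)


def detect_suboptimality(estimates, observed, tolerance=0.5):
    """Return the metrics whose relative error exceeds `tolerance`.

    `tolerance` is a hypothetical threshold; an empty result means that the
    agent has no reason to question the current plan.
    """
    flagged = {}
    for metric, estimate in estimates.items():
        error = relative_error(estimate, observed[metric])
        if error > tolerance:
            flagged[metric] = error
    return flagged


# Estimates used during compilation versus values measured by the agent.
estimates = {"size": 1000, "cost": 40.0}
observed = {"size": 5000, "cost": 44.0}
flagged = detect_suboptimality(estimates, observed)
```

Here the temporary relation is five times larger than estimated, so the size metric is flagged while the cost metric stays within the tolerance. An agent would then trigger one of the two strategies above, for instance by broadcasting its partial execution state.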
Another method based on mobile agents has been proposed in [110] in order to execute queries in a web context. In this context, the query result can correspond to a new query on another server which processes it. For this, two mechanisms were proposed, which are also known as parts of LDAP (Lightweight Directory Access Protocol) [64]: (i) referral, which consists in returning to the user the new query and the address of the server that will process it, and (ii) chaining, which consists in cooperating with the server executing the new query to produce the result. In this approach [110], mobile agents are used to exploit these two mechanisms in query processing. Each query is processed by a mobile agent which can choose the best adapted mechanism (referral or chaining).
6.2 Query Optimization in Data Grid Systems
For more than ten years, grid systems have been a very active research topic. The main objective of grid computing [39] is to provide a powerful platform which supplies resources (i.e. computational resources, services, metadata and data sources). Grid computing is very important for large scale distributed systems and applications that require effective management of distributed and heterogeneous resources [58]. Large scale and dynamicity of nodes (unstable system) characterize grid systems. Dynamicity of nodes (system instability) means that a node can join, leave or fail at any time. Today, grid computing, intended initially for intensive computing, is opening up towards the management of voluminous, heterogeneous, and distributed data in a large-scale environment. Grid data management [104] raises new problems and presents real challenges such as resource discovery and selection, query processing and optimization, autonomic management, security, and benchmarking. To tackle these fundamental problems [104], several methods have been proposed [5, 30, 48, 49, 65, 94, 129]. A thorough overview addressing most of the above fundamental problems is given in [104]. The authors discuss a set of open problems and new issues related to Grid data management using, mainly, Peer-to-Peer (P2P) techniques [104]. Focusing on the specific and very active problem of resource discovery, [129] proposes a complete review of the most promising Grid systems that include P2P resource discovery methods by considering the three main classes of P2P systems: unstructured, structured, and hybrid (super-peer). The advantages and weaknesses of some of the proposed methods are described in [104, 129]. The rest of this sub-section provides an overview of query processing and optimization in data grid systems.
Several approaches have been proposed for distributed query processing (DQP) in data grid environments [2, 5, 48, 49, 50, 65, 115, 135]. Smith et al. [115] tackle the role of DQP within the Grid and determine the impact of using the Grid for each step of DQP (e.g. resource selection). The properties of grid systems such as flexibility and power make them suitable platforms for DQP [115]. In recent years, the convergence between grid technologies and web services has led researchers to develop standardized grid interfaces. The Open Grid Services Architecture (OGSA) [38] is one of the best known standards used in grids. Many applications are developed by using OGSA standards [2, 5, 135]. OGSA-DQP [2] is a high level data integration tool for service-based Grids. It is built on OGSA-DAI [5], a Grid middleware that assists its users by accessing and
integrating data from separate sources via the Grid. [135] describes the concepts that provide virtual data sources on the Grid and that implement a Grid data mediation service which is integrated into OGSA-DAI.
By analyzing the approaches of DQP on the Grid, the research community focused on the current adaptive query processing approaches [7, 47, 62, 67, 74] and proposed extensions for grid environments [29, 48, 50]. These studies achieve query optimization by providing efficient resource utilization, without considering parallelization. Although they use different techniques, most of the studies rely on existing monitoring systems to determine the progress of the queries. In [48], Gounaris et al. highlighted the importance and challenges of DQP in Grids. They pointed out the need for grids by emphasizing the increasing demand for computation in distributed databases. They also explained the challenges in developing adaptive query processing systems by describing the weaknesses of existing studies and the key points for solutions. After presenting these challenges, Gounaris et al. [50] proposed an adaptive query processing algorithm which provides both a resource discovery/allocation mechanism and a dynamic query processing service. In [114], Slimani et al. developed a cost model by modeling the network characteristics and heterogeneity. By using this cost model, they also introduced a query optimization method on top of Beowulf clusters [34]. They considered both logical and physical costs and deployed the distributed query according to the cheapest cost model. In [29], Cybula et al. introduced a different technique for query optimization which is based on caching of query results. They developed a query optimizer which stores the results of queries inside the middleware and uses the cache registry to identify queries that need not be re-evaluated.
As far as the integration of the parallelism dimension is concerned, many authors have re-studied DQP so that it can be efficiently adopted by considering the properties (e.g. heterogeneity) of grids. Several methods have been proposed in this direction [13, 30, 49, 89, 106, 116], which define different algorithms for parallel query processing in grid environments. The proposed methods consider different forms of parallelism (e.g. pipelined parallelism), and all of them also consider resource discovery and load balancing. In [13], Bose et al. examined the problem of efficient resource allocation for query sub-plans. They developed their algorithm by exploiting bushy query trees, and incrementally distributed the sub-queries until a stopping condition was satisfied. In [30, 106] the authors introduced an adaptive parallel query processing middleware for the Grid. They developed a distributed query optimization strategy which is then integrated with a grid node scheduling algorithm by considering runtime statistics of the grid nodes. Gounaris et al. [49] proposed an algorithm which optimizes parallel query processing in grids by iteratively increasing the number of nodes which execute the parallelizable sub-plans. In [89], Liu et al. presented a query optimization algorithm which grades the nodes according to their capacities. They determined the serial and parallel parts of the queries and proposed an execution sequence on the highest ranked nodes. Soe et al. [116] proposed a parallel query optimization algorithm. In their study, they considered resource allocation, intra-query parallelism and inter-query parallelism by analyzing bushy query trees.
7 Discussion
According to the discussion in section 2.4 and the results in [122, 123, 68, 69, 70, 84], it is difficult to conclude that one search strategy (e.g. scheduling of the join operators) is superior to another. However, each of these works proposes a solution to improve the performance of these algorithms. Ioannidis and Kong [69, 70] proposed a new algorithm, called Two Phase Optimization [69], which consists in applying first the Iterative Improvement algorithm and then the Simulated Annealing algorithm. As for Swami [123], he experimented with a set of heuristics with the aim of improving the performance of the Iterative Improvement and Simulated Annealing algorithms [123]. The work of Ioannidis and Kong showed that the choice of a join method has no direct influence on the performance of the search strategies. In a parallel environment, Lanzelotte et al. [86] showed that the breadth-first search strategy is not applicable in a bushy search space for queries with 9 relations or more. The use of a randomized algorithm is then indispensable. The authors thus developed a randomized algorithm called Toured Simulated Annealing in a context of parallel processing [86].
The search strategies find the optimal solution more or less quickly according to their capacity to face the various problems. They must be adaptable to queries of diverse sizes (simple, medium, complex) and to various types of use (i.e. ad-hoc or repetitive) [54, 83]. A solution to this problem is the parameterization and the extensibility of query optimizers [71, 83] possessing several search strategies, each being adapted to a type of query. The major contributions in this domain arise, mainly, from the Rodin project [83, 84, 85, 86] as well as from the results of Ioannidis and Kong [69]. Indeed, one of the main aspects studied by Lanzelotte in [83] concerns the extensibility of the search strategy of the optimizer, demonstrated by the implementation of four different strategies: System R, Augmented Heuristic, Iterative Improvement and Simulated Annealing. Lanzelotte is especially interested in query optimization in new systems such as object-oriented and deductive DBMS, and proposes an extensible optimizer, OPUS (OPtimizer for Up-to-date database Systems) [83], for these non-conventional DBMS. Recently, Bizarro et al. [11] proposed “Progressive Parametric Query Optimization”, a novel framework to improve the performance of processing parameterized queries.
As far as parallel database systems are concerned, a synthesis dedicated to parallel relational query optimization methods and approaches [57] has been provided in section 3. In a static context [57], the most advanced works are certainly those of Garofalakis and Ioannidis [44, 45]. They elegantly extend the propositions of [23, 41, 60], where the parallel query algorithms are based on a uni-dimensional cost model. Furthermore, [45] tackles the scheduling problem (i.e. parallelism extraction) and the resource allocation in a context which can be multi-query, by considering a multidimensional model of the resources used (i.e. preemptive and non-preemptive). The proposals of [45] seem to be the richest in terms of categories of considered resources (i.e. multi-resource allocation), exploited parallelism, and various allocation constraints. In a dynamic context, the efforts were mainly centered on the handling of the following problems: (i) the determination and the dynamic adaptation of the intra-operation parallelism degree, (ii) the methods of resource allocation, and (iii) the dynamic query re-optimization. We identified a set of relevant parameters, mainly: search space,
generation strategy of a parallel execution plan, optimization cost for parallel execution, and cost model. These parameters make it possible: (i) to compare the two optimization approaches (i.e. one-phase, two-phase), and (ii) to help choose how best to exploit the parallel optimization approaches according to the query characteristics and the shape of the search space.
In a distributed database environment, static query optimization methods focus mainly on the optimization of inter-site communication costs, by reducing the data volume transferred between sites. Dynamic query optimization methods are based on dynamic scheduling (or re-scheduling) of inter-site operators to correct the sub-optimality due to estimation inaccuracies and variations of available resources. The introduction of a new operator, the semi-join based join [10, 25], certainly provides more flexibility to optimizers. However, it considerably increases the size of the search space.
Heterogeneity and autonomy of data sources characterize data integration systems. Sources might be restricted due to the limitations of their query interfaces, or certain attributes might have to be hidden for privacy reasons. To handle the limited query capabilities of data sources, new mechanisms have been introduced [46, 93], such as the Dependent Join operator, which is asymmetric in nature. The asymmetry of this operator causes the search space to be restricted and raises the issue of capturing valid (feasible) execution plans [92, 93, 139]. As for the optimization methods, the community quickly noticed that the centralized optimization methods [4, 7, 14, 15, 72, 73, 74, 77] could not be scaled up for the reasons previously pointed out. So, dynamic optimization methods were decentralized by relying, mainly, on brokers or on mobile agents, which allow decentralizing the control and scaling up. However, it is important to observe that the decentralized dynamic methods described in sub-section 5.3 build two hash tables (one for each operand relation). So, they do not apply to restricted data sources. Indeed, a restricted data source returns a result only if all the attributes which are mapped to 'b' are given.
In grid environments, which are characterized by large scale and dynamicity of nodes (system instability), distributed query optimization methods focus on two aspects: (i) the proposed execution models react to the state of resources by using monitoring services [36, 49, 137], and (ii) they consider different forms and types of parallelism (inter-query parallelism, intra-query parallelism).
Moreover, heterogeneity, autonomy, large scale and dynamicity of nodes raise new problems and present real challenges for designing and developing acceptable cost models [1, 35, 42, 43, 99, 114, 141]. Indeed, for instance, the statistics describing the data stemming from sources and the formulae associated with the operations processed by these sources often cannot be published [35]. In a large scale environment, whatever cost model approach is used (i.e. history approach [1], calibration approach [43, 141], generic approach [99]), the statistics stored in the catalog are subject to obsolescence [66], which generates large variations between the parameters estimated at compile time and the parameters computed at runtime. Consequently, it is not realistic to replicate a cost model on all sites. This cost model should be distributed and partially replicated [66, 58]. In an execution model based on mobile agents, a part of the cost model should be embedded in the mobile agents. This ensures the autonomy of mobile joins and avoids remote interactions with the site on which the query was issued [66].
Finally, from this state of the art, we can point out the following main characteristics of query optimization methods [98]:
− Environment: query optimization methods have been designed and implemented in different environments: uni-processor, parallel, distributed, and large scale.
− Type of method: a query optimization method can be static or dynamic.
− Search space: this space can be restricted according to the nature of the considered execution plans, the limited capabilities of data sources, and the applied search strategy.
− Nature of decision-making: can be centralized or decentralized. The decentralized dynamic optimization methods correct the sub-optimality of execution plans by decentralizing the control.
− Type of modification: can be, mainly, re-optimization or re-scheduling. When the sub-optimality of an execution plan is detected, the correction can be made by a re-optimization process or by a re-scheduling process. The re-optimization process consists in producing a new execution plan for the remainder of the query [77]. The physical implementation, the scheduling and the tree structure of the operators which have not yet been executed can be updated. In the re-scheduling process, the tree structure of the remainder of the execution plan remains unchanged, but the scheduling between the operators can be modified.
− Level of modification: can occur at the intra-operator level or at the inter-operator level. The sub-optimal execution plan can be corrected during the execution of an operator and/or at the sub-query level.
− Type of event: a dynamic query optimization method can react to the following events: (i) estimation errors, (ii) available memory, (iii) delays in data arrival rates, and (iv) user preferences.
These parameters make it possible to compare the proposed optimization methods and to point out their advantages and weaknesses. A comparative study of dynamic optimization methods is described in detail in [98]. Furthermore, in a large scale environment, mobile agents seem to be very promising due to their autonomy and proactive behavior, with benefits that depend on the estimation errors of temporary relation sizes, the network bandwidth, and the processor frequency.
8 Conclusion
Research related to relational query optimization goes back to the 1970s, and began with the publication of two papers [112, 138]. These papers and the requirements of relevant applications motivated a large part of the database community to focus their efforts and energies on this topic. Because of the importance and the complexity of the query optimization problem, the database community has proposed approaches, methods and techniques in different environments (uni-processor, parallel, distributed, large scale). In this paper, we have provided a survey of the evolution of query optimization methods from centralized relational database systems to data grid systems through parallel and distributed database systems and data integration (mediation)
systems. For each environment, we described some query optimization methods, and pointed out their main characteristics which allow comparing them.
Acknowledgement We would like to warmly thank Professor Roland Wagner for his kind invitation to write this paper.
Permissions
57. Hameurlain, A., Morvan, F.: Parallel query optimization methods and approaches: a survey. Journal of Computers Systems Science & Engineering 19(5), 95–114 (2004)
58. Hameurlain, A., Morvan, F., El Samad, M.: Large Scale Data management in Grid Systems: a Survey. In: IEEE Intl. Conf. on Information and Communication Technologies: from Theory to Applications, pp. 1–6. IEEE CS, Los Alamitos (2008)
98. Morvan, F., Hameurlain, A.: Dynamic Query Optimization: Towards Decentralized Methods. Intl. Jour. of Intelligent Information and Database Systems (to appear, 2009)
Section 1 contains materials from [98] with kind permission from Inderscience. Section 3 contains materials from [57] with kind permission from CRL Publishing. Section 5 contains materials from [98] with kind permission from Inderscience. Sections 6 and 7 contain materials from [58, 98] with kind permission from IEEE and Inderscience.
References
1. Adali, S., Candan, K.S., Papakonstantinou, Y., Subrahmanian, V.S.: Query Caching and Optimization in Distributed Mediator Systems. In: Proc. of ACM SIGMOD Intl. Conf. on Management of Data, pp. 137–148. ACM Press, New York (1996)
2. Alpdemir, M.N., Mukherjee, A., Gounaris, A., Paton, N.W., Fernandes, A.A.A., Sakellariou, R., Watson, P., Li, P.: Using OGSA-DQP to support scientific applications for the grid. In: Herrero, P., Pérez, M.S., Robles, V. (eds.) SAG 2004. LNCS, vol. 3458, pp. 13–24. Springer, Heidelberg (2005)
3. Amsaleg, L., Franklin, M.J., Tomasic, A., Urhan, T.: Scrambling query plans to cope with unexpected delays. In: Proc. of the Fourth Intl. Conf. on Parallel and Distributed Information Systems, pp. 208–219. IEEE CS, Los Alamitos (1996)
4. Amsaleg, L., Franklin, M., Tomasic, A.: Dynamic query operator scheduling for wide-area remote access. Distributed and Parallel Databases 6(3), 217–246 (1998)
5. Antonioletti, M., et al.: The design and implementation of Grid database services in OGSA-DAI. In: Concurrency and Computation: Practice & Experience, vol. 17, pp. 357–376. Wiley InterScience, Hoboken (2005)
6. Arcangeli, J.-P., Hameurlain, A., Migeon, F., Morvan, F.: Mobile Agent Based Self-Adaptive Join for Wide-Area Distributed Query Processing. Jour. of Database Management 15(4), 25–44 (2004)
7. Avnur, R., Hellerstein, J.-M.: Eddies: Continuously Adaptive Query Processing. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, vol. 29, pp. 261–272. ACM Press, New York (2000)
8. Babu, S., Bizarro, P., De Witt, D.J.: Proactive re-optimization. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 107–118. ACM Press, New York (2005)
9. Bancilhon, F., Ramakrishnan, R.: An Amateur’s Introduction to Recursive Query Processing Strategies. In: Proc. of the 1986 ACM SIGMOD Conf. on Management of Data, vol. 15, pp. 16–52. ACM Press, New York (1986)
10. Bernstein, P.A., Goodman, N., Wong, E., Reeve, C.L., Rothnie Jr.: Query Processing in a System for Distributed Databases (SDD-1). ACM Trans. Database Systems 6(4), 602–625 (1981)
11. Bizarro, P., Bruno, N., De Witt, D.J.: Progressive Parametric Query Optimization. IEEE Transactions on Knowledge and Data Engineering 21(4), 582–594 (2009)
12. Bonneau, S., Hameurlain, A.: Hybrid Simultaneous Scheduling and Mapping in SQL Multi-query Parallelization. In: Bench-Capon, T.J.M., Soda, G., Tjoa, A.M. (eds.) DEXA 1999. LNCS, vol. 1677, pp. 88–99. Springer, Heidelberg (1999)
13. Bose, S.K., Krishnamoorthy, S., Ranade, N.: Allocating Resources to Parallel Query Plans in Data Grids. In: Proc. of the 6th Intl. Conf. on Grid and Cooperative Computing, pp. 210–220. IEEE CS, Los Alamitos (2007)
14. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: A dynamic query processing architecture for data integration systems. Journal of IEEE Data Engineering Bulletin 23(2), 42–48 (2000)
15. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: Dynamic query scheduling in data integration systems. In: Proc. of the 16th Intl. Conf. on Data Engineering, pp. 425–434. IEEE CS, Los Alamitos (2000)
16. Bratbergsengen, K.: Hashing Methods and Relational Algebra Operations. In: Proc. of 10th Intl. Conf. on VLDB, pp. 323–333. Morgan Kaufmann, San Francisco (1984)
17. Brunie, L., Kosch, H.: Control Strategies for Complex Relational Query Processing in Shared Nothing Systems. SIGMOD Record 25(3), 34–39 (1996)
18. Brunie, L., Kosch, H.: Intégration d’heuristiques d’ordonnancement dans l’optimisation parallèle de requêtes relationnelles. Revue Calculateurs Parallèles, numéro spécial: Bases de données Parallèles et Distribuées 9(3), 327–346 (1997); Ed. Hermès
19. Brunie, L., Kosch, H., Wohner, W.: From the modeling of parallel relational query processing to query optimization and simulation. Parallel Processing Letters 8, 2–24 (1998)
20. Bruno, N., Chaudhuri, S.: Efficient Creation of Statistics over Query Expressions. In: Proc. of the 19th Intl. Conf. on Data Engineering, Bangalore, India, pp. 201–212. IEEE CS, Los Alamitos (2003)
21. Chaudhuri, S.: An Overview of Query Optimization in Relational Systems. In: Symposium in Principles of Database Systems PODS 1998, pp. 34–43. ACM Press, New York (1998)
22. Chawathe, S.S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J.D., Widom, J.: The TSIMMIS Project: Integration of Heterogeneous Information Sources. In: Proc. of the 10th Meeting of the Information Processing Society of Japan, pp. 7–18 (1994)
236
A. Hameurlain and F. Morvan
23. Chekuri, C., Hassan, W.: Scheduling Problem in Parallel Query Optimization. In: Symposium in Principles of Database Systems PODS 1995, pp. 255–265. ACM Press, New York (1995) 24. Chen, M.S., Lo, M., Yu, P.S., Young, H.S.: Using Segmented Right-Deep Trees for the Execution of Pipelined Hash Joins. In: Proc. of the 18th VLDB Conf., pp. 15–26. Morgan Kaufmann, San Francisco (1992) 25. Chiu, D.M., Ho, Y.C.: A Methodology for Interpreting Tree Queries Into Optimal SemiJoin Expressions. In: Proc. of the 1980 ACM SIGMOD, pp. 169–178. ACM Press, New York (1980) 26. Christophides, V., Cluet, S., Moerkotte, G.: Evaluating Queries with Generalized Path Expression. In: Proc. of the 1996 ACM SIGMOD, vol. 25, pp. 413–422. ACM Press, New York (1996) 27. Cole, R.L., Graefe, G.: Optimization of dynamic query evaluation plans. In: Proc. of the 1994 ACM SIGMOD, vol. 24, pp. 150–160. ACM Press, New York (1994) 28. Collet, C., Vu, T.-T.: QBF: A Query Broker Framework for Adaptable Query Evaluation. In: Christiansen, H., Hacid, M.-S., Andreasen, T., Larsen, H.L. (eds.) FQAS 2004. LNCS, vol. 3055, pp. 362–375. Springer, Heidelberg (2004) 29. Cybula, P., Kozankiewicz, H., Stencel, K., Subieta, K.: Optimization of Distributed Queries in Grid Via Caching. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM-WS 2005. LNCS, vol. 3762, pp. 387–396. Springer, Heidelberg (2005) 30. Da Silva, V.F.V., Dutra, M.L., Porto, F., Schulze, B., Barbosa, A.C., de Oliveira, J.C.: An adaptive parallel query processing middleware for the Grid. In: Concurrence and Computation: Pratique and Experience, vol. 18, pp. 621–634. Wiley InterScience, Hoboken (2006) 31. Date, C.J.: An Introduction to Database Systems, 6th edn. Addison-Wesley, Reading (1995) 32. Deshpande, A., Hellerstein, J.-M.: Lifting the Burden of History from Adaptive Query Processing. In: Proc. of the 13th Intl. Conf. on VLDB, pp. 948–959. Morgan Kaufmann, San Francisco (2004) 33. 
De Witt, D.J., Kabra, N., Luo, J., Patel, J.M., Yu, J.B.: Client-Server Paradise. In: Proc. of the 20th VLDB Conf., pp. 558–569. Morgan Kaufmann, San Francisco (1994) 34. Dinquel, J.: Network Architectures for Cluster Computing. Technical Report 572, CECS, California State University (2000) 35. Du, W., Krishnamurthy, R., Shan, M.-C.: Query Optimization in a Heterogeneous DBMS. In: Proc. of the 18th Intl. Conf. on VLDB, pp. 277–291. Morgan Kaufmann, San Francisco (1992) 36. El Samad, M., Gossa, J., Morvan, F., Hameurlain, A., Pierson, J.-M., Brunie, L.: A monitoring service for large-scale dynamic query optimisation in a grid environment. Intl. Jour. of Web and Grid Services 4(2), 222–246 (2008) 37. Evrendilek, C., Dogac, A., Nural, S., Ozcan, F.: Multidatabase Query Optimization. Journal of Distributed and Parallel Databases 5(1), 77–113 (1997) 38. Foster, I.: The Grid: A New Infrastructure for 21st Century Science. Physics Today 55(2), 42–56 (2002) 39. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (2004) 40. Fuggetta, A., Picco, G.-P., Vigna, G.: Understanding Code Mobility. IEEE Transactions on Software Engineering 24(5), 342–361 (1998)
Evolution of Query Optimization Methods
237
41. Ganguly, S., Hasan, W., Krishnamurthy, R.: Query Optimization for Parallel Execution. In: Proc. of the 1992 ACM SIGMOD int’l. Conf. on Management of Data, vol. 21, pp. 9– 18. ACM Press, San Diego (1992) 42. Ganguly, S., Goel, A., Silberschatz, A.: Efficient and Accurate Cost Models for Parallel Query Optimization. In: Symposium in Principles of Database Systems PODS 1996, pp. 172–182. ACM Press, New York (1996) 43. Gardarin, G., Sha, F., Tang, Z.-H.: Calibrating the Query Optimizer Cost Model of IRODB, an Object-Oriented Federated Database System. In: Proc. of 22nd Intl. Conf. on VLDB, pp. 378–389. Morgan Kaufmann, San Francisco (1996) 44. Garofalakis, M.N., Ioannidis, Y.E.: Multi-dimensional Resource Scheduling for Parallel Queries. In: Proc. of the 1996 ACM SIGMOD intl. Conf. on Management of Data, vol. 25, pp. 365–376. ACM Press, New York (1996) 45. Garofalakis, M.N., Ioannidis, Y.E.: Parallel Query Scheduling and Optimization with Time- and Space - Shared Resources. In: Proc. of the 23rd VLDB Conf., pp. 296–305. Morgan Kaufmann, San Francisco (1997) 46. Goldman, R., Widom, J.: WSQ/DSQ: A practical approach for combined querying of databases and the web. In: Proc. of ACM SIGMOD Conf., pp. 285–296. ACM Press, New York (2000) 47. Gounaris, A., Paton, N.W., Fernandes, A.A.A., Sakellariou, R.: Adaptive Query Processing: A Survey. In: Eaglestone, B., North, S.C., Poulovassilis, A. (eds.) BNCOD 2002. LNCS, vol. 2405, pp. 11–25. Springer, Heidelberg (2002) 48. Gounaris, A., Paton, N.W., Sakellariou, R., Fernandes, A.A.A.: Adaptive Query Processing and the Grid: Opportunities and Challenges. In: Proc. of the 15th Intl. Dexa Workhop, pp. 506–510. IEEE CS, Los Alamitos (2004) 49. Gounaris, A., Sakellariou, R., Paton, N.W., Fernandes, A.A.A.: Resource Scheduling for Parallel Query Processing on Computational Grids. In: Proc. of the 5th IEEE/ACM Intl. Workshop on Grid Computing, pp. 396–401 (2004) 50. 
Gounaris, A., Smith, J., Paton, N.W., Sakellariou, R., Fernandes, A.A.A., Watson, P.: Adapting to Changing Resource Performance in Grid Query. In: Pierson, J.-M. (ed.) VLDB DMG 2005. LNCS, vol. 3836, pp. 30–44. Springer, Heidelberg (2006) 51. Graefe, G.: Query Evaluation Techniques for Large Databases. ACM Computing Survey 25(2), 73–170 (1993) 52. Graefe, G.: Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 6(1), 120–135 (1994) 53. Haas, L.M., Kossmann, D., Wimmers, E.L., Yang, J.: Optimizing Queries Across Diverse Data Sources. In: Proc. of 23rd Intl. Conf. on VLDB, pp. 276–285. Morgan Kaufmann, San Francisco (1997) 54. Hameurlain, A., Bazex, P., Morvan, F.: Traitement parallèle dans les bases de données relationnelles: concepts, méthodes et applications. Cépaduès Editions (1996) 55. Hameurlain, A., Morvan, F.: An Overview of Parallel Query Optimization in Relational Systems. In: 11th Intl Worshop on Database and Expert Systems Applications, pp. 629– 634. IEEE CS, Los Alamitos (2000) 56. Hameurlain, A., Morvan, F.: CPU and incremental memory allocation in dynamic parallelization of SQL queries. Journal of Parallel Computing 28(4), 525–556 (2002) 57. Hameurlain, A., Morvan, F.: Parallel query optimization methods and approaches: a survey. Journal of Computers Systems Science & Engineering 19(5), 95–114 (2004) 58. Hameurlain, A., Morvan, F., El Samad, M.: Large Scale Data management in Grid Systems: a Survey. In: IEEE Intl. Conf. on Information and Communication Technologies: from Theory to Applications, pp. 1–6. IEEE CS, Los Alamitos (2008)
238
A. Hameurlain and F. Morvan
59. Han, W.-S., Ng, J., Markl, V., Kache, H., Kandil, M.: Progressive optimization in a shared-nothing parallel database. In: Proc.of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 809–820 (2007) 60. Hasan, W., Motwani, R.: Optimization Algorithms for Exploiting the Parallelism - Communication Tradeoff in Pipelined Parallelism. In: Proc. of the 20th int’l. Conf. on VLDB, pp. 36–47. Morgan Kaufmann, San Francisco (1994) 61. Hasan, W., Florescu, D., Valduriez, P.: Open Issues in Parallel Query Optimization. SIGMOD Record 25(3), 28–33 (1996) 62. Hellerstein, J.M., Franklin, M.J.: Adaptive Query Processing: Technology in Evolution. Bulletin of Technical Committee on Data Engineering 23(2), 7–18 (2000) 63. Hong, W.: Exploiting Inter-Operation Parallelism in XPRS. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 19–28. ACM Press, New York (1992) 64. Howes, T., Smith, M.C., Good, G.S., Howes, T.A., Smith, M.: Understanding and Deploying LDAP Directory Services. MacMillan, Basingstoke (1999) 65. Hu, N., Wang, Y., Zhao, L.: Dynamic Optimization of Sub query Processing in Grid Database, Natural Computation. In: Proc of the 3rd Intl Conf. on Natural Computation, vol. 5, pp. 8–13. IEEE CS, Los Alamitos (2007) 66. Hussein, M., Morvan, F., Hameurlain, A.: Embedded Cost Model in Mobile Agents for Large Scale Query Optimization. In: Proc. of the 4th Intl. Symposium on Parallel and Distributed Computing, pp. 199–206. IEEE CS, Los Alamitos (2005) 67. Hussein, M., Morvan, F., Hameurlain, A.: Dynamic Query Optimization: from Centralized to Decentralized. In: 19th Intl. Conf. on Parallel and Distributed Computing Systems, ISCA, pp. 273–279 (2006) 68. Ioannidis, Y.E., Wong, E.: Query Optimization by Simulated Annealing. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 9–22. ACM Press, New York (1987) 69. Ioannidis, Y.E., Kang, Y.C.: Randomized Algorithms for Optimizing Large Join Queries. In: Proc of the 1990 ACM SIGMOD Conf. on the Manag. of Data, vol. 
19, pp. 312–321 (1990) 70. Ioannidis, Y.E., Christodoulakis, S.: On the Propagation of Errors in the Size of Join Results. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 268–277. ACM Press, New York (1991) 71. Ioannidis, Y.E., Ng, R.T., Shim, K., Sellis, T.K.: Parametric Query Optimization. In: 18th Intl. Conf. on VLDB, pp. 103–114. Morgan Kaufmann, San Francisco (1992) 72. Ives, Z.-G., Florescu, D., Friedman, M., Levy, A.Y., Weld, D.S.: An adaptive query execution system for data integration. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 299–310. ACM Press, New York (1999) 73. Ives, Z.-G., Levy, A.Y., Weld, D.S., Florescu, D., Friedman, M.: Adaptive query processing for internet applications. Journal of IEEE Data Engineering Bulletin 23(2), 19–26 (2000) 74. Ives, Z.-G., Halevy, A.-Y., Weld, D.-S.: Adapting to Source Properties in Processing Data Integration Queries. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 395–406. ACM Press, New York (2004) 75. Jarke, M., Koch, J.: Query Optimization in Database Systems. ACM Comput. Surv. 16(2), 111–152 (1984) 76. Jones, R., Brown, J.: Distributed query processing via mobile agents (1997), http://www.cs.umd.edu/~rjones/paper.html
Evolution of Query Optimization Methods
239
77. Kabra, N., Dewitt, D.J.: Efficient Mid - Query Re-Optimization of Sub-Optimal Query Execution Plans. In: Proc. of the ACM SIGMOD intl. Conf. on Management of Data, vol. 27, pp. 106–117. ACM Press, New York (1998) 78. Kabra, N., De Witt, D.J.: OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization. VLDB Journal 8, 55–78 (1999) 79. Khan, M.F., Paul, R., Ahmed, I., Ghafoor, A.: Intensive Data Management in Parallel Systems: A Survey. Distributed and Parallel Databases 7, 383–414 (1999) 80. Khan, L., Mcleod, D., Shahabi, C.: An Adaptive Probe-Based Technique to Optimize Join Queries in Distributed Internet Databases. Journal of Database Management 12(4), 3–14 (2001) 81. Kosch, H.: Managing the operator ordering problem in parallel databases. Future Generation Computer Systems 16(6), 665–676 (2000) 82. Kossmann, D.: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4), 422–469 (2000) 83. Lanzelotte, R.S.G.: OPUS: an extensible Optimizer for Up-to-date database Systems. PhD Thesis, Computer Science, PUC-RIO, available at INRIA, Rocquencourt, n° TU-127 (1990) 84. Lanzelotte, R.S.G., Valduriez, P.: Extending the Search Strategy in a Query Optimizer. In: Proc. of the Int’l Conf. on VLDB, pp. 363–373. Morgan Kaufmann, San Francisco (1991) 85. Lanzelotte, R.S.G., Zaït, M., Gelder, A.V.: Measuring the effectiveness of optimization. Search Strategies. In: BDA 1992, Trégastel, pp. 162–181 (1992) 86. Lanzelotte, R.S.G., Valduriez, P., Zaït, M.: On the Effectiveness of Optimization Search Strategies for Parallel Execution Spaces. In: Proc. of the Intl Conf. on VLDB, pp. 493– 504. Morgan Kaufmann, San Francisco (1993) 87. Lazaridis, I., Mehrotra, S.: Optimization of multi-version expensive predicates. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 797–808. ACM Press, New York (2007) 88. Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying Heterogeneous Information Sources Using Source Descriptions. 
In: Proc. of the Intl. Conf. on VLDB, pp. 251–262. Morgan Kaufmann, San Francisco (1996) 89. Liu, S., Karimi, H.A.: Grid query optimizer to improve query processing in grids. Future Generation Computer Systems 24(5), 342–353 (2008) 90. Lu, H., Ooi, B.C., Tan, K.-L.: Query Processing in Parallel Relational Database Systems. IEEE CS Press, Los Alamitos (1994) 91. Mackert, L.F., Lohman, G.M.: R* Optimizer Validation and Performance Evaluation for Distributed Queries. In: Proc. of the 12th Intl. Conf. on VLDB, pp. 149–159 (1986) 92. Manolescu, I.: Techniques d’optimisation pour l’interrogation des sources de données hétérogènes et distribuées, Ph-D Thesis, Université de Versailles Saint-Quentin-enYvlenies, France (2001) 93. Manolescu, I., Bouganim, L., Fabret, F., Simon, E.: Efficient querying of distributed resources in mediator systems. In: Meersman, R., Tari, Z., et al. (eds.) CoopIS 2002, DOA 2002, and ODBASE 2002. LNCS, vol. 2519, pp. 468–485. Springer, Heidelberg (2002) 94. Marzolla, M., Mordacchini, M., Orlando, S.: Peer-to-Peer for Discovering resources in a Dynamic Grid. Jour. of Parallel Computing 33(4-5), 339–358 (2007) 95. Mehta, M., Dewitt, D.J.: Managing Intra-Operator Parallelism in Parallel Database Systems. In: Proc. of the 21th Intl. Conf. on VLDB, pp. 382–394 (1995) 96. Mehta, M., Dewitt, D.J.: Data Placement in Shared-Nothing Parallel Database Systems. The VLDB Journal 6, 53–72 (1997)
240
A. Hameurlain and F. Morvan
97. Morvan, F., Hussein, M., Hameurlain, A.: Mobile Agent Cooperation Methods for Large Scale Distributed Dynamic Query Optimization. In: Proc. of the 14th Intl. Workshop on Database and Expert Systems Applications, pp. 542–547. IEEE CS, Los Alamitos (2003) 98. Morvan, F., Hameurlain, A.: Dynamic Query Optimization: Towards Decentralized Methods. Intl. Jour. of Intelligent Information and Database Systems (to appear, 2009) 99. Naacke, H., Gardarin, G., Tomasic, A.: Leveraging Mediator Cost Models with Heterogeneous Data Sources. In: Proc. of the 14th Intl. Conf. on Data Engineering, pp. 351–360. IEEE CS, Los Alamitos (1998) 100. Ono, K., Lohman, G.M.: Measuring the Complexity of Join Enumeration in Query Optimization. In: Proc. of the Int’l Conf. on VLDB, pp. 314–325. Morgan Kaufmann, San Francisco (1990) 101. Ozakar, B., Morvan, F., Hameurlain, A.: Mobile Join Operators for Restricted Sources. Mobile Information Systems: An International Journal 1(3), 167–184 (2005) 102. Ozcan, F., Nural, S., Koksal, P., Evrendilek, C., Dogac, A.: Dynamic query optimization in multidatabases. Data Engineering Bulletin CS 20(3), 38–45 (1997) 103. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 2nd edn. PrenticeHall, Englewood Cliffs (1999) 104. Pacitti, E., Valduriez, P., Mattoso, M.: Grid Data Management: Open Problems and News Issues. Intl. Journal of Grid Computing 5(3), 273–281 (2007) 105. Paton, N.W., Chávez, J.B., Chen, M., Raman, V., Swart, G., Narang, I., Yellin, D.M., Fernandes, A.A.A.: Autonomic query parallelization using non-dedicated computers: an evaluation of adaptivity options. VLDB Journal 18(1), 119–140 (2009) 106. Porto, F., da Silva, V.F.V., Dutra, M.L., Schulze, B.: An Adaptive Distributed Query Processing Grid Service. In: Pierson, J.-M. (ed.) VLDB DMG 2005. LNCS, vol. 3836, pp. 45–57. Springer, Heidelberg (2006) 107. Rahm, E., Marek, R.: Dynamic Multi-Resource Load Balancing in Parallel Database Systems. In: Proc. 
of the 21st VLDB Conf., pp. 395–406 (1995) 108. Rajaraman, A., Sagiv, Y., Ullman, J.D.: Answering queries using templates with binding patterns. In: The Proc. of ACM PODS, pp. 105–112. ACM Press, New York (1995) 109. Raman, V., Deshpande, A., Hellerstein, J.-M.: Using State Modules for Adaptive Query Processing. In: Proc. of the 19th Intl. Conf. on Data Engineering, pp. 353–362. IEEE CS, Los Alamitos (2003) 110. Sahuguet, A., Pierce, B., Tannen, V.: Distributed Query Optimization: Can Mobile Agents Help? (2000), http://www.seas.upenn.edu/~gkarvoun/dragon/publications/ sahuguet/ 111. Schneider, D., Dewitt, D.J.: Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines. In: Proc. of the 16th VLDB Conf., pp. 469–480. Morgan Kaufmann, San Francisco (1990) 112. Selinger, P.G., Astrashan, M., Chamberlin, D., Lorie, R., Price, T.: Access Path Selection in a Relational Database Management System. In: Proc. of the 1979 ACM SIGMOD Conf. on Management of Data, pp. 23–34. ACM Press, New York (1979) 113. Selinger, P.G., Adiba, M.E.: Access Path Selection in Distributed Database Management Systems. In: Proc. Intl. Conf. on Data Bases, pp. 204–215 (1980) 114. Slimani, Y., Najjar, F., Mami, N.: An Adaptable Cost Model for Distributed Query Optimization on the Grid. In: Meersman, R., Tari, Z., Corsaro, A. (eds.) OTM-WS 2004. LNCS, vol. 3292, pp. 79–87. Springer, Heidelberg (2004)
Evolution of Query Optimization Methods
241
115. Smith, J., Gounaris, A., Watson, P., Paton, N.W., Fernandes, A.A.A., Sakellariou, R.: Distributed Query Processing on the Grid. In: Parashar, M. (ed.) GRID 2002. LNCS, vol. 2536, pp. 279–290. Springer, Heidelberg (2002) 116. Soe, K.M., New, A.A., Aung, T.N., Naing, T.T., Thein, N.L.: Efficient Scheduling of Resources for Parallel Query Processing on Grid-based Architecture. In: Proc. of the 6th Asia-Pacific Symposium, pp. 276–281. IEEE CS, Los Alamitos (2005) 117. Stillger, M., Lohman, G.M., Markl, V., Kandil, M.: LEO - DB2’s LEarning Optimizer. In: Proc.of 27th Intl. Conf. on Very Large Data Bases, pp. 19–28. Morgan Kaufmann, San Francisco (2001) 118. Stonebraker, M., Katz, R.H., Paterson, D.A., Ousterhout, J.K.: The Design of XPRS. In: Proc. of the 4th VLDB Conf., pp. 318–330. Morgan Kaufmann, San Francisco (1988) 119. Stonebraker, M., Aoki, P.M., Litwin, W., Pfeffer, A., Sah, A., Sidell, J., Staelin, C., Yu, A.: Mariposa: A Wide-Area Distributed Database System. VLDB Jour. 5(1), 48–63 (1996) 120. Stonebraker, M., Hellerstein, J.M.: Readings in Database Systems, 3rd edn. Morgan Kaufmann, San Francisco (1998) 121. Swami, A.: Optimization of large join queries. Technical report, Software Techonology Laboratory, H-P Laboratories, Report STL-87-15 (1987) 122. Swami, A.N., Gupta, A.: Optimization of Large Join Queries. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 8–17. ACM Press, New York (1988) 123. Swami, A.N.: Optimization of Large Join Queries: Combining Heuristic and Combinatorial Techniques. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 367–376 (1989) 124. Tan, K.L., Lu, H.: A Note on the Strategy Space of Multiway Join Query Optimization Problem in Parallel Systems. SIGMOD Record 20(4), 81–82 (1991) 125. Taniar, D., Leung, C.H.C.: Query execution scheduling in parallel object-oriented databases. Information & Software Technology 41(3), 163–178 (1999) 126. 
Taniar, D., Leung, C.H.C., Rahayu, J.W., Goel, S.: High Performance Parallel Database Processing and Grid Databases. John Wiley & Sons, Chichester (2008) 127. Tomasic, A., Raschid, L., Valduriez, P.: Scaling Heterogeneous Databases and the Design of Disco. In: Proc. of the 16th Intl. Conf. on Distributed Computing Systems, pp. 449–457. IEEE CS, Los Alamitos (1996) 128. Tomasic, A., Raschid, L., Valduriez, P.: Scaling Access to Heterogeneous Data Sources with DISCO. IEEE Trans. Knowl. Data Eng. 10(5), 808–823 (1998) 129. Trunfio, P., et al.: Peer-to-Peer resource discovery in Grids: Models and systems. Future Generation Computer Systems 23(7), 864–878 (2007) 130. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press (1988) 131. Urhan, T., Franklin, M.: XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin 23(2), 27–33 (2000) 132. Urhan, T., Franklin, M.: Dynamic pipeline scheduling for improving interactive query performance. In: Proc.of 27th Intl. Conf. on VLDB, pp. 501–510. Morgan Kaufmann, San Francisco (2001) 133. Valduriez, P.: Semi-Join Algorithms for Distributed Database Machines. In: Proc. of the 2nd Intl. Symposium on Distributed Data Bases, pp. 22–37. North-Holland Publishing Company, Amsterdam (1982) 134. Valduriez, P., Gardarin, G.: Join and Semijoin Algorithms for a Multiprocessor Database Machine. ACM Trans. Database Syst. 9(1), 133–216 (1984)
242
A. Hameurlain and F. Morvan
135. Wohrer, A., Brezany, P., Tjoa, A.M.: Novel mediator architectures for Grid information systems. Future Generation Computer Systems, 107–114 (2005) 136. Wiederhold, G.: Mediators in the Architecture of Future Information Systems. Journal of IEEE Computer 25(3), 38–49 (1992) 137. Wolski, R., Spring, N.T., Hayes, J.: The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. Journal of Future Generation Computing Systems 15(5-6), 757–768 (1999) 138. Wong, E., Youssefi, K.: Decomposition: A Strategy for Query Processing. ACM Transactions on Database Systems 1, 223–241 (1976) 139. Yerneni, R., Li, C., Ullman, J.D., Garcia-Molina, H.: Optimizing Large Join Queries in Mediation Systems. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 348–364. Springer, Heidelberg (1998) 140. Zhou, Y., Ooi, B.C., Tan, K.-L., Tok, W.H.: An adaptable distributed query processing architecture. Data & Knowledge Engineering 53(3), 283–309 (2005) 141. Zhu, Q., Motheramgari, S., Sun, Y.: Cost Estimation for Queries Experiencing Multiple Contention States in Dynamic Multidatabase Environments. Journal of Knowledge and Information Systems Publishers 5(1), 26–49 (2003) 142. Ziane, M., Zait, M., Borlat-Salamet, P.: Parallel Query Processing in DBS3. In: Proc of the 2nd Intl. Conf. on Parallel and Distributed Information Systems, pp. 93–102. IEEE CS, Los Alamitos (1993)
Holonic Rationale and Bio-inspiration on Design of Complex Emergent and Evolvable Systems

Paulo Leitao
Polytechnic Institute of Bragança, Campus Sta Apolonia, Apartado 1134, 5301-857 Bragança, Portugal
[email protected]
Abstract. Traditional centralized and rigid control structures are too inflexible to meet the requirements of reconfigurability, responsiveness and robustness imposed by customer demands in the current global economy. The Holonic Manufacturing Systems (HMS) paradigm, which has been pointed out as a suitable solution to these requirements, translates concepts inherited from social organizations and biology to the manufacturing world. It offers an alternative way of designing adaptive systems in which traditional centralized control is replaced by decentralization over distributed and autonomous entities, organized in hierarchical structures formed by intermediate stable forms. In spite of its enormous potential, methods for the self-adaptation and self-organization of complex systems are still missing. This paper discusses how insights from biology, in connection with new fields of computer science, can enhance holonic design in order to achieve more self-adaptive and evolvable systems. Special attention is devoted to the concepts of emergent behavior and self-organization, and to the way they can be combined with the holonic rationale.

Keywords: Holonic Manufacturing Systems, Bio-inspiration, Emergent Behavior, Self-organization.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 243–266, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

Nowadays, to be competitive in the current global economy and to face customized customer demands, manufacturing companies must compete on cost, quality and responsiveness [1]. Several studies, e.g. the one elaborated by the Manufuture High Level Group of experts, promoted by the European Commission [2], reinforce this idea by pointing out reconfigurable manufacturing as the highest priority for future research in manufacturing. Re-configurability, that is, the ability of a system to dynamically change its configuration, usually in response to dynamic changes in its environment, provides the way to achieve a rapid and adaptive response to change, which is a key enabler of competitiveness.

Traditional manufacturing control approaches typically fall into large monolithic and centralized systems, exhibiting a low capacity for adaptation to the dynamic changes of their environment and thus not efficiently supporting the demanded requirements. The quest for re-configurability requires a new class of intelligent and distributed manufacturing control systems that operate in a totally distinct way from the traditional ones, with centralized and rigid control structures replaced by the decentralization of control functions over distributed entities. Multi-agent systems [3], derived from distributed artificial intelligence and based on the ideas proposed by Marvin Minsky in his seminal work “The Society of Mind” [4], suggest the definition of distributed control based on autonomous agents that account for the realization of efficient, flexible, reconfigurable and robust overall plant control, without any need for centralized control.

Other emergent manufacturing paradigms have been proposed in recent years, namely Bionic Manufacturing Systems (BMS) [5] and Holonic Manufacturing Systems (HMS) [6], which use biological and social organization theories as sources of inspiration. These paradigms are unified in proposing distributed, autonomous and adaptive manufacturing systems, suggesting that hierarchy is needed in order to guarantee inter-entity conflict resolution and to maintain the overall system coherence and objectivity resulting from the individual and autonomous attitude of the entities [7]. Among others, the Product-Resource-Order-Staff Architecture (PROSA) [6], the ADAptive holonic COntrol aRchitecture for distributed manufacturing systems (ADACOR) [8], the Holonic Component-Based Architecture (HCBA) [9] and P+2000 [10] are successful examples of the application of the HMS principles. See for example [11] and [12] for a deeper analysis of the state of the art in agent-based manufacturing control.

The application of multi-agent systems and holonic paradigms does not, by itself, completely solve the current manufacturing problems; it is necessary to combine them with mechanisms that support dynamic structure re-configuration, thus dealing more effectively with unpredicted behavior and minimizing its effects.
In other words, questions such as how global production optimization is achieved in decentralized systems, how temporary hierarchies are dynamically formed, evolved and removed, how individual components self-organize and evolve to support evolution and emergence, and how to adapt their emergent behavior using learning algorithms, are still far from being answered. In fact, in spite of the interesting potential introduced by these paradigms, methods and tools are required to facilitate their design and maintenance, in particular regarding their self-adaptation and self-organization properties.

Biology and nature seem suitable sources of inspiration to answer the above questions, leading to better solutions for self-adaptive and evolvable complex systems. A recent article published in the National Geographic magazine reinforces this idea, stating that “the study of swarm intelligence is providing insights that can help humans to manage complex systems”, based on the idea that “a single ant or bee isn't smart but their colonies are” [13]. Biological theories are being successfully used to develop complex adaptive applications, namely in economics, logistics and military applications. As examples, Air Liquide uses an ant-based strategy to manage the truck routes for delivering industrial and medical gases, the Sky Harbor International Airport in Phoenix uses an ant-based model to improve airline scheduling [13], and swarm intelligence principles were used to forecast energy demands in Turkey.

In this context, this paper discusses the benefits that bio-inspired theories can bring to the manufacturing world, analyzing how insights from biology, such as emergence and self-organization, in connection with new fields of computer science,
such as artificial life and evolutionary computing, can be applied to empower the design of holonic systems, aiming to achieve more adaptive and re-configurable systems. The ADACOR holonic approach is used to illustrate the application of some biologically inspired concepts to the design of an adaptive production control system that evolves dynamically between a more hierarchical and a more heterarchical control architecture, based on the self-organization concept.

The rest of the paper is organized as follows: Section 2 overviews the holonic rationale applied to complex manufacturing systems. Section 3 illustrates the use of biological theories, namely the swarm intelligence principles, to build complex systems exhibiting emergent behavior, and Section 4 discusses how self-organization principles can be applied to achieve self-adaptive, re-configurable and evolvable complex systems. Section 5 summarizes the differences and similarities between the emergence and self-organization concepts, pointing out the benefits of combining them with the holonic rationale. Section 6 presents the ADACOR example of applying biologically inspired concepts to achieve an adaptive production control system. Finally, Section 7 rounds up the paper with the conclusions.
2 Backing to the Roots of Holonic Rationale

The manufacturing domain is, and will remain, one of the main wealth generators of the world economy [14]. The analysis, design and implementation of new and innovative manufacturing systems are therefore of crucial importance in order to maintain those wealth levels and to establish a solid base for economic growth. The manufacturing environment is usually characterized as being:

• Non-linear, since it is based on processes that are regulated by non-linear equations, where effects are not proportional to causes.
• Complex, since the parameters that regulate those processes are interdependent: when one is changed, the others change too, resulting in unstable and unpredictable systems.
• Chaotic, since some processes work as amplifiers, i.e. small causes may provoke large effects. For example, the occurrence of a small disturbance in a machine may affect the system's productivity. Additionally, as some disturbance effects can remain in the system after the disturbance is resolved, its occurrence may have a severe impact on the performance of manufacturing systems.
• Uncertain, i.e. decisions are made based on incomplete and inaccurate information and their execution is subject to uncertainty, e.g. the occurrence of deviations.

As a result, manufacturing systems are usually unpredictable and very difficult to control. In parallel, manufacturing is constantly subject to market pressures that demand customized products with shorter delivery times, which requires the ability to respond and adapt promptly and efficiently to changes. This observation is illustrated by the current tendency to reduce batch sizes as a consequence of the mass customization era.
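The "chaotic" property listed above, where small causes provoke large effects, can be illustrated numerically. The sketch below is not a manufacturing model: it uses the logistic map, a standard textbook example of a non-linear process, as a stand-in. The function names and the parameter values (r = 4.0 and r = 2.5, an initial perturbation of 1e-9, 60 steps) are assumptions chosen only for illustration.

```python
def logistic_trajectory(x0, r, steps):
    """Iterate the non-linear map x_{n+1} = r * x_n * (1 - x_n)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def gaps(x0, delta, r, steps):
    """Distance, step by step, between two runs whose starts differ by delta."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + delta, r, steps)
    return [abs(u - v) for u, v in zip(a, b)]


# Same tiny disturbance, two regimes of the same equation (illustrative values).
chaotic = gaps(0.2, 1e-9, 4.0, 60)   # amplifying regime
stable = gaps(0.2, 1e-9, 2.5, 60)    # damping regime

print("largest gap, r=4.0:", max(chaotic))
print("final gap,   r=2.5:", stable[-1])
```

In the amplifying regime the initial disturbance of one part in a billion grows until the two runs are unrelated, whereas in the damping regime it dies out. This is the sense in which a small machine disturbance may or may not propagate to the productivity of the whole system.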
These observations show that manufacturing systems are complex adaptive systems working in a very dynamic and demanding environment with limited, or bounded, rationality [15]. Bearing in mind the particularities of the manufacturing domain, suitable paradigms are required to address this challenge. HMS is a paradigm, developed under the international Intelligent Manufacturing Systems (IMS) collaborative research programme, which translates the concepts developed by Arthur Koestler into a set of appropriate concepts for the manufacturing domain.

Koestler introduced the word holon to describe a basic unit of organization in living organisms and social organizations [16], based on Herbert Simon's theories and on his own observations. Simon observed that complex systems are hierarchical systems formed by intermediate stable forms (see the parable of the watchmakers¹), which do not exist as self-sufficient and non-interactive elements but, on the contrary, are simultaneously a part and a whole. In fact, complex systems evolve from simple systems much more rapidly if there are stable intermediate forms than if there are not, i.e. if they are hierarchically organized. Simon's observation becomes even more applicable as the product or the environment becomes more complex. Koestler concluded that, although it is easy to identify sub-wholes or parts, wholes and parts in an absolute sense do not exist anywhere. The word holon, proposed by Koestler, represents this hybrid nature, being a combination of the Greek word holos, which means whole, and the suffix on, which means particle.

holon = holos (whole) + on (particle)

Koestler also identified important properties of a holon:

• Autonomy, where the stability of the holons, i.e. holons as stable forms, results from their ability to act autonomously in case of unpredictable circumstances.
• Self-reliance, meaning that the holons are intermediate forms providing a context for the proper functionality of the larger whole.
• Cooperation, which is the ability of holons to cooperate, transforming them into effective components of bigger wholes.

The holonic theory reconciles the holistic and reductionist approaches, describing their simultaneous application. The reductionist approach states that a complex system is nothing but the sum of its parts, and that an account of it can be reduced to accounts of its individual constituents. Holism, on the other hand, is the idea that not all properties of a given system can be determined or explained by its parts alone; instead, the system as a whole determines in an important way how the parts behave.

¹ The parable tells the story of two excellent watchmakers, Tempus and Hora. While Hora gets richer and richer, Tempus gets poorer and poorer. A team of analysts visits both shops and notices the following. Both watches consist of 1000 parts, but Tempus designed his watch such that, when he had to put down a partly assembled watch, it immediately fell into pieces and had to be reassembled from the basic elements. Hora had designed his watches so that he could put together subassemblies of about ten components each. Ten of these subassemblies could be put together to make a larger subassembly. Finally, ten of the larger subassemblies constituted the whole watch. Each subassembly could be put down without falling apart [17].
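The advantage of stable intermediate forms in the watchmakers parable can be made concrete with a little arithmetic. The sketch below is only illustrative and rests on assumptions that are not part of the parable as told here: each part placement is interrupted with a fixed probability (the hypothetical P_INTERRUPT), and an interruption destroys only the unit currently being assembled. Under these assumptions, the expected number of placements needed to complete n consecutive placements is ((1 - p)^(-n) - 1) / p, the standard result for runs of consecutive successes.

```python
def expected_placements(n_parts: int, p_interrupt: float) -> float:
    """Expected number of part placements needed to complete one unit that
    requires n_parts consecutive uninterrupted placements, when every
    placement is interrupted with probability p_interrupt and an
    interruption destroys the unfinished unit."""
    q = 1.0 - p_interrupt  # probability that a single placement succeeds
    return (q ** -n_parts - 1.0) / p_interrupt


# Assumed interruption probability (hypothetical, chosen for illustration).
P_INTERRUPT = 0.01

# Tempus: one monolithic assembly of 1000 parts.
tempus = expected_placements(1000, P_INTERRUPT)

# Hora: 100 + 10 + 1 = 111 stable subassemblies of 10 components each.
hora = 111 * expected_placements(10, P_INTERRUPT)

ratio = tempus / hora
print(f"Tempus: {tempus:,.0f} placements per watch")
print(f"Hora:   {hora:,.0f} placements per watch")
print(f"Tempus needs roughly {ratio:,.0f} times more work")
```

With this assumed interruption rate the hierarchical design is faster by about three orders of magnitude; the exact factor depends strongly on the chosen probability.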
In a manufacturing environment, a holon can represent a physical or logical activity, such as a robot, a machine, an order, a flexible manufacturing system, or even an operator. The holon has a partial view of the world, containing information about itself and the environment. It may comprise an information processing part and a physical processing part, the latter only if the holon represents a physical device, such as an industrial robot [18], as illustrated in Fig. 1.
Fig. 1. Constitution of a Holon
Koestler defines a holarchy as a hierarchically organized structure of self-regulating holons that function first as autonomous wholes in supra-ordination to their parts, secondly as dependent parts in sub-ordination to controls on higher levels, and thirdly in coordination with their local environment [16]. The HMS is a holarchy that integrates the entire range of manufacturing activities, combining the best features of hierarchical and heterarchical organization, i.e. it preserves the stability and predictability of hierarchy while providing the dynamic flexibility and robustness of heterarchy. In HMS, the behaviors and activities of the holons are determined through cooperation with other holons, as opposed to being determined by a centralized mechanism.

Considering the Janus² effect, i.e. that holons are simultaneously self-contained wholes to their subordinated parts and dependent parts when seen from the higher levels, it is possible to recursively decompose a holon into several other holons, which reduces the complexity of the problem. As an example, illustrated in Fig. 2, a holon belonging to a certain holarchy may represent a manufacturing cell, being simultaneously the whole, encapsulating holons that represent the cell resources, and the part, when considering the shop floor system. The implementation of the HMS concepts, which mainly focus on a high level of abstraction, can be done using agent technology, which is appropriate for implementing modularity, decentralization, re-use and complex structures [19].

² Janus was a Roman god with two faces: one looking forward and the other looking back. In this context, one side is looking “down” and acting as an autonomous system giving directions to “lower” components, and the other side is looking “up” and serving as a part of a “higher” holon.
[Figure 2 labels: manufacturing cell; holon; holon; resource]
Fig. 2. Holarchy and the Janus Effect of a Holon
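The recursive decomposition behind the Janus effect can be sketched as a simple tree structure in which every node is at once a whole (towards its parts) and a part (towards its parent). The class and method names below (Holon, add, is_whole, is_part) are hypothetical and chosen only for illustration; they are not taken from any HMS implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Holon:
    """Illustrative holon: a whole for its parts and a part of its parent."""
    name: str
    parts: List["Holon"] = field(default_factory=list)
    parent: Optional["Holon"] = field(default=None, repr=False)

    def add(self, part: "Holon") -> "Holon":
        """Encapsulate another holon and return it for convenience."""
        part.parent = self
        self.parts.append(part)
        return part

    def is_whole(self) -> bool:
        """Face looking 'down': the holon encapsulates other holons."""
        return bool(self.parts)

    def is_part(self) -> bool:
        """Face looking 'up': the holon belongs to a higher-level holon."""
        return self.parent is not None

    def depth(self) -> int:
        """Number of levels between this holon and the top of the holarchy."""
        return 0 if self.parent is None else 1 + self.parent.depth()

    def flatten(self) -> Iterator["Holon"]:
        """Walk the holarchy recursively, parents before their parts."""
        yield self
        for part in self.parts:
            yield from part.flatten()


# Example holarchy: shop floor > manufacturing cell > resources.
shop_floor = Holon("shop floor")
cell = shop_floor.add(Holon("manufacturing cell"))
robot = cell.add(Holon("robot"))
machine = cell.add(Holon("machine"))

for holon in shop_floor.flatten():
    print("  " * holon.depth() + holon.name,
          "| whole:", holon.is_whole(), "| part:", holon.is_part())
```

In this sketch the manufacturing cell is the only holon for which both faces are true at once, which mirrors the example of Fig. 2.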
In spite of their promising perspectives, the industrial adoption of these approaches has so far fallen short of expectations, and the functionalities implemented are normally restricted [20]. This weak adoption is due, amongst other things, to questions related to the required investment, the acceptance of distributed thinking, technology maturity and engineering development tools, and to more technical issues such as interoperability and scalability [11; 20]. However, one important reason contributes significantly to this weak adoption: the traditional design of holonic systems fails to apply the self-adaptation and self-organization properties that make systems increasingly reconfigurable, adaptive, organized and efficient.

The challenge in overcoming this problem is to go back to the roots of holonics provided by Koestler and to look for other sources of inspiration that enhance the holonic rationale in order to design more adaptive and evolvable systems that can be easily deployed into real environments. The vision sustained in this work is that, as illustrated in Fig. 3, besides the infrastructure technologies that support the ubiquitous, modular and distributed features inherent to these applications, the engineering of emergent and evolvable complex systems will combine existing distributed collaborative paradigms, such as the holonic rationale, with bio-inspired theories, such as emergent behavior and self-organization, in connection with emergent fields of computer science, such as Artificial Life.

Artificial Life [21] is a discipline that studies natural life in artificial environments, e.g. through simulations using computer models, in order to understand such complex systems. Note that Artificial Life is neither the same as Artificial Intelligence nor a part of it: the latter is mostly concerned with perception, cognition and the generation of actions, whereas the former focuses on evolution, reproduction, morphogenesis and metabolism processes [22]. The following sections discuss some concepts and mechanisms found in biology and nature, and illustrate how they can be combined with the holonic rationale to build complex systems that behave in a simple way, as occurs in nature.
[Figure 3 panels: Infrastructure technologies (wireless sensor networks, RFID); Collaborative control paradigms (HMS, MAS, SoA); Biologically inspired techniques (self-organization, emergent behavior, swarm intelligence)]
Fig. 3. Engineering of Complex Distributed Adaptive Systems
3 Emergent Behavior in Complex Systems

Biology and nature offer plenty of powerful mechanisms, refined by millions of years of evolution, to handle emergent and evolvable environments. In nature, almost everything is distributed: complex systems are built upon entities that exhibit simple behaviors and have reduced cognitive abilities, and a small number of rules can generate systems of surprising complexity [23]. As an example, a single ant or bee exhibits very simple behavior, but their colonies exhibit smart and complex behavior. The emergence concept reflects this phenomenon and describes the way complex systems arise out of a multiplicity of interactions among entities exhibiting simple behavior. Emergent behavior involves a two-level structure, whose levels are closely interdependent:

• Macro level, which considers the system as a whole, i.e. the global patterns of organization that result from the lower-level interactions.
• Micro level, which considers the system from the point of view of the local components and their interactions.

Emergent behavior occurs without the guidance of a central entity, and only when the resulting behavior of the whole is greater and much more complex than the sum of the behaviors of its parts [23]. In the manufacturing domain, a typical example of emergent behavior is a robot able to perform pick-and-place operations, which results from aggregating a robot that can move in space with grippers that can be opened and closed. Broadly, in the characterization of emergent structures, patterns and properties, three aspects should be considered:
• More than the sum of effects, which means that the emergent properties are not just the predictable result of summing the properties of the individual parts.
• Supervenience, which means that emergent properties are novel, additional or unexpected, and will no longer exist if the micro level is removed (i.e. the emergent properties are irreducible to properties of the micro level).
• Causality, which means that the macro level properties should have causal effects on the micro level ones, known as downward causation (i.e. emergent properties are not epiphenomenal).

During the operation of a system exhibiting emergent behavior, a large number of non-linear interactions occur among the individual entities, leading to an overall behavior that is complex and difficult to predict, due to the large number of possible non-deterministic ways in which the system can behave. In spite of this unpredictability, when dealing with emergent behavior it is desirable to ensure that the expected properties actually emerge, and that unexpected and undesired properties do not.

Swarm intelligence, a concept found in colonies of insects, exhibits this emergent behavior, being defined as the emergent collective intelligence of groups of simple and single entities [24]. In fact, a swarm is typically made up of a community of simple entities, following very simple rules and interacting locally with one another and with their environment. This bottom-up approach offers an alternative way of designing intelligent systems, in which the traditional centralized, pre-programmed control is replaced by a distributed operation where the interactions between individuals lead to the emergence of "intelligent" global behavior, unknown to them [24]. Examples of swarm intelligence include ant colonies, bird flocking, fish shoaling and bacterial growth [13].
A more widespread example of the application of the swarm intelligence principles is Wikipedia: a huge number of people contribute their individual knowledge to the encyclopedia; no single person knows everything, but collectively far more is known than could be expected.

In these environments, swarm intelligence is achieved more through the coordination of activities and less through the use of decision-making mechanisms. A typical example is the movement of a group of birds, where individuals coordinate their movements according to the movement of the others. For this purpose, simple mechanisms are used to coordinate the individual behavior in order to achieve the global one: the resulting structure is essentially a highly nonlinear configuration (i.e. with many-to-many interactions), in which feedback processes (both positive and negative) interact. Positive and negative feedback are of crucial importance in these systems: in the case of positive feedback, the system responds to a perturbation in the same direction as the perturbation, and in the case of negative feedback, the system responds in the opposite direction.

Translating these ideas to the manufacturing world, manufacturing systems can be seen as a community of autonomous and cooperative entities, the holons in the HMS paradigm, each one regulated by a small number of simple rules and representing a manufacturing component, such as a robot, a conveyor, a pallet or an order. The degree of complexity of the behavior of each entity is strongly dependent on its embodied intelligence and learning skills. Very complex and adaptive systems can emerge from the interaction between the individual entities, as illustrated in Fig. 4.
Fig. 4. Emergence in Complex Systems
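The coordination of a group of birds through simple local rules can be illustrated with a minimal sketch. It assumes, purely for illustration, that the birds are arranged in a ring, that each one perceives only its two immediate neighbours, and that it corrects its heading by a fixed fraction (the hypothetical parameter gain) of its deviation from the local mean, which is a negative feedback rule. None of these modelling choices come from the works cited above.

```python
from typing import List


def step(headings: List[float], gain: float = 0.5) -> List[float]:
    """One synchronous update: every bird moves its heading towards the mean
    heading of itself and its two ring neighbours (negative feedback on the
    deviation from the local mean)."""
    n = len(headings)
    updated = []
    for i, heading in enumerate(headings):
        local_mean = (headings[i - 1] + heading + headings[(i + 1) % n]) / 3.0
        updated.append(heading + gain * (local_mean - heading))
    return updated


def spread(headings: List[float]) -> float:
    """Macro-level disorder: difference between the extreme headings."""
    return max(headings) - min(headings)


# Micro level: eight birds with arbitrary initial headings (in degrees).
headings = [10.0, 80.0, 35.0, 60.0, 5.0, 90.0, 45.0, 20.0]
initial_spread = spread(headings)
initial_mean = sum(headings) / len(headings)

for _ in range(50):
    headings = step(headings)

final_spread = spread(headings)
final_mean = sum(headings) / len(headings)
print(f"spread of headings: {initial_spread:.1f} -> {final_spread:.3f} degrees")
```

No bird knows the global mean heading, yet the flock converges towards it. Reversing the sign of gain would turn the rule into positive feedback, and the deviations would be amplified instead of damped.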
The emergent behavior achieved results from the capability of the individual entities to change their properties dynamically and autonomously, in a way that is coordinated towards a common goal. In fact, even if all individuals perform their tasks, the sum of their activities could be chaos (disorder) if they are not coordinated according to a common goal [25]. Also, the emergent behavior will not be smart if the members of the micro level imitate one another or wait for someone to tell them what to do. Since no one is in charge (i.e. there is no central control), each member should do its own part, its role being important for the whole.

In manufacturing, the coordination of these systems, for example for task allocation, is usually related to the regulation of the expectations of entities with conflicting interests: some entities (usually products or orders) have operations to be executed and others (usually resources) have the skills to execute them. Several algorithms can be used for this purpose, namely those based on the Contract Net Protocol (CNP) [26], those based on market laws [27] and those based on the attraction fields concept [28].

Systems exhibiting emergent behavior, as observed in nature, operate in a very flexible and robust way [24]:

• Flexible, since they allow adaptation to changing environments by adding, removing or modifying entities on the fly, i.e. without the need to stop, re-program and re-initialize the other components.
• Robust, since the society of entities is able to keep working even if some individuals fail to perform their tasks.

The achievement of emergent systems guarantees the fulfillment of the flexibility and robustness requirements. A step ahead in designing these complex adaptive systems is related to how the system can evolve to adapt quickly and efficiently to the volatility of the environment, addressing the responsiveness property.
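The announce, bid and award cycle that underlies Contract Net style task allocation can be sketched as follows. This is a deliberately simplified illustration and not the full protocol of [26]: the class ResourceHolon, the function allocate, the skills table and the bidding rule (earliest estimated completion time) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ResourceHolon:
    """Hypothetical resource: skills maps an operation to a processing time."""
    name: str
    skills: Dict[str, float]
    busy_until: float = 0.0

    def bid(self, operation: str) -> Optional[float]:
        """Answer an announcement with an estimated completion time,
        or refuse (None) if the skill is missing."""
        if operation not in self.skills:
            return None
        return self.busy_until + self.skills[operation]

    def award(self, operation: str) -> None:
        """Accept the contract and reserve the required capacity."""
        self.busy_until += self.skills[operation]


def allocate(operation: str, resources: List[ResourceHolon]) -> Optional[str]:
    """One negotiation round run by an order holon: announce the operation,
    collect the bids and award it to the earliest completion time."""
    proposals = []
    for resource in resources:
        offer = resource.bid(operation)
        if offer is not None:
            proposals.append((offer, resource.name, resource))
    if not proposals:
        return None  # no resource has the required skill
    _, _, winner = min(proposals, key=lambda p: (p[0], p[1]))
    winner.award(operation)
    return winner.name


resources = [
    ResourceHolon("mill_1", {"milling": 4.0, "drilling": 3.0}),
    ResourceHolon("mill_2", {"milling": 5.0}),
    ResourceHolon("drill_1", {"drilling": 2.0}),
]
operations = ["milling", "milling", "drilling", "milling", "turning"]
allocation = [allocate(op, resources) for op in operations]
print(allocation)
```

The load is balanced across the resources although no entity holds a global schedule: each allocation emerges from local bids that reflect only the state of the bidder.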
4 Evolution and Self-organization in Complex Systems

Evolution is the process of change, namely development, formation or growth, over generations, leading to a more advanced or complex form. The evolution phenomenon is observed in several domains, such as biology, mathematics and economics. In biological systems there are two different approaches to adaptation to the dynamic evolution of the environment [29]: evolutionary systems and self-organization. The next sections discuss the concepts of evolution and self-organization to answer the question of how complex systems can be adaptive and evolvable.

4.1 Evolutionary Theory

The evolutionary approach derives from the theory of evolution introduced by Charles Darwin 150 years ago in his book “The origin of species” [30]. According to Darwin, nature is not immutable but, on the contrary, is in a state of permanent transformation, a continuous movement in which species change from generation to generation, evolving to suit their environment. The mechanism of evolution proposed by Darwin, natural selection, is based on the following points:

• Since populations tend to produce more descendants than will survive, the individuals of a given population struggle for survival (for food, space and other environmental factors).
• Individuals that have more favorable characteristics (i.e. more suitable for the conditions in which they live) live longer and reproduce more and, as such, their characteristics are passed on to the next generation. On the contrary, individuals that do not have advantageous features are progressively eliminated. Only the fittest survive.
• Differentiated reproduction allows, through a slow accumulation of characteristics, the emergence of new species, with specific features being retained or eliminated depending on the goal or intention.
Basically, Darwin saw evolution as the result of selection by the environment acting on a population of organisms competing for resources. In a frequently cited paraphrase of his theory, the species that survive evolution and changes in the environment are not the strongest or the most intelligent, but those that are most responsive to change. In this evolution process, the selection is natural in the sense that it is purely spontaneous, without a predefined plan.

The punctuated equilibrium theory introduced by Stephen Jay Gould and Niles Eldredge [31] constitutes an advance on Darwin's theory of evolution. In contrast to the Darwinian theory, where evolution is a slow, continuous process without sudden jumps, evolution in punctuated equilibrium tends to be characterized by long periods in which nothing changes, "punctuated" by episodes of very fast development of new forms.

Later, Darwinian natural selection was combined with Mendelian inheritance (i.e. a set of principles relating to the transmission of hereditary characteristics from parents to their offspring) to form the modern evolutionary synthesis, connecting the units of evolution (genes) with the mechanism of evolution (natural selection).
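As a rough computational illustration of selection acting on a population with inherited variation, the sketch below implements a toy genetic algorithm. It is not a model of biological evolution: the bit-string individuals, the fitness function (number of ones), the truncation selection and all parameter values are assumptions chosen only to make the three ingredients (selection, inheritance and variation) visible.

```python
import random
from typing import List

Individual = List[int]


def fitness(individual: Individual) -> int:
    """Assumed measure of suitability to the environment: number of ones."""
    return sum(individual)


def evolve(pop_size: int = 20, length: int = 16, generations: int = 30,
           mutation_rate: float = 0.05, seed: int = 0) -> List[int]:
    """Run a toy genetic algorithm and return the best fitness observed in
    the initial population and after every generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(generations):
        # Selection: only the fitter half of the population reproduces.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # Inheritance: one-point crossover between two surviving parents.
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]
            # Variation: each gene may flip with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = survivors + children
        history.append(max(fitness(ind) for ind in population))
    return history


history = evolve()
print("best fitness per generation:", history)
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next; how quickly it improves depends on the random seed and on the assumed parameters.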
Translating these theories to the manufacturing world, the companies that are better prepared to survive in the current competitive markets are those that respond better to emergent and volatile environments, by dynamically adapting their behavior [8]. Complex manufacturing systems should evolve continuously or in punctuated jumps, driven by stimuli that force their re-organization and adaptation to environmental conditions. Evolutionary techniques, which belong to the field of evolutionary computing, can be applied to the distributed entities in order to gradually select a better system.

Evolutionary computing is a class of computational techniques that use the Darwinian principles of biological evolution and natural selection to solve complex problems, namely combinatorial optimization problems. Evolutionary algorithms (e.g. genetic algorithms, evolutionary strategies and genetic programming) and swarm intelligence (concretely, the Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO) algorithms, which are designed using the swarm principles) are evolutionary techniques. In a similar way, neural network algorithms are a typical example of computing techniques that use insights from the central nervous system of animals. Here the focus is more on learning than on natural selection.

4.2 Self-organization

In biological systems, self-organization plays an important role in achieving the adaptation of the system to the dynamic evolution of the environment. Self-organization is basically a process of evolution where the effect of the environment is minimal, i.e. where the development of novel, complex structures takes place primarily through the system itself (as the term “self” suggests), being normally triggered by internal variation processes, which are usually called "fluctuations" or "forces". Self-organization is not a new concept, and has been applied in different domains such as economics, sociology, computing and robotics. Several distinct, but not necessarily contradictory, definitions are found in the literature (e.g. see [32] and the references therein). However, to better understand the self-organization phenomenon it is necessary to identify some important types of self-organization observed in nature [33]: stigmergy, decrease of entropy and autopoiesis.
[Figure 5 labels: action (deposit, reinforcement); perception; pheromone; flow field gradient]
Fig. 5. Ant-based Interaction (adapted from [34])
Stigmergy, a phenomenon found in the behavior of social insects, is a form of self-organization involving an indirect coordination between entities, where the trace left in the environment stimulates the execution of a subsequent action, by the same or a different entity. As an example, ants exchange information by depositing pheromones (i.e. chemical substances that spread odor) on their way back to the nest when they have found food, as illustrated in Fig. 5. The flow field gradient is characterized by the reduction of the intensity of the odor and the increase of the entropy (i.e. reduction of information). This form of self-organization produces complex, intelligent and efficient collaboration, without the need for planning, control or even direct communication between the entities.

In the field of thermodynamics, systems that are constantly exchanging energy (i.e. receiving, transforming and dissipating it) exhibit a self-organizing behavior far from thermodynamic equilibrium, decreasing their entropy (i.e. disorder) when an external pressure is applied, e.g. temperature, and thus reaching a new stable state [35]. In fact, the 2nd law of thermodynamics introduces the concept of entropy (from the Greek entrope, meaning change), stating that everything in the universe tends to go from states of order towards states of chaos (in other words, the quality of the energy is irreversibly degraded). This explains why pieces of ice tend towards room temperature in the absence of some outside energy source (e.g. a refrigerator).

In the nature of living systems, autopoiesis, which literally means self-creation, represents the capability of a system to maintain itself through the self-generation of the system's entities (i.e. their transformation and destruction), for example through cell reproduction [36]. Autopoietic systems are closed systems characterized by a changing structure but an invariant organization. Self-organization appears as a result of the interaction of the system's entities, which (re)generate themselves in order to survive in their given environment.

Analyzing the above types of self-organization found in nature, some differences can be identified in spite of the similarity of the concepts. As an example, comparing stigmergy with the decrease of entropy, it is possible to observe an interesting difference in the trigger mechanism: in stigmergy, the initiative occurs within the system (the pheromones deposited by the system members), whereas in the decrease of entropy, the initiative occurs as a response to external pressures (e.g. an increase or decrease of the temperature).

Bearing in mind the previously described types of self-organization, a possible definition, used in this work, is to consider self-organization as the ability of an entity or system to adapt its behavior dynamically to external conditions by re-organizing its structure through the modification of the relationships among entities, without external intervention and driven by local and global forces.

The integration of self-organization capabilities may require the use of distributed architectures [37], such as those provided by the holonic design, which do not follow a rigid and predictable organization. In fact, autonomous systems, such as our brain, have to constantly optimize their behavior, which involves the combination of non-linear and dynamic processes. The self-organizing behavior of each entity is based on an adaptive mechanism that dynamically interprets and responds to perturbations [37]. These characteristics imply the management and control of behavioral complexity as well. However, it is important to notice that system re-organization may also occur in structures with centralized control.
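The stigmergy mechanism described earlier in this section, in which a trace left in the environment guides subsequent actions, can be illustrated with a small sketch of ants choosing between alternative paths to a food source. It is a deterministic, averaged model rather than a simulation of individual ants, and every modelling choice is an assumption made here: ants split across paths in proportion to the pheromone they perceive, each ant deposits an amount inversely proportional to the path length, and a fixed fraction of the pheromone evaporates at every step.

```python
from typing import List


def simulate(lengths: List[float], steps: int = 200,
             evaporation: float = 0.1, n_ants: int = 100) -> List[float]:
    """Return the final fraction of ants using each path."""
    pheromone = [1.0 for _ in lengths]  # no initial preference
    for _ in range(steps):
        total = sum(pheromone)
        # Perception: ants choose a path in proportion to its pheromone.
        shares = [level / total for level in pheromone]
        # Action: shorter paths are completed more often per unit of time,
        # so they receive more pheromone per ant (deposit and reinforcement).
        deposits = [n_ants * share / length
                    for share, length in zip(shares, lengths)]
        # Environment: part of the trace evaporates, the rest is reinforced.
        pheromone = [(1.0 - evaporation) * level + deposit
                     for level, deposit in zip(pheromone, deposits)]
    total = sum(pheromone)
    return [level / total for level in pheromone]


# Two alternative paths between nest and food; the second is twice as long.
shares = simulate([1.0, 2.0])
print(f"short path: {shares[0]:.3f}, long path: {shares[1]:.3f}")
```

The colony settles on the shorter path although no ant compares the two lengths and no ant communicates directly with another: the coordination is carried entirely by the trace in the environment.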
Evolution and self-organization are easily confused. Some authors, such as Stuart Kauffman, state that natural selection must be complemented by self-organization in order to explain evolution [38]. In Darwin's theory of evolution, evolution is the result of selection by the environment acting on a population of organisms competing for resources, i.e. it encompasses external forces, while in self-organization, evolution is purely due to the internal configuration, without any external forces or pressures.

Translating the self-organization concept found in nature to the manufacturing domain, the network of entities that represents the control system is established by the entities themselves. Ideally, re-configuration should be done on the fly, leaving the behavior of the entire system unchanged, so that it continues to run smoothly after the change, as illustrated in Fig. 6.
Fig. 6. Evolution in Manufacturing Complex Adaptive Systems
In the manufacturing domain, the need for re-configuration and evolution can appear in several situations. In particular, self-organization can contribute to the design of adaptive manufacturing systems in the following main areas [28]:

• Shop floor layout, where the manufacturing resources present in the shop floor are movable, i.e. the producer and transporter resources move physically in order to minimize the transportation distances.
• Adaptive control, where the goal is to find an adaptive and dynamic production control strategy based on dynamic, on-line scheduling, which is adapted when unexpected disturbances occur.
• Product demand, where the manufacturing system re-organizes itself in order to adapt to changes in the product demand, increasing or reducing the number of manufacturing resources, or modifying their capabilities.

Other manufacturing related domains can also be mentioned, such as supply chain optimization, virtual organizations management and logistics management, which require the frequent re-organization of partners in order to achieve optimization and responsiveness to unexpected situations.

Driving forces guide the re-organization process according to the environment conditions and to the control properties of the distributed entities. Several self-organization mechanisms can be found in nature [39]: foraging, nest building, molding and aggregation, morphogenesis, web weaving, brood sorting, flocking and quorum. Each of these mechanisms presents different driving forces to support the evolution. For example, in foraging the driving force is the deposit of pheromones, and in flocking and schooling the self-organization is driven by collision avoidance, speed matching and flock centering.

Embodied intelligence, a concept used in the Artificial Life field, may play another important role in the design of these systems. Embodied intelligence suggests that intelligence requires a body to interact with [40], with intelligent behavior emerging from the interaction of brain, body and environment. The key issue is to define powerful intelligence mechanisms, including not only static intelligence mechanisms but also learning capabilities, that enable the system to behave better in the future as a result of its experience and knowledge. The learning capability is not just a prerogative of humans (whose brain comprises around 50 billion neurons) or higher animals: it also occurs in worms (which have a simple 302-cell nervous system) and even in single-celled bacteria. The degree of efficiency of the self-organization capability, and consequently the improvement of the entity's performance when evolving dynamically in case of emergency, is strongly dependent on how the learning mechanisms are implemented. Learning capabilities play an important role in the evolution process, e.g. in the identification of re-configuration opportunities and in the way the system evolves.

4.3 Equilibrium and Stability in the Evolution Process

In dynamic and complex systems, in which emergence and evolution play key roles, besides considering mechanisms to identify the re-configuration and evolution opportunities, it is important to discuss equilibrium, stability and predictability during the evolution process.
Evolution and equilibrium are apparently contradictory concepts: evolution is related to how things change over time, and equilibrium is related to how things attain a steady and balanced state of being. Systems exhibiting evolution features imply non-equilibrium dynamics and, in certain situations, specific driving forces will push the system far away from equilibrium. However, equilibrium is different from stability. The concept of stability concerns the condition in which a slight disturbance or modification in the system does not produce a significant disrupting effect on that system. This is especially important in chaotic and nonlinear systems, where a small perturbation may cause a large effect (see the butterfly effect³), and consequently make the system unstable.

In evolution processes, like those resulting from emergent behavior, positive and negative feedback are crucial to fuel them: the former increases the number of configurations and the latter stabilizes these configurations. The interaction between them may create intricate and unpredictable patterns (chaos), which can develop very quickly until a stable configuration (known as an attractor) is reached, according to a goal or objective. On the other hand, during the evolution process some instability and unpredictability can appear as the result of evolution processes that are not properly synchronized.

³ Introduced by Edward Lorenz to illustrate the notion of sensitive dependence on initial conditions in chaos theory: a butterfly flapping its wings in one part of the world (e.g. in Chicago) can contribute to the evolution of a tornado in another part of the world (e.g. in Tokyo).
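Sensitive dependence on initial conditions can be observed numerically in very small systems. The sketch below uses the logistic map, a standard textbook example of a chaotic system that is introduced here only for illustration and is not discussed elsewhere in this paper; the parameter value r = 4 and the initial conditions are likewise assumptions.

```python
from typing import List


def logistic_orbit(x0: float, r: float = 4.0, steps: int = 60) -> List[float]:
    """Iterate the logistic map x -> r * x * (1 - x) starting from x0."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


# Two initial states that differ by one part in a billion.
reference = logistic_orbit(0.2)
perturbed = logistic_orbit(0.2 + 1e-9)

# Distance between the two trajectories at every step.
gaps = [abs(a - b) for a, b in zip(reference, perturbed)]

for step_index in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {step_index:2d}: gap = {gaps[step_index]:.3e}")
```

The gap between the two trajectories stays negligible for the first few iterations and then grows until it is of the same order as the state itself, after which the two systems behave as if they were unrelated.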
The phenomenon of emergence is associated to the tendency of systems to create order from chaos (concept known as extropy in opposite to entropy). In such dissipative systems, the system self-organize into an ordered state since this actually increases the rate of entropy production (as greater is the energy that flows in such systems as greater is the order generated). In fact, a self-organizing system which decreases its entropy must necessarily, in analogy to the 2nd law of thermodynamics, dissipate such entropy to its surroundings. Note that the order can also be regarded as the quantity of information available. Regulation mechanisms are crucial to support the emergence from chaos to order during the evolution process, achieving stability and avoiding the increase of entropy and consequently the chaotic or instable states. Although being chaotic and unpredictable, evolution moves preferentially in the direction of increasing a fitness objective, which depends of the system context and strategy: e.g. the objective can be to reduce the thermo-dynamical energy or to increase the system’s productivity. The evolution process, i.e. the achieved organization, must be evaluated, according to a specific criterion, if the achieved organization solution is better than the previous one [41]. In the complex systems described in the paper, each individual has partial knowledge, i.e. none of them has a global view of the system, introducing uncertainty in the system. Being the entropy a measure of uncertainty of information or ignorance, these systems are normally associated to disorder (chaos). Using mechanisms that combine efficiently the individual knowledge hosted by distributed entities, it is possible to reduce the entropy of the system, becoming the system complex, organized and ordered. 
Trust-based systems and reputation systems, which take their inspiration from human behavior, can be suitable for handling information uncertainty and can be associated with emergence and re-organization processes.
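The reading of entropy as a measure of uncertainty can be made concrete with Shannon's formula, H = -Σ p log2 p. The following is a minimal sketch under stated assumptions: the scenario (a failure located in one of four machines) and both probability distributions are hypothetical and are not taken from the paper. It only illustrates that combining partial knowledge held by distributed entities lowers the entropy of the shared belief.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical belief about which of four machines has failed.
# A single entity with no information holds a uniform belief.
isolated_belief = [0.25, 0.25, 0.25, 0.25]

# Assumption: after combining observations from other entities,
# two of the four machines have been ruled out.
combined_belief = [0.5, 0.5, 0.0, 0.0]

entropy_before = shannon_entropy(isolated_belief)
entropy_after = shannon_entropy(combined_belief)
```

In this toy case the uncertainty drops from two bits to one bit, which is the sense in which combining individual knowledge reduces the entropy of the system.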
5 Combining Emergence and Self-organization Concepts
Emergence and evolution, especially self-organization, are two different concepts that are often incorrectly treated as synonyms in the literature. In spite of their similarities, the difference between self-organization and emergence should be stated: both lead to systems that evolve over time and cannot be directly controlled from the exterior, but while emergent systems consist of a set of individuals that collaborate to exhibit a higher-level behavior, self-organized systems exhibit a goal-directed behavior. Additionally, both exhibit robustness, but in different ways [42]:
• In emergence, robustness is related to the flexibility of the local components that contribute to the emergent properties (i.e. the failure of one component will not result in the complete failure of the emergent property).
• In self-organization, robustness is related to the capability to adapt dynamically to change.
The emergent behavior and the evolution capability can be exhibited independently or combined. This emergence/self-organization relationship can be expressed in the bi-dimensional approach to complex evolvable systems illustrated in Fig. 7.
P. Leitao
[Figure 7: one axis is labeled "evolution"]
Fig. 7. Bi-Dimensional Approach to Complex Evolvable Systems
Traditional centralized and rigid control systems exhibit neither self-organization nor emergent behavior, and consequently they are unable to re-organize to adapt to environmental changes. These systems are not sufficient to respond to the current demands for flexibility, responsiveness and re-configurability. Self-organization can appear without emergence, in the so-called evolvable systems of Fig. 7, essentially when the system works under central or strictly hierarchical control. In fact, new structural patterns can be identified and adopted under central control to respond to changes. On the other hand, it is also possible to build systems that exhibit emergence without self-organization, the so-called emergent systems in Fig. 7. In this case, the emergent behavior results from the interaction between distributed entities, but the system as a whole is unable to self-organize to face changes. The most interesting systems are those that exhibit self-organization and emergent behavior simultaneously. Here, the system works under decentralized control that emerges from the interactions among individual entities, which are autonomous, active and responsive to change, leading to the dynamic self-organization of the system. The application of self-organization associated with emergent behavior allows achieving [32]:
• Dynamic self-configuration, i.e. the adaptation to changing conditions by changing its own configuration, permitting the addition/removal of resources on the fly and without service disruption.
• Self-optimization, i.e. tuning itself in a pro-active way to respond to environmental stimuli.
• Self-healing, i.e. the capacity to diagnose deviations from normal conditions and to take proactive actions to normalize them and avoid service disruptions.
The holonic and multi-agent applications, according to the bi-dimensional approach of Fig. 7, normally address the first dimension (i.e. the emergent behavior) but rarely consider the second one (i.e. the evolution and self-organization). For example, self-organization in multi-agent systems normally occurs according to swarm intelligence principles, using very simple agents and interaction rules.
In these emergent and evolvable environments, of which self-organizing holonic systems are an example, a pertinent question is how emergent behavior and self-organization can be combined. According to Wolf and Holvoet, two different perspectives can be considered [42]:
• Self-organization as the cause, with the emergent behavior resulting from the self-organization of the interactions among components; in this case self-organization is situated at the micro-level of the emergent process (i.e. self-organization leads to emergence).
• Self-organization as the effect, being achieved as a consequence of the emergent behavior (i.e. it is an emergent property); in this case self-organizing behavior occurs at the macro-level (i.e. emergence leads to self-organization).
However, an additional possibility is to have self-organization decoupled from the emergent behavior. In this case, self-organization is placed on top of emergent behavior, like a cherry on top of the cake, arising not only from the self-organization of the interactions among components but also from the self-organization exhibited by the behaviors of local components and from the mechanisms that drive these local self-organization capabilities. In some situations, the emergence of complex adaptive systems requires the reproduction of their members, so that the system evolves over time to improve its fitness, recalling the theory of evolution and introducing the theory of autocatalytic sets. An autocatalytic set is a group of elements that catalyzes the creation of its own elements, which in biology corresponds to the generation of offspring. An autocatalytic set is a system characterized by positive feedback, i.e. the presence of its members increases the rate at which new set elements are created, which in turn increases this rate even further.
An undesirable consequence of positive feedback is that the system becomes locked into the solution that it selects first, and a huge effort is required to switch at a later instant. Here some similarities can be found between Darwin’s theory and Simon’s observations: life forms are not created from scratch; instead, they create small sets of structures that catalyze themselves and are sufficiently stable to survive until the next energy input (i.e. the autocatalytic sets). The autocatalysis process requires autonomy in the members, and its improvement requires cooperation. Recalling the holonic rationale, it can be verified that the notion of holon already considers autonomy and cooperation as important properties of complex adaptive systems. In the holonic rationale, due to the concepts of holons and holarchies, self-organization occurs as an emergent process where order appears from disorder due to simple relations that statistically evolve through complex relations, progressively organizing themselves [33]. More powerful self-organizing holonic systems can be devised using more intelligent agents and more complex interaction patterns between local components. The current challenge for the research community is to investigate the combination of self-organization mechanisms with emergent behavior that enhances the holonic rationale, aiming to achieve emergent and evolvable complex systems.
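The positive feedback that characterizes an autocatalytic set, as described in this section, can be illustrated with a small discrete-time growth model. This is only a sketch under assumptions that do not come from the paper: the function name, the per-member creation rate and the step count are all hypothetical, and real autocatalytic sets are limited by resources that this toy model ignores.

```python
def autocatalytic_growth(initial_members, rate_per_member, steps):
    """Simulate a set whose members catalyze the creation of new members.

    Returns a list of (members, created) pairs, one per step, where
    `created` is the number of new elements produced in that step.
    """
    members = float(initial_members)
    history = []
    for _ in range(steps):
        # Positive feedback: the creation rate is proportional to the
        # number of members already present (hypothetical linear law).
        created = rate_per_member * members
        history.append((members, created))
        members += created
    return history


history = autocatalytic_growth(initial_members=10, rate_per_member=0.1, steps=5)
creation_rates = [created for _, created in history]
```

Each step creates more elements than the previous one, because the elements created earlier themselves contribute to the creation rate.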
6 The ADACOR Example
In the manufacturing domain, a few examples can be cited as attempts to introduce biologically inspired insights. Namely, Valckenaers et al. combined stigmergy concepts with the PROSA architecture to achieve emergent forecasting in manufacturing coordination and control [43], Parunak and Brueckner used stigmergic learning to achieve self-organization in mobile ad-hoc networks [34], and Ulieru et al. used emergence concepts to cover both vertical and horizontal integration in distributed organizations to enable their dynamic creation, refinement and optimization [44]. This section describes the use of concepts inherited from biology, namely swarm intelligence and self-organization, in the ADACOR architecture, to achieve an adaptive production control approach that addresses system re-configurability and evolution, especially when operating in emergent environments. The ADACOR architecture is built upon a community of autonomous and cooperative holons representing manufacturing entities, e.g. robots, pallets and orders. In analogy with insect colonies, where an individual usually does not perform all tasks but rather specializes in a set of tasks [24] (a concept known in biology as division of labour), the ADACOR architecture identifies four manufacturing holon classes, each with its own roles, objectives and behaviors [8]: product (PH), task (TH), operational (OH) and supervisor (SH). The product holons represent the products available in the factory catalogue, the task holons represent the production orders launched to the shop floor to produce the requested products, and the operational holons represent the physical resources available on the shop floor. Supervisor holons provide coordination and optimization services to the group of holons under their supervision, thus introducing hierarchy in a decentralized system.
The modularity provided by ADACOR is similar to that of the Lego™ concept: grouping elementary and inter-connectable entities in a particular way allows bigger and more complex systems to be built. Emergent behavior arises from the interactions between ADACOR holons exhibiting intelligent behavior. The system’s re-configurability or evolution is achieved by the dynamic re-aggregation of the elementary components or systems. Since ADACOR holons are pluggable (i.e. there is no need to re-initialize and re-program the system when a holon is added), the architecture offers enormous flexibility and re-configurability to support emergent behavior on the fly.
6.1 Driving Forces for Self-organization
System self-organization is only achieved if the distributed entities have stimuli that drive their local self-organization capabilities. In ADACOR, the local driving forces to achieve self-organization are the autonomy factor and the learning capability, which are inherent characteristics of each ADACOR holon. The autonomy factor, α, is a parameter that fixes the level of autonomy of each ADACOR holon [8], and evolves dynamically in order to adapt the holon behavior to the changes in the environment where it is placed. The autonomy factor is regulated by a function, α = f(α, τ, ρ), where:
• τ is the reestablishment time, which is the estimated time to recover from the disturbance.
• ρ is the pheromone parameter, which is an indication of the level of impact of the disturbance.
The evolution into a new organization, triggered by the rules described above, is governed by a decision mechanism in which learning mechanisms play a crucial role in detecting evolution opportunities and ways to evolve. The power of the self-organization mechanism is closely related to how the learning mechanisms are implemented and to how new knowledge influences the decision parameters. These two local driving forces (i.e. autonomy and learning) allow the dynamic self-adaptation of the holon, contributing to the re-configuration of the system as a whole. However, the global self-organization of the system is only achieved if global forces drive the local self-organization capabilities. The global driving force used in ADACOR to support the system’s self-organization is a pheromone-like spreading mechanism, recalling the stigmergy concept. The holons cooperating through this type of mechanism propagate the need for re-organization by spreading the information to the other holons, as ants deposit pheromones in the environment. The quantity of pheromone deposited in the neighbor supervisor holon is proportional to the forecasted impact of the disturbance. The holons associated with each supervisor holon sense the information dissipated by the other holons (as ants sense pheromone odors) and, accordingly, trigger a self-adaptation of their behavior (e.g. increasing their autonomy) and propagate the pheromone to other neighbor holons. The intensity of the pheromone odor decreases with the distance from the epicenter of the evolution trigger (similar to distance in the original pheromone techniques), according to a defined flow field gradient.
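The pheromone-like spreading described here can be sketched as a breadth-first propagation over a graph of holons. This is a hedged illustration, not the ADACOR implementation: the holon names, the neighbor topology, the geometric decay factor standing in for the flow field gradient, the cut-off threshold and the proportionality constant of 1 are all assumptions introduced for the example.

```python
from collections import deque


def propagate_pheromone(neighbors, epicenter, forecast_impact,
                        decay=0.5, threshold=0.05):
    """Spread a pheromone from the epicenter to reachable holons.

    The deposited quantity is proportional to the forecasted impact
    (proportionality constant assumed to be 1). Each hop multiplies the
    intensity by `decay`, a hypothetical stand-in for the gradient.
    """
    intensity = {epicenter: forecast_impact}
    queue = deque([epicenter])
    while queue:
        holon = queue.popleft()
        passed_on = intensity[holon] * decay
        if passed_on < threshold:
            continue  # the odor is too weak to be sensed any further
        for neighbor in neighbors.get(holon, ()):
            if neighbor not in intensity:  # keep the first (shortest-path) value
                intensity[neighbor] = passed_on
                queue.append(neighbor)
    return intensity


# Hypothetical topology: operational holons (OH) attached to supervisor holons (SH).
neighbors = {
    "OH1": ["SH1"],
    "OH2": ["SH1"],
    "OH3": ["SH2"],
    "SH1": ["OH1", "OH2", "SH2"],
    "SH2": ["SH1", "OH3"],
}
field = propagate_pheromone(neighbors, epicenter="OH1", forecast_impact=1.0)
```

With these assumed values, the holon where the disturbance occurs holds the full intensity, its supervisor senses half of it, and holons two or three hops away sense a quarter and an eighth respectively.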
The use of pheromone-like techniques for the propagation of information is suitable for the dynamic and continuous adaptation of the system to disturbances, supporting the global self-organization and reducing the communication overhead [8]. A simple implementation of the decision function associated with the autonomy factor can use a fuzzy rule-based engine that considers a simple discrete binary variable for the autonomy factor, comprising the states {Low, High}, and a discrete variable for the pheromone parameter, comprising the states {Very Low, Low, Medium, High, Very High}. In this case, the evolution of the autonomy factor is determined by the following set of simplified rules [32]:
IF (ρ >= HIGH AND α == LOW) THEN α:= HIGH AND evolveIntoNewStructure
IF (ρ >= HIGH AND α == HIGH AND τ == ELAPSED) THEN α:= HIGH AND τ:= value
IF (ρ <= LOW AND α == HIGH AND τ == ELAPSED) THEN α:= LOW AND evolveIntoNewStructure
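The three simplified rules can be transcribed almost literally into code. The sketch below is an assumption-laden illustration rather than the ADACOR engine: it replaces fuzzy inference with crisp comparisons on ordered states, and the names `Pheromone`, `Autonomy`, `update_autonomy` and the `new_tau` value are hypothetical. The thresholds follow the rules exactly as printed.

```python
from enum import IntEnum


class Pheromone(IntEnum):
    VERY_LOW = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


class Autonomy(IntEnum):
    LOW = 0
    HIGH = 1


def update_autonomy(alpha, rho, tau_elapsed, new_tau=10):
    """Apply the three rules; return (alpha, evolve, tau_reset).

    `evolve` is True when the rule fires evolveIntoNewStructure;
    `tau_reset` is the new reestablishment time, or None if unchanged.
    """
    if rho >= Pheromone.HIGH and alpha == Autonomy.LOW:
        return Autonomy.HIGH, True, None
    if rho >= Pheromone.HIGH and alpha == Autonomy.HIGH and tau_elapsed:
        # Disturbance still active: stay autonomous and reinforce tau.
        return Autonomy.HIGH, False, new_tau
    if rho <= Pheromone.LOW and alpha == Autonomy.HIGH and tau_elapsed:
        return Autonomy.LOW, True, None
    return alpha, False, None  # no rule fires; state is unchanged


on_disturbance = update_autonomy(Autonomy.LOW, Pheromone.VERY_HIGH, tau_elapsed=False)
still_active = update_autonomy(Autonomy.HIGH, Pheromone.HIGH, tau_elapsed=True)
recovered = update_autonomy(Autonomy.HIGH, Pheromone.VERY_LOW, tau_elapsed=True)
```

Inputs that match none of the rules, such as a {Medium} pheromone with a {High} autonomy factor, leave the state unchanged in this sketch; how a full fuzzy engine would treat such cases is not specified by the rules.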
Briefly, when the operational holons have a {Low} autonomy factor, they are willing to accept the proposals sent by the supervisor holons. The identification of an opportunity to evolve, for example the arrival of a rush order or the breakdown of a machine, represented by the {Medium, High, Very High} values associated with the pheromone parameter, according to the distance to the epicenter where the disturbance occurred,
triggers the change of the autonomy factor to {High} and a re-organization process. If the autonomy factor is {High} when the reestablishment time has elapsed and the pheromone is still active, which means that the disturbance has not been completely recovered, the action triggered is to reinforce the reestablishment time. If the pheromone has already dissipated, which means that the disturbance has already been resolved, the holon changes the autonomy factor to {Low}.
6.2 Adaptive Production Control Mechanism Based on Self-organization
The ADACOR adaptive production control, illustrated in Fig. 8, evolves over time by balancing between a more centralized approach and a flatter approach, passing through other intermediate forms of control [8], due to the described self-organization mechanisms.
[Figure 8 labels: "Steady state" and "Transient state" panels with SH, TH and OH holons under logical control; interaction during the resource allocation; occurrence of a machine failure; occurrence of a disturbance; SH deposits a pheromone; the holons sense the pheromone; propagates the pheromone; propagation of the emergence; evolution: reorganisation into a heterarchical structure; evolution: reorganization into a hierarchical structure after the disturbance is recovered]
Fig. 8. Adaptive Production Control based on a Self-organization Mechanism [8]
Briefly, in stationary operation, when the objective is optimization, the holons are organized in a hierarchical structure, with supervisor holons acting as coordinating entities. Supervisor holons periodically elaborate optimized schedules that are proposed to the operational holons. Although they have enough autonomy to accept or reject the proposals, the operational holons, due to their low autonomy factors, treat those proposals as advice and follow them.
When an unexpected disturbance that implies a deviation from the plan is detected, for example a breakdown in a milling center, the system enters a transient state and is forced to evolve into a heterarchical structure, operating without coordination levels, with each holon increasing its autonomy and assuming complete control of its own activity. This re-organization is supported by the self-adaptation of each holon and by the self-organization of the system, propagated using the pheromone-based technique. In this stage, scheduling is achieved in a distributed manner, resulting solely from the interaction between task and operational holons. The supervisor holons can continue elaborating and proposing schedules to the operational holons, but since the latter now have high autonomy values, they will probably reject the proposals. The holons remain in the transient state during the reestablishment time, which should desirably be as short as possible. After the recovery from the disturbance, the reinforcement of the pheromone is terminated, and the other holons no longer sense the pheromone and reduce their autonomy factors again. The system, as a whole, now evolves into a new control structure, which can be the previous one or a new one, according to the learning mechanisms embedded in each holon. The described approach was implemented using the JADE (Java Agent Development Framework) multi-agent framework and applied to a flexible manufacturing system [45]. The experimental results (see [45] for more details) demonstrated the applicability and benefits of the described adaptive production control approach, namely the improvement of some quantitative parameters, such as throughput, and also some qualitative parameters, such as the system’s agility. The experimental results, although promising, show the need to apply more powerful reconfigurable and self-organizing mechanisms.
In fact, the evolvable mechanisms should be more continuous, unpredictable and self-regulated based on more powerful self-organization techniques combined with the embodiment of more learning capabilities in the distributed entities.
7 Conclusions
Aiming to address the emergent requirements imposed on the manufacturing domain, a current challenge in designing re-configurable and responsive manufacturing systems is to consider the holonic design combined with the insights offered by biologically inspired techniques, namely swarm intelligence and self-organization. Understanding how complex things are performed in nature in a simple and effective way allows us to copy these solutions and develop complex and powerful adaptive and evolvable systems. Additionally, these concepts can be combined with insights from emergent theories from computer science, namely artificial life and evolutionary computing. The paper makes an incursion into these biologically inspired techniques, namely the emergent behavior and self-organization concepts, usually associated with complex and non-linear phenomena. It was pointed out that the interaction between simple individuals may create intricate and unpredictable patterns (chaos), which can evolve until they reach a stable configuration (order), i.e. emerging from chaos to order and being the basis for self-evolvable systems. A bi-dimensional approach was introduced to correlate
emergent behavior and evolution (and especially self-organization), discussing the use of systems that exhibit each concept separately and those that combine them. The ADACOR holonic approach was used to illustrate the benefits resulting from the application of some biologically inspired theories, namely emergence and self-organization, to achieve an adaptive production control that evolves from centralized structures, when the objective is optimization, to flatter structures in the presence of unexpected scenarios. In spite of the differences between the several theories for understanding and explaining complex systems in nature, future research should remix science and consider the good insights of each one, combining them to build powerful mechanisms that exhibit emergent and evolvable behavior in complex adaptive systems.
References
1. ElMaraghy, H.: Flexible and Reconfigurable Manufacturing Systems Paradigms. International Journal of Flexible Manufacturing Systems 17, 261–271 (2006)
2. Manufuture, N.N.: A Vision for 2020, Assuring the Future of Manufacturing in Europe. Report of the High-level Group, European Commission (2004)
3. Wooldridge, M.: An Introduction to Multi-Agent Systems. John Wiley & Sons, Chichester (2002)
4. Minsky, M.: The Society of Mind. Heinemann (1985)
5. Okino, N.: Bionic Manufacturing System. In: Peklenik, J. (ed.) CIRP Flexible Manufacturing Systems Past-Present-Future, pp. 73–95 (1993)
6. Brussel, H., Van, W.J., Valckenaers, P., Bongaerts, L., Peeters, P.: Reference Architecture for Holonic Manufacturing Systems: PROSA. Computers in Industry 37, 255–274 (1998)
7. Sousa, P., Silva, N., Heikkilä, T., Kollingbaum, M., Valckenaers, P.: Aspects of Cooperation in Distributed Manufacturing Systems. In: Proceedings of the Second International Workshop on Intelligent Manufacturing Systems, pp. 695–717 (1999)
8. Leitão, P., Restivo, F.: ADACOR: A Holonic Architecture for Agile and Adaptive Manufacturing Control. Computers in Industry 57(2), 121–130 (2006)
9. Chirn, J.-L., McFarlane, D.: A Holonic Component-Based Approach to Reconfigurable Manufacturing Control Architecture. In: Proceedings of the International Workshop on HoloMAS, pp. 219–223 (2000)
10. Schild, K., Bussmann, S.: Self-organization in Manufacturing Operations. Communications of the ACM 50(12), 74–79 (2007)
11. Leitão, P.: Agent-based Distributed Manufacturing Control: A State-of-the-art Survey. To appear in the Engineering Applications of Artificial Intelligence (2009)
12. Marik, V., Lazansky, J.: Industrial Applications of Agents Technologies. Control Engineering Practice 15, 1364–1380 (2007)
13. Miller, P.: The Genius of Swarms. National Geographic (July 2007)
14. N.N.: Visionary Manufacturing Challenges for 2020. Committee on Visionary Manufacturing. National Academic Press, Washington (1998)
15. Valckenaers, P., Van Brussel, H., Holvoet, T.: Fundamentals of Holonic Systems and their Implications for Self-adaptive and Self-organizing Systems. In: Proceedings of the Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO 2008), Venice, Italy, October 20-24 (2008)
16. Koestler, A.: The Ghost in the Machine. Arkana Books, London (1969)
17. Simon, H.: The Sciences of the Artificial, 6th edn. MIT Press, Cambridge (1990)
18. Winkler, M., Mey, M.: Holonic Manufacturing Systems. European Production Engineering (1994)
19. Marík, V., Fletcher, M., Pechoucek, M.: Holons & Agents: Recent Developments and Mutual Impacts. In: Mařík, V., Štěpánková, O., Krautwurmová, H., Luck, M. (eds.) ACAI 2001, EASSS 2001, AEMAS 2001, and HoloMAS 2001. LNCS, vol. 2322, pp. 233–267. Springer, Heidelberg (2002)
20. Marik, V., McFarlane, D.: Industrial Adoption of Agent-based Technologies. IEEE Intelligent Systems 20(1), 27–35 (2005)
21. Adami, C.: Introduction to Artificial Life. Springer, Heidelberg (1998)
22. Brooks, R.: The Relationship between Matter and Life. Nature 409, 409–411 (2001)
23. Holland, J.: Emergence: from Chaos to Order. Oxford University Press, Oxford (1998)
24. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: from Natural to Artificial Systems. Oxford University Press, Oxford (1999)
25. Thamarajah, A.: A Self-organizing Model for Scheduling Distributed Autonomous Manufacturing Systems. Cybernetics Systems 29(5), 461–480 (1998)
26. Smith, R.: Contract Net Protocol: High-Level Communication and Control in a Distributed Solver. IEEE Transactions on Computers, C-29(12), 1104–1113 (1980)
27. Markus, A., Vancza, T.K., Monostori, L.: A Market Approach to Holonic Manufacturing. Annals of CIRP 45, 433–436 (1996)
28. Vaario, J., Ueda, K.: Biological Concept of Self-organization for Dynamic Shop Floor Configuration. In: Proceedings of Advanced Product Management Systems, pp. 55–66 (1997)
29. Vaario, J., Ueda, K.: Self-Organisation in Manufacturing Systems. In: Japan-USA Symposium on Flexible Automation, Boston, US, pp. 1481–1484 (1996)
30. Darwin, C.: The Origin of Species. Signet Classic (2003)
31. Gould, S., Eldredge, N.: Punctuated Equilibrium Comes of Age. Nature 366, 223–227 (1993)
32. Leitão, P.: A Bio-Inspired Solution for Manufacturing Control Systems. In: Azevedo, A. (ed.) IFIP International Federation for Information Processing, Innovation in Manufacturing Networks, pp. 303–314. Springer, Heidelberg (2008)
33. Di Marzo Serugendo, G., Gleizes, M.-P., Karageorgos, A.: Self-Organisation and Emergence in MAS: An Overview. Informatica 30, 45–54 (2006)
34. Parunak, H.V.D., Brueckner, S.: Entropy and Self-organization in Multi-Agent Systems. In: Proceedings of the International Conference on Autonomous Agents (2001)
35. Glansdorff, P., Prigogine, I.: Thermodynamic Study of Structure, Stability and Fluctuations. Wiley, Chichester (1971)
36. Varela, F.: Principles of Biological Autonomy. Elsevier, New York (1979)
37. Bousbia, S., Trentesaux, D.: Self-organization in Distributed Manufacturing Control: State-of-the-art and Future Trends. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 5 (2002)
38. Kauffman, S.: The Origins of Order: Self Organization and Selection in Evolution. Oxford University Press, New York (1993)
39. Manei, M., Menezes, R., Tolksdorf, R., Zambonelli, F.: Case Studies for Self-organization in Computer Science. Journal of Systems Architecture 52, 443–460 (2006)
40. Pfeifer, R., Scheier, C.: Understanding Intelligence. MIT Press, Cambridge (2001)
41. Pujo, P., Ounnar, F.: Decentralized Control and Self-organization in Flexible Manufacturing Systems. In: Proceedings of the IEEE International Conference on Emergent Technologies for Factory Automation (ETFA 2001), pp. 659–663 (2001)
42. Wolf, T., Holvoet, T.: Emergence vs Self-organization: Different Concepts but Promising When Combined. In: Brueckner, S.A., Di Marzo Serugendo, G., Karageorgos, A., Nagpal, R. (eds.) ESOA 2005. LNCS (LNAI), vol. 3464, pp. 1–15. Springer, Heidelberg (2005)
43. Valckenaers, P., Hadeli, K.M., Brussel, H., Bochmann, O.: Stigmergy in Holonic Manufacturing Systems. Journal of Integrated Computer-Aided Engineering 9(3), 281–289 (2002)
44. Ulieru, M., Unland, R.: A Holonic Self-Organization Approach to the Design of Emergent e-Logistics Infrastructures. In: Di Marzo Serugendo, G., Karageorgos, A., Rana, O.F., Zambonelli, F., et al. (eds.) ESOA 2003. LNCS (LNAI), vol. 2977, pp. 139–156. Springer, Heidelberg (2004)
45. Leitão, P., Restivo, F.: Implementation of a Holonic Control System in a Flexible Manufacturing System. IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews 38(5), 699–709 (2008)
Self-Adaptation for Robustness and Cooperation in Holonic Multi-Agent Systems
Paulo Leitao (1), Paul Valckenaers (2), and Emmanuel Adam (3)
(1) Polytechnic Institute of Bragança, Campus Sta Apolonia, Apartado 1134, 5301-857 Bragança, Portugal [email protected]
(2) K.U. Leuven, Mechanical Engineering Department, Celestijnenlaan 300, B-3001 Leuven, Belgium [email protected]
(3) Univ Lille Nord de France, F-59000 Lille, France; UVHC, LAMIH, F-59313 Valenciennes, France; CNRS, UMR 8530, F-59313 Valenciennes, France [email protected]
Abstract. This paper reflects a discussion at the SARC workshop, held in Venice, October 2008. This workshop addresses robustness and cooperation in holonic multi-agent systems within a context of self-organizing and self-adaptive systems. The paper first presents the basic principles underlying holonic systems. The holonic system reveals itself as a ‘law of the artificial’: in a demanding and dynamic environment, all the larger systems will be holonic. Next, it addresses robustness in holonic systems, including its relationship to self-organization and self-adaptation. These self-* systems indeed are capable of delivering superior robustness. Third, it addresses cooperation in holons and holonic systems, including its relationship with the autonomy of the individual holons. Cooperation imposes constraints on a holon such that its chances of survival and success increase.
Keywords: Robustness, Cooperation, Self-Adaptation, Self-organization, Holonic Multi-agent Systems.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 267–288, 2009. © Springer-Verlag Berlin Heidelberg 2009
1 Introduction
Current demands imposed on manufacturing information systems require the adoption of complex, emergent and adaptive system designs, where flexibility, re-configurability and responsiveness play crucial roles. The design of such systems may require inspiration from diverse fields, e.g. complex systems, artificial intelligence, sociology and biology. The holonic multi-agent systems (HMAS) paradigm is a suitable approach to address this challenge, offering interesting potential regarding self-adaptation and self-organization. Holonic multi-agent systems are pyramidal systems where the notions of robustness and cooperation between the autonomous agents are key issues; note that these properties, and also autonomy, are inherited from the foundations of the underlying
P. Leitao, P. Valckenaers, and E. Adam
paradigms, i.e. multi-agent systems [1] and holonic systems [2]. Robustness and cooperation are strongly dependent on the dynamics of such systems. Robustness enables HMAS to cope with perturbations, faults and other constraints. Its implementation is typically a self-organization challenge, of which the regeneration of faulty agents, the generation of new agents and/or the migration of agents are only some examples. Cooperation between the agents is also essential in this context: as the main objective is to manage the manufacturing system, autonomous agents have to lead the HMAS collectively to its global goal. This property is generally achieved by letting the agents locally and autonomously modify their roles or behaviors, adapting their strategies to work with other agents; it is a self-adaptation property. Interesting proposals tackling the notions of robustness and cooperation have been made in the manufacturing system domain. However, very few of them take advantage of the self-adaptation and self-organization of dynamic systems, such as multi-agent systems. On the other hand, research results on self-organization and emergence propose interesting methods and models to deal with robustness and cooperation in dynamic systems, but some of these approaches have identified the need for multi-level control or for control from a supervisory level. This paper discusses the models and methods linked to robustness and cooperation in dynamic systems, enabling the emergence of new behaviors and sub-organizations in a HMAS context. It also discusses how these properties, i.e. robustness, cooperation and autonomy, are correlated and can constrain each other, and how they can support self-adaptation and self-organization in HMAS environments. The rest of the paper is organized as follows: Section 2 overviews the roots of holonic systems and introduces the concept of HMAS.
Section 3 discusses the robustness property in HMAS and Section 4 analyzes how cooperation and autonomy are handled in HMAS. Section 5 summarizes how the robustness, cooperation and autonomy properties are correlated to support self-adaptation and self-organization in HMAS, and rounds up the paper with the conclusions.
2 Holonic Multi-Agent Systems and the Sciences of the Artificial
In many publications, holonic systems are defined by their system characteristics, such as a pyramidal or a nearly-decomposable structure. This creates the impression that holonic systems are merely one system concept or proposition amongst many others (e.g. fractal, bionic, hierarchical, heterarchical, etc. [3-5]). It also creates the illusion that system designers may choose amongst these system designs/concepts according to their preferences. This section reveals why this assumption is false. In fact, the holonic systems concept originates from Simon’s “Sciences of the Artificial” [6]. In his endeavor, Simon searches for the laws of the artificial, which constitute a counterpart for the laws of physics in natural systems. Such laws are invariably true under the conditions in which they apply. A mechanical engineer who ignores Newton’s laws, under conditions in which they apply, will fail miserably. A system designer who ignores Simon’s laws of the artificial, under conditions in which they apply, will equally fail miserably. Therefore, whenever Simon’s basic assumptions hold, successful systems are holonic systems. If a system has a non-holonic design, it simply will not exist (or cease to exist),
Self-Adaptation for Robustness and Cooperation in Holonic Multi-Agent Systems
when and wherever these assumptions apply. Conversely, as Simon restricts the definition of a holonic system to properties that are invariably true, the design space for developers of holonic systems remains large. Indeed, most design choices still need to be made. The remainder of this section thus presents and discusses the roots of the holonic system concept. A key message is that the more complex self-adaptive and self-organizing systems will be holonic.

2.1 Simon’s Assumptions

The key starting point of Simon’s laws of the artificial is limited or bounded rationality. Intelligent systems, human or otherwise, have finite computation and communication capacities. There are upper bounds on the speed with which information-processing tasks can be performed, and increasing the number of processors and brains soon brings no further improvement because of the communication effort it entails. Notice that Simon’s starting point contrasts with that of many other scientific disciplines. Theories about the market economy¹, game theory, etc. commonly assume that the ‘homo economicus’ makes perfect decisions based on perfect information. In reality, decisions are made under bounded rationality and based on incomplete and inaccurate information, and executing these decisions is subject to uncertainty (e.g. the uncertainty caused by imperfect law enforcement in an economic system). A second element, leading to the conclusion that a certain class of systems must necessarily be holonic, is a demanding environment. Holonic systems need to answer non-trivial critical requirements in a competitive setting. When a system fails to exploit all the available means to achieve a superior performance, it will disappear (or fail to appear at all). Indeed, other more successful system designs will deprive it of its resources. In particular, these systems must grow larger and more complex as long as this growth increases their performance in this demanding environment.
Consequently, the surviving and successful systems are located on the frontier corresponding to the limited rationality of their developers. The developers improve and expand their system until they hit the complexity ceiling. The third element leading to the conclusion that a certain class of systems must necessarily be holonic is a dynamic environment. Holonic systems need to adapt continuously and rapidly. In other words, the limited ability to develop, analyze and optimize a system’s design (caused by limited rationality) applies equally to the need to evolve and adapt to a rapidly changing environment. System developers do not have time to optimize their designs fully, since these designs will be outdated by the time their system becomes operational. In a dynamic environment, swift adaptation, which results in a suboptimal but competitive performance, is the successful strategy. The system may continuously optimize but never becomes optimal. In addition, the speed at which improvements are implemented is vital. Moreover, the information processing effort needed for improving the current system in view of recent developments in a dynamic environment cannot exceed the prevailing limits of our bounded rationality. Notice that the latter two elements do not belong to the same category as the first. Indeed, limited rationality is omnipresent and invariant. All system development and
¹ Simon received the Nobel Prize for Economic Sciences in 1978.
P. Leitao, P. Valckenaers, and E. Adam
adaptation processes and efforts are subject to upper bounds on information processing and communication. In contrast, there may be systems that are not exposed to a demanding and dynamic environment. Therefore, the claims made in the discussion below only apply to systems for which all three elements are present. In other words, not all systems are necessarily holonic. However, when the environment of a given system is dynamic and demanding, this system will inevitably be holonic.

2.2 Simon’s Pyramidal Structure

Simon uses a parable about two watchmakers to reveal why and to what extent systems with a pyramidal structure are superior to more monolithic designs. Please keep in mind that his claim and the corresponding arguments rely on the above assumptions or conditions (see 2.1). Systems composed of subsystems that in turn consist of subsystems (and so on), which is also referred to as a nearly decomposable configuration, are vastly superior when it comes to adaptation in dynamic and demanding environments. These holonic systems are equally better equipped for swift development, so that they may emerge within the window of opportunity at which they are targeted. The parable goes as follows:

There were once two watchmakers, named Hora and Tempus, who made very fine watches. Both were highly regarded, and the phones in their workshops rang often; new customers were ordering watches all the time. However, while Hora prospered, Tempus became poorer and poorer and finally lost his shop. What happened? The watches consisted of about 1000 parts each. Whenever the phone rang and Tempus had to put down a partly assembled watch, it immediately fell to pieces and its assembly had to restart from the basic elements. The better the customers liked his watches, the more his phone rang, and the fewer watches could be produced. Hora’s watches were equally complex, but she had designed them so that she could make subassemblies of about ten elements.
Ten of these subassemblies form a larger subassembly, and a system of ten large subassemblies constitutes the whole watch. Thus, when Hora had to put down a subassembly to answer the phone, she only lost a small part of her work.

Simon’s quantitative analysis (cf. chapter 8 in [6]) reveals a 4000-fold productivity advantage for Hora, which moreover increases with product complexity. It leads to the conclusion that the pyramidal structure of holonic systems constitutes “a law of the artificial.” Note also that Simon’s analysis implicitly defines what such a pyramidal structure is. Any system that fails to support adequate disturbance containment through such a structure is not a holonic system, although it may appear to have a pyramidal structure and/or a nearly decomposable configuration (e.g. rigid hierarchical designs). This “law of the artificial” holds in all situations characterized by a dynamic and demanding environment. It is not a matter of choice, preference, style or policy. Non-holonic systems are either simple (i.e. basic components) or they enjoy a stable or
non-competitive environment. Any non-trivial system in a demanding and dynamic environment will be holonic or it will not be. In practice, the humans within the system are instrumental in rendering it self-adaptive and self-organizing.

2.3 Koestler’s Holonic System

Simon discovered the essence of holonic systems but did not introduce the term “holonic system”. Koestler proposed this terminology [2]. He also pointed out that the subsystems in a holonic system, the holons, have a dual character or, as Koestler calls it, a Janus face. Holons have to be significantly more stable than the larger systems/holons to which they belong. At the same time, holons have to provide suitable services to these larger systems/holons. This is often expressed in the following manner: holons must be simultaneously autonomous and cooperative. Speciously, Koestler attributes all stability-providing capabilities of holons to their autonomy. The correct insights encompass much more. First, a majority of the smaller systems/holons enjoy superior stability because they are smaller. Their exposure to the dynamics of the environment is reduced in comparison to the exposure of the larger holons, which are by definition exposed to the union of the exposures of their sub-holons. Second, the survival, emergence and success of any non-trivial holon depend on its membership of some strong autocatalytic sets (see section 2.4 below). Autonomy is one mechanism, amongst many others, to reduce exposure to the dynamics of the environment and/or to facilitate membership of autocatalytic sets. However, autonomy also increases the complexity/size of holons, which has its price. Koestler equates the ability of holons to be subsystems of some larger holon with their cooperativeness. Simon’s parable reveals that, in dynamic environments, the larger holons only start to exist when they are able to retrieve suitable sub-holons. In other words, initially the sub-holons need not be cooperative.
It is the larger holon that accommodates the characteristics of the prevailing sub-holons that it may use. Of course, the ability and tendency of successful holons to maximize their membership of autocatalytic sets results in a cooperative nature. When the larger holon is successful in creating suitable resource-providing autocatalytic cycles, the sub-holons’ ability to adapt and evolve will make them increasingly cooperative. This commonly results in variants of the sub-holons that can only survive within the larger holon (e.g. organs in a human body). Note that humans and natural systems provide this adaptation and evolution today. Designing artificial systems with such abilities is not yet within today’s state-of-the-art.

2.4 Complex Adaptive Systems and Autocatalytic Sets

Research on complex adaptive systems theory reveals further laws of the artificial [7]. These laws are applicable when bounded rationality is combined with a competitive and dynamic environment (and explain to some extent why and how the real world happens to be so dynamic and demanding). This research actually assumes an extremely bounded intelligence (i.e. randomized search and performance-based selection). However, for the larger infrastructures and/or systems developed by human societies, this assumption applies to a sufficient extent to make the insights relevant.
Human intelligence improves on random search but is unable to make a qualitative difference when developing sufficiently large and complex systems. As an analogy, humans may travel faster on a bicycle, but they still travel only where a pedestrian may arrive at some later point in time. Among other concerns, complex adaptive systems researchers attempt to understand and explain how life emerged out of basic organic material by chance. The best-known basic theory for the emergence of life is Darwin’s theory of evolution, in which new life forms (or to-be-life-forms) are combined in a random fashion whilst survival of the fittest performs the selection. However, the probability that random combinations in a pool of organic material, when lightning bolts deliver the necessary energy, result in some of the smallest existing life forms is much too small. Estimates of an upper bound on the age of the universe in combination with upper bounds on the amount of parallel processing indicate that life is highly unlikely to emerge in this way. The most significant improvement on Darwin’s theory has connections to Simon’s insights into holonic systems. In this theory, random combinations of organic material no longer have to produce life forms from scratch. Instead, they create small sets of larger molecules/structures that are catalysts to themselves and sufficiently stable to survive until the next energy input: autocatalytic sets. This is sufficiently likely to happen. Moreover, it results in a pool of organic material in which the autocatalytic sets consume all the available raw material that is suitable for their reproduction. Eventually, the pool is populated with larger molecules. In addition, there is a stable supply of these larger molecules because of the autocatalysis. Then, the process of building autocatalytic sets repeats itself, creating ever-larger organic structures until the size and complexity needed for life forms is reached.
There is plenty of empirical evidence to corroborate this theory. All known life forms reproduce themselves. Rabbits and weeds even have a solid reputation in this respect (reproduction is a subclass of autocatalysis). Evidently, a multitude of other mechanisms (e.g. symbiosis) is present in natural systems, resulting in the complex adaptive ecosystems that constitute the dynamic and demanding environment that we all know. In artificial systems, the autocatalytic sets comprise natural elements, mostly humans. There are typically two categories of autocatalytic sets:

• The economic autocatalytic set or resource-providing set. Artificial systems need to bring economic value such that our society provides the necessary means to develop, produce, maintain and sustain the artificial systems.
• The information autocatalytic set. The users of an artificial system generate information that is needed to adapt and evolve its design.

As an example, many species of beautiful roses only survive because of their membership in such autocatalytic sets. Left to themselves, they are unable to survive in nature. Similarly, cars, shoes and power grids only exist because of the above. In this context, self-adaptive and self-organizing systems are able to deliver useful services to more people over longer periods. Thus, they benefit more from the autocatalytic set mechanism. However, these systems are more complex and likely to require more resources and information for their creation and operation. Summarizing, this section presents insights into holonic systems and related insights into complex adaptive systems. In contrast to research into the development of self-adaptive
and self-organizing systems, these insights are ‘invariants’. They are laws of artificial systems. Ignoring these laws is analogous to ignoring gravity in civil engineering. It should become clear that the pyramidal structure of holonic systems serves a specific purpose; it is this purpose that defines whether a design is holonic. Moreover, a holonic system can be fractal, bionic, etc. These ‘alternatives’ are not competing but complementary to the holonic system insights.
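The productivity gap behind the watchmaker parable of Section 2.2 can be checked with a few lines of arithmetic. The sketch below is a simplified model and not Simon's own derivation: it assumes that every part placement is interrupted independently with probability 0.01 (an assumed value), that an interruption discards only the assembly currently in hand, and that effort is counted in placement attempts. The function names are hypothetical. Under these assumptions the hierarchical design wins by a factor in the thousands, the same order of magnitude as the figure cited above; the exact ratio depends on the modeling choices.

```python
def expected_steps(parts: int, p_interrupt: float) -> float:
    """Expected placement attempts to finish one assembly of `parts` parts
    when each attempt is interrupted with probability `p_interrupt` and an
    interruption discards the unfinished assembly."""
    q = 1.0 - p_interrupt
    # Standard result for the expected number of trials needed to obtain
    # `parts` consecutive successes with success probability q.
    return (q ** -parts - 1.0) / p_interrupt


def watch_effort(levels: int, fanout: int, p_interrupt: float) -> float:
    """Expected attempts for a watch built as a `levels`-deep hierarchy in
    which every (sub)assembly joins `fanout` components."""
    assemblies = sum(fanout ** k for k in range(levels))  # 1 + 10 + 100 for Hora
    return assemblies * expected_steps(fanout, p_interrupt)


P = 0.01                                  # assumed interruption probability
tempus = expected_steps(1000, P)          # one flat assembly of 1000 parts
hora = watch_effort(3, 10, P)             # 111 assemblies of 10 components each
advantage = tempus / hora
print(f"Tempus: {tempus:,.0f}  Hora: {hora:,.0f}  ratio: {advantage:,.0f}")
```

Raising the part count or the interruption probability in this model widens the gap, which matches the remark that the advantage increases with product complexity.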
3 Robustness in Holonic Multi-Agent Systems

After analyzing the foundations of holonic systems, this section discusses the impact of a holonic multi-agent system organization on the robustness of a system, and it investigates how robustness supports the implementation of self-organized and self-adaptive systems.

3.1 Notion of Robustness

Robustness is a fundamental property in the analysis and control of dynamic systems, taking on even more importance when they are complex. In biology, robustness is an inherent property. According to Kitano [8], robustness “is a property that allows a system to maintain its functions against internal and external perturbations. It is one of the fundamental and ubiquitously observed systems-level phenomena that cannot be understood by looking at the individual components. A system must be robust to function in unpredictable environments using unreliable components.” Similar definitions can be found in the literature, such as “the ability to maintain performance in the face of perturbations and uncertainty” [9]. In control theory, the concept of robust control defines “a method of applying stable control over the system such that proper control is guaranteed even if the model deviates from the real system due to modeling errors” [10]. With the previous definitions in mind, it is clear that a system, whether a biological organism or a computer application, is "robust" if it is capable of continuing to work correctly and remaining relatively stable in the presence of internal faults or stressful environmental conditions. Robustness can be interpreted as the degree of the system’s ability to handle exceptions and tolerate faults. In fact, this property describes the way the system works in the following situations:

• The system is resistant and not wholly affected by a single failure in one of its parts, e.g. due to internal disturbances such as machine failures.
• The system resists well under or recovers quickly from exceptional circumstances, e.g. variations (sometimes unpredictable) in its operating environment.

The analysis of robustness is concerned with the study of the impact of internal and external disturbances on the operation and performance of the system. A low degree of robustness may strongly affect a system’s performance, as the system fails to function properly until the situation is restored to normality. In fact, if a system has a weak capability to resist disturbances, its productivity and performance decrease in the presence of unexpected disturbances. Additionally, weak robustness also has an impact at
other levels, especially in terms of achieving the pre-established goals and guaranteeing Quality of Service (QoS). The measurement of system robustness is not an easy task since it deals with a qualitative and subjective performance parameter. A robustness benchmark should measure how a system reacts to possible erroneous inputs or environmental factors that could affect the system. Ideally, the system would be exercised with all possible errors, leading to an absolutely robust system. However, in reality, it is not possible to test all errors that can naturally occur in the system (observing the system in operation and waiting for errors that occur infrequently is too time-consuming). Until now, there has been no effective approach to measure the robustness of a control system quantitatively. A possible way to assess system robustness is introduced in [11], which uses a set of what-if tests to draw conclusions about the system robustness:
• Change the configuration of the system layout, e.g. introducing a new resource, removing a resource or modifying a resource’s skills. Does the system remain stable?
• Introduce failures in the resources according to a disturbance model and break down a centralized component (such as a central scheduler). Does the system remain stable?
• Introduce a new type of production order or increase the number of production orders. How does the system respond?
• Introduce a new scheduling algorithm or change the rules for the decision-making. Does the system remain stable?
• Introduce data type errors in the content of the messages used for intra- and inter-communication. Does the system remain stable?
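A what-if campaign of this kind can be organized as a small test harness. The following sketch is only an illustration under strong assumptions: `ToyShop`, its notion of "stable" (every ordered operation is still offered by some resource), the scenario names and the pass-rate indicator are all hypothetical and are not taken from [11]. It covers layout changes and a new order type; disturbance models and message errors would need a richer system model.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ToyShop:
    """Hypothetical stand-in for the controlled system: resource name -> skills."""
    skills: Dict[str, Set[str]] = field(default_factory=dict)

    def can_serve(self, orders: List[str]) -> bool:
        # 'Stable' is reduced here to: every ordered operation is offered somewhere.
        offered = set().union(*self.skills.values())
        return all(order in offered for order in orders)


Scenario = Callable[[ToyShop, List[str]], None]


def run_what_if(shop: ToyShop, orders: List[str],
                scenarios: Dict[str, Scenario]) -> Dict[str, bool]:
    """Apply each perturbation to a fresh copy and record whether the shop copes."""
    results = {}
    for name, perturb in scenarios.items():
        trial_shop, trial_orders = copy.deepcopy(shop), list(orders)
        perturb(trial_shop, trial_orders)
        results[name] = trial_shop.can_serve(trial_orders)
    return results


def remove_redundant(shop: ToyShop, orders: List[str]) -> None:
    del shop.skills["mill_1"]            # mill_2 can still mill


def remove_unique(shop: ToyShop, orders: List[str]) -> None:
    del shop.skills["lathe_1"]           # nobody else can turn


def add_resource(shop: ToyShop, orders: List[str]) -> None:
    shop.skills["lathe_2"] = {"turn"}


def new_order_type(shop: ToyShop, orders: List[str]) -> None:
    orders.append("weld")                # an operation no resource offers


shop = ToyShop({"mill_1": {"mill"}, "mill_2": {"mill", "drill"}, "lathe_1": {"turn"}})
scenarios = {"remove redundant resource": remove_redundant,
             "remove unique resource": remove_unique,
             "add resource": add_resource,
             "new order type": new_order_type}
results = run_what_if(shop, ["mill", "turn", "drill"], scenarios)
pass_rate = sum(results.values()) / len(results)   # crude, assumed indicator
print(results, pass_rate)
```

The pass rate is a convenience for comparing configurations, not a validated robustness metric.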
A generic metrics system to measure robustness remains an open issue, requiring additional efforts to generalize robustness benchmarking.

3.2 Robustness in Terms of Self-organization and Self-Adaptation

In complex and dynamic systems, there is a strong connection between the robustness and the evolution properties: the robustness property enables complex systems to evolve, and evolution enhances the system robustness. In fact, the essence of robustness, i.e. the objective to maintain the system functionality against perturbations, often requires re-configuring the system by changing the behaviour of the system components and/or the way they are organized. The holonic rationale offers the possibility to design reconfigurable systems, where autonomous holons can be added, removed or modified on the fly without affecting the functioning of the other holons or the system as a whole (e.g. without stopping, re-programming or re-initializing the other holons), as illustrated in Fig. 1. The decentralized and autonomous nature of holonic multi-agent systems permits the achievement of very robust systems that cope with perturbations, faults and hard constraints, especially when compared with traditional centralized and hierarchical structures.
Fig. 1. Re-configurability achieved in HMAS enables Robustness
On the other hand, incomplete or even erroneous information will affect system robustness and performance negatively. Decisions will no longer be optimal or even beneficial, and in certain cases will be myopic. The re-organization of distributed and autonomous holons constantly faces the emergence and evolution of the system. The evolution of a complex and dynamic system can use very powerful mechanisms, some of them inherited from nature, such as self-organization and self-adaptation. The Product-Resource-Order-Staff Architecture (PROSA+ANTS) [12-14] and the ADAptive holonic COntrol aRchitecture for distributed manufacturing (ADACOR) [15] are two examples addressing robustness and reconfigurability that show self-adaptation or self-organization in holonic multi-agent systems: PROSA+ANTS shows self-organization and ADACOR shows self-adaptation. In spite of the relationship between re-configurability and robustness, evolution can cause a loss of robustness: the dynamic re-organization of the system can make it unstable. Special attention should be devoted to mechanisms that guarantee that the system remains stable, stability being a necessary (but not sufficient) condition for robustness [14; 16].

How PROSA+ANTS achieves self-organization. In PROSA+ANTS systems, the intelligent products [14] take the necessary steps to get themselves produced [13]. Each intelligent product corresponds to an Order Holon, which manages the corresponding activities within the manufacturing system in a decentralized manner. This collection of Order Holons self-organizes production in cooperation with intelligent resources (Resource Holons) and intelligent product types (Product Holons). To achieve such self-organizing factory-level control, each Order Holon utilizes two swarms of lightweight agents. Sometimes, these swarms are called a Delegate MAS [17].
The members of the first swarm virtually execute one possible product routing and the corresponding sequence of processing steps. Together, the members of this swarm explore the search space of possible ways to perform the required production steps for the Order Holon that generates them. These members create and maintain a set of possible routings and step sequences that satisfy the needs of their Order Holon. The performance parameter values for such a set member (e.g. lead-time) are estimated through virtual execution.
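The exploration performed by the first swarm can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `ResourceHolon` fields, the four-node factory, the random-walk policy and the rule of taking the candidate with the smallest estimated lead time are hypothetical and are not the PROSA+ANTS implementation. The sketch only shows the principle of many cheap agents virtually executing routes through local neighbor links and collecting candidate routings with estimated performance.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ResourceHolon:
    operation: Optional[str]   # processing step offered here (None: transport only)
    duration: float            # assumed time spent at this resource
    neighbors: List[str]       # links to neighboring Resource Holons


def explore(factory: Dict[str, ResourceHolon], start: str, steps: List[str],
            rng: random.Random, max_hops: int = 20
            ) -> Optional[Tuple[Tuple[str, ...], float]]:
    """One exploring ant: virtually walk the factory and perform `steps` in order.

    Returns (route, estimated lead time), or None if the hop budget runs out."""
    route, lead_time, todo, here = [], 0.0, list(steps), start
    for _ in range(max_hops):
        holon = factory[here]
        route.append(here)
        lead_time += holon.duration          # virtual execution, nothing is produced
        if todo and holon.operation == todo[0]:
            todo.pop(0)                      # this processing step is now covered
        if not todo:
            return tuple(route), lead_time
        here = rng.choice(holon.neighbors)   # the ant only knows local links
    return None


factory = {
    "entry":  ResourceHolon(None, 1.0, ["mill_a", "mill_b"]),
    "mill_a": ResourceHolon("mill", 5.0, ["paint"]),
    "mill_b": ResourceHolon("mill", 3.0, ["paint"]),
    "paint":  ResourceHolon("paint", 2.0, ["entry"]),
}

rng = random.Random(42)
candidates: Dict[Tuple[str, ...], float] = {}
for _ in range(50):                          # the swarm: 50 independent ants
    found = explore(factory, "entry", ["mill", "paint"], rng)
    if found is not None:
        candidates[found[0]] = found[1]

# Assumed selection rule: smallest estimated lead time.
best_route, best_lead_time = min(candidates.items(), key=lambda item: item[1])
print(best_route, best_lead_time)
```

Because each ant starts from a single known resource and follows links supplied by the holons it visits, changing the `factory` dictionary is enough for later ants to find different routes.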
Each Order Holon selects a possible way (i.e. product routing and processing steps at every resource along this route) to be produced from the above set. This is the intention of the Order Holon. The members of the second swarm announce the intentions of their Order Holon to the resources. They virtually execute their Order Holon’s intentions and reserve the required production capacity with the Resource Holons along the selected routing. In the above, robustness is achieved in the following manners:
• The intelligent products discover the factory, whatever its layout may be. Indeed, the swarm members start (virtually) from a single location in the factory (i.e. the current position of the product or its final delivery point) and a virtual instant in time (i.e. respectively the current time or the due date). Initially, they know a single Resource Holon. During virtual execution, every known Resource Holon provides a link to each of its neighboring Resource Holons (actually through sub-holons that correspond to connections, exits and entries). This information is available to the swarm members, which virtually visit the Resource Holons. Thus, the system auto-configures, and this is its normal operating mode.
• This discovery is an ongoing activity. The swarms continuously rediscover the factory and forget what has been discovered earlier. If something changes in the factory, if some disturbance occurs, the swarms discover this after a short period of time and adapt. Change and disturbances are business-as-usual.
• The Order Holons inform the Resource Holons of their intentions through the second swarm, which makes the necessary reservations. This enables a virtual execution that accounts for the expected loading of the resources. Thus, the system detects and computes a forecast of the near-future states of both resources and products in a self-organized manner. Again, any reconfiguration (or exotic situation) will be handled as business-as-usual.
• This near-future forecasting is a non-stop activity. The Order Holons, through their second swarm, continuously reconfirm their reservations. Failure to reconfirm causes the Resource Holons to discard/forget a reservation. Changes and disturbances are accounted for during a subsequent reconfirmation.
• Order Holons may change their intentions. When changes or disturbances occur, an Order Holon may perceive superior ways to be produced. The system allows the Order Holon to change its intention. The Order Holon’s swarm will stop reconfirming the old intention and will make reservations for the newly selected intention. The swarm will maintain the latter until the Order Holon decides to change again.
• Order Holons change intentions conditionally. The Order Holons will not change intentions lightly. There exist stabilizing mechanisms to protect the validity of the near-future forecast [16].
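The reconfirm-or-forget behavior described in the list above can be sketched as reservations with a limited lifetime. This is a minimal illustration under stated assumptions: the class layout, the `ttl` of three ticks and the single order switching resources at tick 4 are hypothetical and are not taken from the PROSA+ANTS implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ResourceHolon:
    """Keeps a reservation only while it is being reconfirmed (hypothetical API)."""
    ttl: int = 3                                    # ticks survived without refresh
    expires_at: Dict[str, int] = field(default_factory=dict)

    def reserve(self, order_id: str, now: int) -> None:
        # Making and reconfirming a reservation are the same operation.
        self.expires_at[order_id] = now + self.ttl

    def evaporate(self, now: int) -> None:
        # Forget every reservation that was not reconfirmed in time.
        self.expires_at = {o: t for o, t in self.expires_at.items() if t > now}

    def load(self) -> int:
        return len(self.expires_at)


resource_a, resource_b = ResourceHolon(), ResourceHolon()
intention = resource_a                   # the Order Holon's current intention
history: List[Tuple[int, int, int]] = []

for now in range(10):
    if now == 4:
        intention = resource_b           # a better way to be produced was perceived
    intention.reserve("order-1", now)    # second swarm reconfirms the intention
    for resource in (resource_a, resource_b):
        resource.evaporate(now)
    history.append((now, resource_a.load(), resource_b.load()))

for tick, load_a, load_b in history:
    print(tick, load_a, load_b)
```

In this run the old reservation lingers for a couple of ticks after the switch and then disappears without any explicit cancellation message, which is the point of the mechanism: a crashed or rerouted order cleans up after itself simply by falling silent.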
Overall, a PROSA+ANTS system makes minimal assumptions about the production system, the products and the situation. It achieves robustness because it does not recognize a nominal or normal state or operating mode. It considers the entire state space and possible trajectories of the underlying production system to belong to the normal operating mode. It delays the distinction between well-performing and poor-performing control
mechanisms as much as possible (i.e. to plug-ins that can be replaced at any time without the need to adapt/enhance the remainder of the system). Moreover, the responsibilities in a PROSA system are divided such that a change in the environment (e.g. the introduction of a new product variant) affects solely the directly associated holons (i.e. the Product Holon). The system’s robustness originates both from its self-organizing capabilities and from its structural (decomposition) properties. Recent research is looking at robustness in schedule execution [18]. In this enhanced version, the first swarm dedicates 25-75% of its efforts to discovering routings and processing step sequences that comply with a schedule (supplied by a Staff Holon). The robustness is relative to the scheduling information (e.g. missing or stale data…). Moreover, research is looking at trust and reputation to handle unreliable information sources (e.g. when a Resource Holon promises to deliver “tomorrow or mañana”).

How ADACOR achieves self-adaptation. ADACOR is built upon a community of holons representing the manufacturing components, defining four manufacturing holon classes [15]: product (PH), task (TH), operational (OH) and supervisor holon (SH) classes, according to their functions and objectives. The product, task and operational holons are quite similar to the product, order and resource holons defined in the PROSA reference architecture [Van Brussel], while the supervisor holon presents characteristics not found in the PROSA staff holon. The supervisor holon introduces coordination and global optimization in decentralized control and is responsible for the formation and coordination of groups of holons. In ADACOR, robustness is achieved in the following ways:
• The system can be reconfigured on the fly, i.e. adding/removing/modifying a holon without the need to stop, re-program and re-initialize the other holons.
• Change and disturbances are treated as normal situations, while the system tries to forecast future disturbance occurrences based on historical data, instead of simply reacting to them.

ADACOR introduces an adaptive production control approach based on the self-organization of each ADACOR holon and pheromone-like propagation techniques. The self-organization is regulated by the autonomy factor, which fixes the level of autonomy of each holon and evolves dynamically in order to adapt the holon behavior to the changes in the environment where it is placed. The system evolution is governed by a decision mechanism, and the overall efficiency of the self-organization depends on how the learning mechanisms are implemented and on how new knowledge influences its parameters. Basically, the system is designed to be as decentralized as possible and as centralized as necessary, i.e. evolving between a centralized approach when the objective is optimization and a more heterarchical approach in the presence of unexpected events and modifications. The self-adaptation exhibited by ADACOR also improves the system robustness, either to deal with disturbances or to handle opportunities to evolve according to the environment constraints.
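The idea of an autonomy factor that moves a holon between following a supervisor and deciding locally can be illustrated as follows. This is a deliberately simple sketch and not ADACOR's published mechanism: the numeric range, the jump to full autonomy on a disturbance, the exponential relaxation towards a baseline and the threshold are all assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OperationalHolon:
    autonomy: float = 0.1       # hypothetical autonomy factor in [0, 1]
    baseline: float = 0.1       # assumed resting level in quiet conditions
    decay: float = 0.5          # assumed share of excess autonomy lost per quiet cycle
    threshold: float = 0.5      # above this value the holon decides locally

    def sense(self, disturbance: bool) -> None:
        if disturbance:
            self.autonomy = 1.0  # react: switch to heterarchical behavior
        else:
            # Relax gradually towards the baseline once things calm down.
            excess = self.autonomy - self.baseline
            self.autonomy = self.baseline + excess * (1.0 - self.decay)

    def choose(self, supervisor_proposal: str, local_plan: str) -> str:
        # Low autonomy: accept the supervisor's optimized proposal.
        # High autonomy: fall back on the holon's own plan.
        return local_plan if self.autonomy > self.threshold else supervisor_proposal


holon = OperationalHolon()
events = [False, False, True, False, False, False]   # one disturbance at cycle 2
decisions: List[str] = []
for disturbed in events:
    holon.sense(disturbed)
    decisions.append(holon.choose("supervisor schedule", "local plan"))

for cycle, decision in enumerate(decisions):
    print(cycle, decision)
```

In this run the holon follows the supervisor while the environment is quiet, acts on its own plan during and shortly after the disturbance, and then returns to the supervisor's schedule, mirroring the "as decentralized as possible and as centralized as necessary" principle.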
3.3 Applications Where Robustness Is Required

Robustness is a key issue in the adoption of new technologies, especially for enterprises whose competitive markets make execution and lead times crucially important. Robustness can be defined as a set of responses to perturbations of different levels of severity that act on system performance. It is needed especially by systems that evolve in dynamic, partly unpredictable environments and whose activities are non-deterministic. In information technologies, where multi-agent systems are often used [19], robustness is already taken into account in most systems in terms of duplication of data servers, communication validated by acknowledgements with retransmission in case of failure, and encryption of data exchanges. In the area of information retrieval, noise, the adequacy of the data relative to the demand and the relevance computation also play a role in the robustness or efficiency of information systems. These challenges are often managed using algorithms based on Bayesian networks and/or neural networks (cf. [20]). Robustness against abnormalities in inputs or computation is manageable most of the time: the exceptions are caught and a correction algorithm is applied. Learning algorithms and data-mining models compensate for missing data (due to a communication problem or a misunderstanding). On the other hand, robustness with respect to the loss of a node, a computer or a server, especially in highly distributed architectures, is not fully taken into account. Indeed, the resynchronization (of information with reality) after a “crash” is still a key issue in information technologies, as in other application domains. This kind of event calls for resilience mechanisms and, especially in distributed systems, some self-adaptation and/or self-organization capacities.
There is a threshold for each system, defined by the level of performance degradation beyond which the system decides to use self-* capacities rather than correction of input. The only negative impact of a robustness capacity could be, when no perturbation occurs, a slowdown in performance due to the control mechanism. Robustness therefore has a cost, and it should be reserved for complex and distributed systems evolving in a non-deterministic environment. Holonic multi-agent systems are particularly well suited to these systems; their combination of distributed and federated control brings performance evaluation to all parts of the system and makes it possible to manage large re-organizations if needed. Indeed, robustness is at odds with centralized control; it must be distributed over control units among the systems to be managed. Several applications have been modeled, designed or developed using holonic multi-agent system concepts, mainly for manufacturing systems management. In this area, manufacturing control units must be able to adapt to changing environments and handle emergent contexts. A project initiated by HEO (Holden Engine Operations, a major Australian automotive manufacturer) aimed to develop a strategy to migrate existing manufacturing control systems to holonic control systems. It uses POC++, a holonic extension of the POC (part-oriented control) architecture composed of manufacturing and interface agents [21].
Self-Adaptation for Robustness and Cooperation in Holonic Multi-Agent Systems
Other applications also need robustness, or control derived from holonic systems. Examples include the coordination of SME production and virtual enterprise management, which aim at a new way of managing individual enterprises that must coordinate their work, as in an agricultural SME network [22]. Decision-making support systems must also be robust; an example is the medical diagnostic system in [23], in which swarm agents are coupled to holonic agents to help users reach a diagnosis, which is especially relevant for complex cases. Logistics is a highly dynamic domain, especially in emergency situations, and needs robustness against perturbations. To address this problem, [23] proposes a holonic multi-agent system architecture, based on the notion of emergence (of a holarchy), that enables the dynamic creation and optimization of flexible ad hoc infrastructures for emergency e-logistics management. Conversely, for the management of a global road transportation system, [24] proposes a top-down approach, with a holarchy of real-time controllers coupled at the bottom of the system to cabled Road Units (RU). Dynamic job routing is another application in which dynamic control is important: [25] proposes a reconfigurable manufacturing control with a cooperative control algorithm that seeks, through negotiation, the best manufacturing treatment between resources (machines) and components (raw materials). An implementation has been evaluated on case studies. The coordination of controllers, each responsible for a local section of a conveyor and embedded in IEC 61499 function blocks, has also been proposed in [26] to control airport baggage handling systems. In [27], a multi-agent system is proposed to control shop floor assembly.
In this proposal, components can enter or leave the system without major variations in the production process, which can be reconfigured at run-time. [28] addresses the warehouse control domain. Control is distributed among Logic Holons, which interact with Order Holons and Resource Holons. Holons can adapt their behavior to the dynamics of the problem, or to broken-down equipment, by reorganizing or changing behaviors. Overall, most applications achieve robustness through decentralized control distributed over specific system components. For control to be managed coherently, these autonomous components have to cooperate.
4 Cooperation and Autonomy

Cooperation is the process of acting together to achieve a global system goal, or shared goals. It implies that system components cooperate with one another, which leads to the creation of groups and structures (holarchies, ...). An entity is considered autonomous not only if it can choose actions to reach its goals or desires, but also if it has some ability to alter its own preferences [29]. Cooperation sometimes leads to a temporary modification of an agent's goals, by adding or removing a constraint or changing its priority (a goal can be seen as a set of constraints, as in [30]). Autonomy in holonic systems is essential, especially in dynamic environments, for adapting rapidly to change and guaranteeing robustness. Nevertheless, the particular structure of a holonic system imposes social and cooperative rules that restrict this autonomy.
P. Leitao, P. Valckenaers, and E. Adam
We do not consider that cooperation can occur without sharing an agenda or the views of the others; for example, the classic collection of wood by termites is not the result of cooperative work, but merely the emergent result of individual acts. Competition in multi-agent systems is not necessarily opposed to cooperation; indeed, to be more competitive, some agents can decide to join a group and cooperate. Thus, we can have cooperative systems with egoistic agents. Co-opetition, a term that appeared in the last century, is cooperative competition; it occurs when competing systems share and cooperate on parts or issues on which they do not compete. Large enterprises commonly behave this way, especially when the environment is particularly demanding. An agent detects a need to cooperate with other agents when it lacks the resources or the capabilities to achieve its goal. It recognizes the necessity of cooperation in two cases:

• the agent lacks resources, or owns them but does not want to use them, and so tries to obtain them from other agents;
• the agent lacks the competence to achieve its goal, or its properties (memory, computational capacity) are too weak.
Cooperation can be explicit, following communication between agents, or implicit, based on the agents' observations of one another. When an agent is asked to cooperate and agrees to do so, it can choose either of two categories of cooperation [31]:

• Negotiation, where the agent chooses the cooperative actions that bring it closer to its own goal;
• Deliberation, where agents are more altruistic, and the agent chooses the actions that best help the group to complete its goals, rather than those that best serve itself.
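The two categories can be contrasted in a short sketch. It assumes that each candidate action carries two numeric estimates, one for the agent's own goal and one for the group's goal; these fields and the action names are hypothetical. The filter encodes the condition, stated just after this list, that an agent does not cooperate when the act would harm its own goal without benefiting its group.

```python
from typing import NamedTuple, Optional, Sequence


class Action(NamedTuple):
    name: str
    own_gain: float    # estimated progress toward the agent's own goal
    group_gain: float  # estimated progress toward the group's goal


def choose_action(actions: Sequence[Action], mode: str) -> Optional[Action]:
    """Pick a cooperative action; return None when none is admissible."""
    # Refuse acts that hurt the agent's goal and bring nothing to the group.
    admissible = [a for a in actions
                  if not (a.own_gain < 0 and a.group_gain <= 0)]
    if not admissible:
        return None
    if mode == "negotiation":
        # Self-interested choice: best for the agent's own goal.
        return max(admissible, key=lambda a: a.own_gain)
    if mode == "deliberation":
        # Altruistic choice: best for the group's goal.
        return max(admissible, key=lambda a: a.group_gain)
    raise ValueError(f"unknown cooperation mode: {mode}")


candidates = [
    Action("share_tool", own_gain=0.2, group_gain=0.9),
    Action("keep_tool", own_gain=0.6, group_gain=0.1),
    Action("block_line", own_gain=-0.5, group_gain=0.0),
]
print(choose_action(candidates, "negotiation").name)
print(choose_action(candidates, "deliberation").name)
```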
It is implied that cooperation is not possible for a given agent if the cooperative act would have a negative impact on its goal and no positive impact for its group. In a holonic multi-agent system, cooperation means that, even if each agent typically has its own goals and strategies within the pyramidal organization, all agents taken together have to lead the holonic multi-agent system to its intended global goal. This is generally achieved by letting the agents locally and autonomously modify their roles or behaviors, adapting their strategies for working with other agents; it is a self-adaptation property.

4.1 Notion of Cooperation and Autonomy

Section 2 has already shed some light on cooperation and autonomy in holonic systems. The cooperative behavior of a holon results in its participation in autocatalytic sets. This participation gives access to resources (i.e. economic means and useful information) that foster the holon's success and survival. Cooperation consists of interactions aimed at common goals, which are embodied by higher-level holon(s). Achieving the common goals triggers and sustains an autocatalytic cycle that feeds the participating holons with the resources that sustain and foster them.
The autonomy of a holon enables it to cope with a dynamic environment, where this environment includes elements both from the enclosing holarchy (i.e. other holons) and from outside this holarchy. In other words, their autonomy allows holons to deliver their services in a wide range of conditions. Because of their autonomy, holons enjoy autocatalysis to a larger extent and for longer periods. Self-organization and self-adaptation promise to contribute significantly in this respect. Importantly, the autonomy of a holon shields it from disturbances, in the sense of Simon's watchmakers parable. When the overall holonic system experiences turbulence from its dynamic environment, the autonomy of the lower-level holons shields them from most of these disturbances. Even more important is the mutual shielding from the environment dynamics. Holons avoid propagating disturbances within their scope and responsibility to their neighboring holons. Conversely, holons minimize the requirements they impose on neighboring holons, which makes it easier for those neighbors to handle and contain their share of the environment dynamics. To achieve the required autonomy, holons need to possess adequate rights over resources. These rights correspond to a decision space in which the holon implements a suitable strategy; the holon needs sufficient decision autonomy. The decision-making processes within a holon ensure its success, survival and development. The decision autonomy of a holon ensures the right ownership of the required resources. Cooperativeness ensures that these resources serve the common goals. Importantly, cooperativeness implies minimizing the rights over resources needed to achieve those goals and to ensure survival. Many non-holonic system designs fail in this respect because they have no explicitly managed resource allocation. They implicitly rely on a coarse-grained static resource allocation mechanism.
This results in poor or non-existent configurability or joint task execution. To ensure their survival and success, holons need to reduce the uncertainty regarding the required resources. In a turbulent environment, holons have no absolute certainty about external supplies. Holons must survive temporary shortages, variability in quality and type, etc. In other words, holons need resource autonomy in the way cars need a fuel tank. Moreover, holons need autonomy that enhances their chances of survival. The key element is the stability and longevity of the autocatalytic sets to which they belong. Holons need to preserve the benefits of these sets, which are exposed to an ever-changing environment. In addition, a holon needs to recognize novel opportunities and join or create new autocatalytic sets. This renewal needs to outpace the disappearance of sets that no longer sustain their autocatalysis because of changes in the environment. This is the survival autonomy of a holon.

4.2 How Can Cooperation Limit Autonomy?

Cooperation is the process of acting together to achieve a goal, often when this goal is unreachable by a single agent, for lack of either resources or capabilities. Cooperation is therefore a collaborative process induced by constraints or profit opportunities; these can originate from a demanding environment, or from the nature of the system goal.
In the first case, cooperation consists in obeying hard constraints; the cooperative nature of the agents does not actually limit their autonomy. In the second case, autonomy is restricted by the need to reach the global goal of the system or to optimize its general state, beyond the individual interests of the agents. In holonic multi-agent systems, a holon is a particular agent composed of other holons, which must not be members of other holons. Holons are autonomous, but cooperation inside a holon can restrict the autonomy of its members; indeed, under hard constraints, forced cooperation may be used. In [32] two cases are identified: (1) when the nature of the problem implies that the holon sequences the actions of its parts; (2) when a member agent lacks a capability (or has insufficient capabilities), in which case other members must be grouped together to solve the problem. Some applications use the notion of a mediator agent, which allows other agents to find partners with which to cooperate. For example, [23] proposes different levels of mediator: if a contacted agent refuses to cooperate (because of a lack of resources or capabilities, for example), and this causes a blocking situation, a higher-level mediator has to propose a solution on the basis of all its subordinate agents. Forced cooperation can be useful in emergencies. During normal operation, however, it is preferable to let the agents cooperate by themselves; methods for this exist in self-organizing systems. Indeed, it is also possible not to organize the holonic agents at all, but just to monitor them and charge them a cost when they violate a social constraint. An agent is thus free to choose whether or not to continue its asocial behavior, weighing its personal benefit against the cost it has to pay (an example of this solution can be found in [33]).
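The monitor-and-penalize alternative described at the end of this paragraph can be sketched as follows. This is only an illustration under simple assumptions (a fixed cost per violation and scalar benefits); it is not the mechanism of [33], and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Observes agents and charges a fixed cost per social-constraint violation."""
    violation_cost: float
    collected: float = 0.0

    def charge(self) -> float:
        self.collected += self.violation_cost
        return self.violation_cost


@dataclass
class HolonAgent:
    name: str
    balance: float = 0.0

    def decide(self, asocial_benefit: float, social_benefit: float,
               violation_cost: float) -> str:
        # The agent stays free: it violates the constraint only when the
        # benefit, net of the cost to be paid, beats the social option.
        net_asocial = asocial_benefit - violation_cost
        return "asocial" if net_asocial > social_benefit else "social"

    def act(self, asocial_benefit: float, social_benefit: float,
            monitor: Monitor) -> str:
        choice = self.decide(asocial_benefit, social_benefit,
                             monitor.violation_cost)
        if choice == "asocial":
            self.balance += asocial_benefit - monitor.charge()
        else:
            self.balance += social_benefit
        return choice


monitor = Monitor(violation_cost=3.0)
agent = HolonAgent("h1")
print(agent.act(asocial_benefit=10.0, social_benefit=5.0, monitor=monitor))
print(agent.act(asocial_benefit=6.0, social_benefit=5.0, monitor=monitor))
```

Raising `violation_cost` moves the system toward forced cooperation, while a cost of zero leaves the agents fully autonomous.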
The AMAS theory [34] defines a cooperative failure of autonomous agents as one of the following: incomprehension of the perceived signal; unproductiveness of the received information (no utility for the agent); or uselessness of the agent's actions for others (according to its beliefs). In the first two cases, the agent ignores the message or transmits it to agents that could process it. In the third case, the agent tries to relax its constraints in order to be able to choose an action useful to other agents. To reach agreement among holonic agents on the set of actions that resolves a conflict on the way to a common goal, principles of negotiation can be used. Negotiation guarantees the autonomy of the agents, which shift their constraints or desires in order to obtain global satisfaction. However, these principles raise the problem of trust between the negotiating agents regarding their levels of resources or capabilities. Interesting research results are discussed in the framework of egalitarian social welfare [35], which aims to bring all the agents of the system to an acceptable position while optimizing the global benefit of the system. In the same context, an auction protocol could be used. For example, [36] applies an auction protocol to select the most deserving bidder in order to resolve a deadlock in manufacturing systems. The auction must take into account both the personal profits of the agents and the need for the system to achieve its global goal as fully as possible. In this work, the balance between the priority of the autonomous entities and the satisfaction of the global system can be adapted.
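One simple way to picture an adjustable balance between individual profit and global satisfaction is a weighted score per bid. The sketch below is an assumption made for illustration, not the auction protocol of [36]: the function name, the `autonomy_weight` parameter and the linear scoring rule are all hypothetical.

```python
from typing import Dict, Tuple


def select_winner(bids: Dict[str, Tuple[float, float]],
                  autonomy_weight: float) -> str:
    """Return the bidder with the highest weighted score.

    bids maps a bidder name to (personal_profit, global_gain).
    autonomy_weight in [0, 1]: 1 favors the bidders' own profit,
    0 favors the satisfaction of the global system.
    """
    if not 0.0 <= autonomy_weight <= 1.0:
        raise ValueError("autonomy_weight must lie in [0, 1]")
    if not bids:
        raise ValueError("no bids submitted")

    def score(item: Tuple[str, Tuple[float, float]]) -> float:
        _, (personal_profit, global_gain) = item
        return (autonomy_weight * personal_profit
                + (1.0 - autonomy_weight) * global_gain)

    return max(bids.items(), key=score)[0]


bids = {"machine_1": (8.0, 2.0), "machine_2": (3.0, 9.0)}
print(select_winner(bids, autonomy_weight=0.9))
print(select_winner(bids, autonomy_weight=0.1))
```

With the same bids, changing only the weight changes the winner, which is the kind of tuning the text refers to.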
This choice is present in the work of [37], where an imposed goal is defined as a high-priority goal that limits the autonomy of a holon. This research uses the notion of a stigmergic holarchy (a holonic multi-agent system whose parts have stigmergic capabilities) that emerges from user needs, as well as the notions of need (for grouping), priority (intended goal/imposed goal) and leaving (of the holarchy). These ratios make it possible to tune the degree of autonomy of the holons and to find a good compromise for the application. The autonomy/cooperation ratio in fact depends on the environment and on the application; it is not possible to generalize and give a golden ratio. Two ways of finding this ratio appear: to develop, test and adapt the ratio, if possible; or to leave the entities free to adapt this ratio by themselves. The first option quickly yields a first solution that is admittedly sub-optimal but usable, while the second requires letting the agents self-adapt and choose the best ratio according to an evaluation of the global system performance, which has to be implemented in the holons. However, the second option is the only one able to cope with the evolution of the environment without external intervention, which is essential for a holonic system evolving in a dynamic environment.

4.3 Detection and Sanctions in Norms Violation

Open distributed agent systems, such as holonic multi-agent systems, are heterogeneous societies in which agents do not necessarily share the same interests, do not know and might not trust each other, but can work together and help each other. Norms play an important role in such systems in coping with the heterogeneity, the autonomy and the diversity of interests among their members, improving coordination and cooperation [38]. As in real-world societies, norms allow us to achieve social order by controlling the environment, making it more stable and predictable.
Norms are abstract rules usually used to regulate the agents' behavior and to ensure the trust needed for agents to establish commitments with one another. In the literature, norms are classified according to different criteria. As an example, [39], under the Electronic Institution framework, classifies norms according to their scope (i.e. institutional, constitutional and operational norms), and Therborn [39] classifies norms based on how they function in human interactions (i.e. regulative, constitutive and distributive norms). In terms of norm implementation, two distinct approaches may be used: defining constraints on unwanted behaviors, or defining violations and how to react to them. The agents operating in holonic multi-agent systems adopt a flexible and autonomous approach, which means that, despite being aware of existing norms, they are capable of violating those norms, learning new ones, negotiating over norms, influencing and persuading others, and controlling and monitoring others' behaviors. In fact, the autonomy left to the agents tends to make them deviate from their ideal behavior and consequently violate the norms. Violations can occur in the following cases [40]:
• An obligation is not fulfilled by the end of the period of obligation.
• A prohibited (forbidden) activity occurs during the period of prohibition.
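The two violation cases can be written as simple predicates over timestamps. The sketch below assumes numeric time values and a log of (time, action) events; the data structures and names are hypothetical and are not taken from [40].

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Obligation:
    action: str
    deadline: float  # end of the period of obligation


@dataclass
class Prohibition:
    action: str
    start: float  # start of the period of prohibition
    end: float    # end of the period of prohibition


def obligation_violated(ob: Obligation, done_at: Optional[float],
                        now: float) -> bool:
    """True once the deadline has passed without timely fulfilment."""
    if done_at is not None and done_at <= ob.deadline:
        return False  # fulfilled within the period of obligation
    return now > ob.deadline


def prohibition_violated(pr: Prohibition,
                         events: Iterable[Tuple[float, str]]) -> bool:
    """True if the forbidden action was logged inside the prohibition period."""
    return any(action == pr.action and pr.start <= t <= pr.end
               for t, action in events)


deliver = Obligation("deliver_part", deadline=10.0)
no_maintenance = Prohibition("maintenance", start=0.0, end=5.0)
print(obligation_violated(deliver, done_at=None, now=12.0))
print(prohibition_violated(no_maintenance, [(3.0, "maintenance")]))
```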
Violations are sometimes allowed, or even encouraged, since it is sometimes better to do "damage control" afterwards than simply to try to prevent the damage from happening; this allows a more robust system. When violations are allowed, sanctions must be adapted to regulate the agents' behavior. In order to control the system operation in accordance with the norms, and to detect and handle violations, normative systems have enforcement mechanisms, which define extra regulations covering two aspects: checking and reaction. By enforcing norms, it is possible to guide and supervise the behavior of rational agents. The checking part of the norm specifies the policy of the normative system for detecting violations, including who checks and when [39]. Detecting violations is a complex task in distributed systems, and different mechanisms can be applied: some perform random checks and others check on a schedule, but both verify whether the deadline has elapsed. As for who detects the violation, several possibilities can be considered: the environment, an external human, the agent that has established an agreement with the violating agent, a higher-level agent, or a policy management agent. Violations should be detected as soon as they occur and, in some cases, violations of certain expectations need to be detected without waiting for some event to happen. Constraint propagation techniques seem suitable for detecting as early as possible whether an expectation will never be fulfilled [41]. Once a violation is detected, a decision about sanctions must be made; for this purpose, the reaction part of the norm defines the reaction procedures against the violation. The sanction mechanism may include punishments (when a violation occurs) and rewards (when no violation occurs). These can be defined in advance (i.e. static sanctions established by designers) or during the run-time operation of the agents, when they establish agreements as a result of negotiation processes (i.e. sanction rules built during run-time). When dealing with the fulfillment or violation of norms, it is crucial to update the agent's reputation, increasing or reducing it according to the agent's compliance with the norm. This issue is crucial in trust-based and reputation-based systems, which play an indispensable role in systems working with incomplete or even erroneous information. The following example illustrates a norm definition, including the specification of how it is checked and what the reaction procedures are:

Norm: The machine agent is obliged to execute the operation belonging to the task agent by the mutually agreed deadline.
Check: The task agent should perform random checks of the operation status.
Reaction: If the machine has not executed the operation by the deadline, then it will be fined accordingly.

In the previous example, the norm concerns the cooperation between a task agent that has an operation to be executed and a machine agent that has the skills to execute it. The check part defines who detects violations and when, and the reaction part defines the sanction to be applied if the norm is violated.
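The norm, check and reaction of this example can be put together in a minimal sketch. The fine amount, the probability of a random check and the size of the reputation penalty are assumptions made for illustration; the chapter specifies none of them, and the class names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class MachineAgent:
    name: str
    reputation: float = 1.0
    fines_paid: float = 0.0
    finished_at: Optional[float] = None  # time the operation was completed


@dataclass
class OperationNorm:
    deadline: float           # mutually agreed deadline
    fine: float               # sanction applied on violation (assumed value)
    check_probability: float  # chance that the task agent checks at a given time

    def check(self, machine: MachineAgent, now: float,
              rng: random.Random) -> bool:
        """Random check by the task agent; True if a violation is detected."""
        if rng.random() >= self.check_probability:
            return False  # no check performed this time
        done_in_time = (machine.finished_at is not None
                        and machine.finished_at <= self.deadline)
        return now > self.deadline and not done_in_time

    def react(self, machine: MachineAgent) -> None:
        """Reaction: fine the machine and lower its reputation."""
        machine.fines_paid += self.fine
        # The 0.1 reputation step is an arbitrary illustrative choice.
        machine.reputation = max(0.0, machine.reputation - 0.1)


rng = random.Random(0)
norm = OperationNorm(deadline=10.0, fine=50.0, check_probability=1.0)
late_machine = MachineAgent("m1")
if norm.check(late_machine, now=12.0, rng=rng):
    norm.react(late_machine)
print(late_machine.fines_paid, late_machine.reputation)
```

Setting `check_probability` below 1 makes detection probabilistic, which trades monitoring effort against the risk of missing a violation.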
5 Conclusions

Holonic multi-agent systems are pyramidal systems that are, as are all their member holons, autonomous, robust and cooperative. Moreover, they are particularly suited to solving problems in dynamic and demanding environments, where more classical solutions fail. These three properties are essential:

1. Autonomy is fundamental to allow the system to adapt itself to the evolution of the environment, of the resources, and of the inputs;
2. Robustness is essential for providing reliable solutions that can be used in real cases in enterprises. Robustness is based on (multi-level) control distributed over the agents, each of which is responsible for local fault detection and recovery, and which interact with one another to maintain global robustness;
3. Cooperation is a key issue in holonic multi-agent systems: the objective of a holonic multi-agent system is essentially to manage complex industrial processes, or complex networks of components/data, in order to lead the system to a global objective.

The three key properties are positively correlated:

• Robustness requires that the components be autonomous and able to detect and react by themselves to perturbations; it also needs cooperation between the holons to respond to severe constraints.
• Cooperation needs the autonomy of the entities; the holons decide by themselves, without control, to communicate, collaborate and cooperate with other agents.

However, there are also negative interactions between the properties:

1. The need for robustness constrains an agent's autonomy: its objective is essentially to protect the holonic multi-agent system against external perturbations, but it also has to ensure that no agent violates the social rules.
2. The cooperative nature of the holons allows them to find assistance when they are in a critical state, due to a lack of resources and/or capabilities. However, the need to lead the system to an optimal state, or to its goal, can compel the holons to cooperate in order to release constraints on the jammed holons, and can thus reduce their autonomy.

An equilibrium has to be found between the three properties: too much autonomy leads to a highly adaptive system, but also to possible system instability due to the low level of robustness and cooperation.
Conversely, limited autonomy of the holons implies a rigid holonic multi-agent system unable to adapt to the environment and to perturbations. This equilibrium between robustness, cooperation and autonomy can be defined a priori at design level, according to the application and its inputs. However, in order to manage the evolution of the environment and of the application (modification of resources, of material, etc.), it is essential that this equilibrium be computed dynamically and online by the holons themselves. Thus, the notions of self-adaptation and self-organization are indispensable for obtaining a reliable and successful holonic multi-agent system.
There is in fact a double implication between holonic properties and self-* properties. Indeed, while holonic multi-agent systems need self-* properties, developers of self-adaptable systems should take into account the need for the robustness and cooperation defined by holonic properties if they are to set up a system that is viable in the long term. The discussion presented in this paper illustrates this double implication.

Acknowledgments. This paper presents work funded by the Research Fund of the K.U. Leuven Concerted Research Action on Autonomic Computing for Distributed Production Systems. The present research work has also been supported by the International Campus on Safety and Intermodality in Transportation, the Nord-Pas-de-Calais Region, the European Community, the Regional Delegation for Research and Technology, the Ministry of Higher Education and Research, and the National Center for Scientific Research. The authors gratefully acknowledge the support of these institutions.
References

1. Wooldridge, M.: An Introduction to Multi-Agent Systems. John Wiley & Sons, Chichester (2002)
2. Köstler, A.: The Ghost in the Machine. Arkana (1990)
3. Tharumarajah, A., Wells, A.J., Nemes, L.: Comparison of the Bionic, Fractal and Holonic Manufacturing System Concepts. International Journal of Computer Integrated Manufacturing 9, 217–226 (1996)
4. Okino, N.: Bionic Manufacturing System. In: Peklenik, J. (ed.) CIRP Flexible Manufacturing Systems: Past-Present-Future, pp. 73–95 (1993)
5. Kwangyeol, R., Mooyoung, J.: Agent-based Fractal Architecture and Modelling for Developing Distributed Manufacturing Systems. International Journal of Production Research 41(17), 4233–4255 (2003)
6. Simon, H.A.: The Sciences of the Artificial. MIT Press, Cambridge (1990)
7. Waldrop, M.: Complexity, the Emerging Science at the Edge of Order and Chaos. Viking, London (1992)
8. Kitano, H.: Biological Robustness. Nature Reviews Genetics 5, 826–837 (2004)
9. Stelling, J., Sauer, U., Szallasi, Z., Doyle III, F.J., Doyle, J.: Robustness of cellular functions. Cell 118, 675–685 (2004)
10. Zhou, K., Doyle, J.: Essentials of Robust Control. Prentice-Hall, Englewood Cliffs (1997)
11. Leitao, P.: An Agile and Adaptive Holonic Architecture for Manufacturing Control. PhD Thesis, University of Porto, Portugal (2004)
12. Brussel, H.V., Wyns, J., Valckenaers, P., Bongaerts, L.: Reference Architecture for Holonic Manufacturing Systems: PROSA. Computers in Industry 37(3), 255–274 (1998)
13. Valckenaers, P., Van Brussel, H.: Holonic Manufacturing Execution Systems. CIRP Annals - Manufacturing Technology 54(1), 427–432 (2005)
14. Valckenaers, P., Saint Germain, B., Verstraete, P., Van Belle, J., Hadeli, Van Brussel, H.: Intelligent Products: Agere versus Essere. Computers in Industry 60(3), 217–228 (2009)
15. Leitão, P., Restivo, F.: ADACOR: A Holonic Architecture for Agile and Adaptive Manufacturing Control. Computers in Industry 57(2), 121–130 (2006)
16. Hadeli, V.P., Verstraete, P., Saint Germain, B., Van Brussel, H.: A Study of System Nervousness in Multi-agent Manufacturing Control System. In: Brueckner, S.A., Di Marzo Serugendo, G., Hales, D., Zambonelli, F. (eds.) ESOA 2005. LNCS (LNAI), vol. 3910, pp. 232–243. Springer, Heidelberg (2005)
17. Van Dyke Parunak, H., Brueckner, S., Weyns, D., Holvoet, T., Verstraete, P., Valckenaers, P.: Pluribus Unum: Polyagent and Delegate MAS Architectures. In: Antunes, L., Paolucci, M., Norling, E. (eds.) MABS 2007. LNCS, vol. 5003, pp. 36–51. Springer, Heidelberg (2008)
18. Verstraete, P., Valckenaers, P., Van Brussel, H., Saint Germain, B., Hadeli, K., Van Belle, J.: Towards robust and efficient planning execution. Engineering Applications of Artificial Intelligence 21(3), 304–314 (2008)
19. Sugumaran, V.: Intelligent Information Technologies and Applications. Idea Group Inc. (2007)
20. Schwaiger, A., Stahmer, B.: Probabilistic Holons for Efficient Agent-based Data Mining and Simulation. In: Mařík, V., William Brennan, R., Pěchouček, M. (eds.) HoloMAS 2005. LNCS, vol. 3593, pp. 50–63. Springer, Heidelberg (2005)
21. Jarvis, J., Jarvis, D., McFarlane, D.: Achieving Holonic Control: An Incremental Approach. Computers in Industry 51(2), 21–223 (2003)
22. Mezgar, I., Kovacs, G.L.: Co-ordination of SMEs' Production Through a Co-operative Network. Journal of Intelligent Manufacturing 9(2), 167–172 (1998)
23. Ulieru, M., Unland, R.: A Holonic Self-organization Approach to the Design of Emergent e-Logistics Infrastructures. In: Di Marzo Serugendo, G., Karageorgos, A., Rana, O.F., Zambonelli, F. (eds.) ESOA 2003. LNCS, vol. 2977, pp. 139–156. Springer, Heidelberg (2004)
24. Versteegh, F., Salido, M., Giret, A.: A Holonic Architecture for the Global Road Transportation System. Journal of Intelligent Manufacturing (2009)
25. Sheremetov, L., Muñoz, J.M., Guerra, J.: Agent Architecture for Dynamic Job Routing in Holonic Environment Based on the Theory of Constraints. In: Mařík, V., McFarlane, D.C., Valckenaers, P. (eds.) HoloMAS 2003. LNCS, vol. 2744, pp. 124–133. Springer, Heidelberg (2003)
26. Black, G., Vyatkin, V.: On Practical Implementation of Holonic Control Principles in Baggage Handling Systems Using IEC 61499. In: Mařík, V., Vyatkin, V., Colombo, A.W. (eds.) HoloMAS 2007. LNCS (LNAI), vol. 4659, pp. 314–325. Springer, Heidelberg (2007)
27. Cândido, G., Barata, J.: A Multiagent Control System for Shop Floor Assembly. In: Mařík, V., Vyatkin, V., Colombo, A.W. (eds.) HoloMAS 2007. LNCS (LNAI), vol. 4659, pp. 293–302. Springer, Heidelberg (2007)
28. Moneva, H., Caarls, J., Verriet, J.: A Holonic Approach to Warehouse Control. In: 7th International Conference on PAAMS 2009. AISC, vol. 55, pp. 1–10 (2009)
29. Dworkin, G.: The Theory and Practice of Autonomy. Cambridge University Press, Cambridge (1988)
30. Adam, E., Grislin-Le Strugeon, E., Mandiau, R.: Flexible Hierarchical Organisation of Role Based Agents. In: Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems, pp. 186–191 (2008)
31. Doran, J.E., Franklin, S., Jennings, N.R., Norman, T.J.: On Cooperation in Multi-Agent Systems. The Knowledge Engineering Review 12(3), 309–314 (1997)
32. Mondal, S., Tiwari, M.K.: Application of an Autonomous Agent Network to Support the Architecture of a Holonic Manufacturing System. The International Journal of Advanced Manufacturing Technology 20(12), 931–942 (2002)
33. Gou, L., Luh, P.B., Kyoya, Y.: Holonic Manufacturing Scheduling: Architecture, Cooperation Mechanism, and Implementation. In: Proceedings of IEEE/ASME International Conference on Advanced Intelligent Mechatronics (1997)
34. Bernon, C., Camps, V., Gleizes, M.P., Picard, G.: Engineering Adaptive Multi-Agent Systems: The ADELFE Methodology. In: Henderson-Sellers, B., Giorgini, P. (eds.) Agent-Oriented Methodologies, pp. 172–202. Idea Group Pub., New York (2005)
35. Estivie, S., Chevaleyre, Y., Endriss, U., Maudet, N.: How equitable is rational negotiation? In: Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 866–873. ACM Press, New York (2006)
36. Opadiji, J.F., Kaihara, T.: Distributed Production Scheduling Using Federated Agent Architecture. In: Mařík, V., Vyatkin, V., Colombo, A.W. (eds.) HoloMAS 2007. LNCS (LNAI), vol. 4659, pp. 195–204. Springer, Heidelberg (2007)
37. Ulieru, M., Grobbelaar, S.: Holonic Stigmergy as a Mechanism for Engineering Self-Organizing Applications. In: Proceedings of the 3rd International Conference of Informatics in Control, Automation and Robotics (ICINCO), pp. 5–10 (2006)
38. Cardoso, H., Oliveira, E.: Towards an Institutional Environment Using Norms for Contract Performance. In: Pěchouček, M., Petta, P., Varga, L.Z. (eds.) CEEMAS 2005. LNCS (LNAI), vol. 3690, pp. 256–265. Springer, Heidelberg (2005)
39. Therborn, G.: Back to Norms! On the Scope and Dynamics of Norms and Normative Action. Current Sociology 50(6), 863–880 (2002)
40. Derakhshan, F., McBurney, P., Bench-Capon, T.: Towards Dynamic Assignment of Rights and Responsibilities to Agents. In: Sarbazi-Azad, H., et al. (eds.) CSICC 2008. CCIS, vol. 6, pp. 1004–1008. Springer, Heidelberg (2008)
41. Alberti, M., Gavanelli, M., Lamma, E., Mello, P., Torroni, P.: Specification and Verification of Agent Interaction using Social Integrity Constraints. Electronic Notes in Theoretical Computer Science 85(2), 94–116 (2004)
Context Oriented Information Integration

Mukesh Mohania1, Manish Bhide1, Prasan Roy2, Venkatesan T. Chakaravarthy1, and Himanshu Gupta1

1 IBM India Research Lab, Plot-4, Block-C, Institutional Area, Vasant Kunj, New Delhi, India – 110070
{mkmukesh,abmanish,vechakra,higupta9}@in.ibm.com
2 Aster Data Systems, CA, USA
[email protected]
Abstract. Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of seamlessly integrating critical business information distributed across both structured and unstructured data sources. Researchers have studied this problem, but many obstacles still stand in the way of its widespread use in practice. One of the key problems is the absence of a schema in unstructured text. In this paper we present a new paradigm for integrating information that overcomes this problem: Context Oriented Information Integration. The goal is to integrate unstructured data with the structured data present in the enterprise and use the extracted information to generate actionable insights for the enterprise. We present two techniques that enable context oriented information integration and show how they can be used to solve real world problems.

Keywords: Information Integration, Unstructured Data Integration, Context Oriented Information Integration, SCORE, EROCS.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 289–326, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

Traditionally, people have focused on integrating data from various structured data sources. However, according to analysts [28], around 80% of the data in an enterprise is unstructured in nature. Researchers have studied the challenges in integrating structured and unstructured data, but the problem is far from being solved. Thus, integrating information in the presence of unstructured data is a huge challenge faced by enterprises today. To motivate the problem further, we present a real world problem faced by banks today. A typical bank maintains different types of data such as customer profile data, customer transaction data, customer complaint data, etc. Most of the data that is being used by banks today is in relational format. However, banks also maintain large amounts of unstructured information such as customer emails, customer complaint data, customer phone call records, etc. This information currently sits in a silo and is not used by the bank in its processes. The reason for the creation of these silos is not hard to fathom. In a typical banking setup, the growth of the bank leads to new systems being added from time to time. One of the first systems to be acquired is the data warehouse, which is
absolutely vital for the bank. The data warehouse stores all the customer information such as customer profile, customer product holding, customer balance, etc. In other words, the data warehouse stores almost all the relational data about the customer (in a summarized format). Over time, as the customer base grows, the number of customer complaints increases. This necessitates a complaint management system that keeps track of all the customer complaints received by the bank. The customer complaints are generally in textual format and are not a part of the data warehouse. A similar problem exists with the email management system used by banks to process customer emails. These document management systems and the data warehouse come from different software vendors and are not integrated with one another. Further, the document management systems are developed for very specific purposes and do not provide a mechanism for analyzing the text data beyond what is required by the application at hand. For example, the email management system only deals with the information that is necessary for the bank agent to reply to the email. It does not extract extra information, such as customer band or product holding, that could be used for cross-selling products or event-based marketing. Thus the growth of an enterprise leads to the creation of isolated information sources which do not interact with each other. Hence the enterprise is unable to fully leverage the information present across the diverse data sources. From the perspective of the enterprise, an ideal setting would be a consolidated data warehouse that holds relational as well as unstructured data.
Thus, the major problems faced by enterprises today are: (1) integrating information across the diverse structured and unstructured data sources present in the enterprise, and (2) extracting "extra" meaningful information from the unstructured data sources so that it can be used for marketing, business intelligence, etc. In this paper we present a new paradigm for integrating information: Context Oriented Information Integration. The goal is to integrate unstructured data with the structured data present in the enterprise and use the extracted information to generate actionable insights for the enterprise.
2 Background

Effective knowledge management, however, requires seamless access to information in its totality, and enterprises are fast realizing the need to bridge this separation. This has led to a significant effort towards integration of structured and unstructured data [1, 2, 3, 4, 5, 6, 7, 8, 9]. This work can be broadly classified into two categories based on the query paradigm:

1) Keyword query based solutions
2) SQL query based solutions

In the first paradigm, the relational data is exposed to the search engines as virtual text documents. The search engines then work on top of both relational and unstructured data and allow users to query them using keywords. This functionality is provided by systems like DB2 Enterprise Server Edition, DbXplorer and the BANKS project [29]. The BANKS work addresses the problem of keyword search in relational databases. Given a set of terms as input, the task is to retrieve the
candidate join-trees (groups of rows across multiple tables in the database interlinked through foreign-key relationships) that collectively contain these terms; these candidate join-trees are also ranked against each other based on relevance. The biggest advantage of keyword queries is their simplicity: they do not require an understanding of the complex query syntax inherent to SQL. However, this simplicity is also the source of their biggest disadvantage: keyword queries are much less expressive than SQL. For example, the following query cannot be represented using a keyword query: "Give me the information related to the five best performing stocks in the past week". The second paradigm tries to avoid this problem by using SQL queries to query both relational and unstructured data.
    SELECT stocks.price, docs.text
    FROM stocks, docs
    WHERE (stocks.name = 'IBM' AND CONTAINS(docs.text, 'IBM'))
       OR (stocks.name = 'ORCL' AND CONTAINS(docs.text, 'ORCL'))

[Figure components: the SQL query above; a sample relational table with columns C1, C2 and C3; DB2 UDB / WebSphere Information Integrator; the CONTAINS(…) predicate; Net Search Extender.]

Fig. 1. SQL Query Based Solution
In the second paradigm, text data is exposed to relational engines as virtual tables with text columns. The user then uses the SQL Query language to query both structured and unstructured data. Thus this approach provides a single point of access to both structured and unstructured data sources. Such an approach is used in DB2 NSE [10], which exposes the unstructured data as a virtual relational table (with one row per document) and introduces a CONTAINS predicate that can be used to filter the relevant documents from this table based on a set of keywords. While this alleviates the issue of having to interface with each source individually, the application still needs to formulate the SQL logic to retrieve the needed structured data on one hand, and identify a set of keywords to retrieve the related unstructured data on the other. This is not ideal for the following reasons:
o The same information need is formulated using two disparate paradigms (SQL for structured data, and a set of keywords for the unstructured data), which is redundant effort.
o In many cases, it is hard (even impossible) for the application to identify the appropriate keywords needed to retrieve the related unstructured data.
Hence what is needed is a way to perform information integration that does not require the user to list a set of keywords and that allows the user to extract meaningful information from the unstructured data. In this paper we present two such techniques for integrating unstructured and structured data. The first technique, SCORE [11], automatically associates related unstructured content with the SQL query result, thereby eliminating the need for the application to specify a set of keywords in addition to the SQL query. Specifically, SCORE works as follows:

1) The application specifies its information needs using only a SQL query on the structured data. SCORE executes the query on the RDBMS.
2) SCORE "translates" the given SQL query into a set of keywords that, in effect, reflect the same information need as the input query.
3) SCORE uses these keywords to retrieve relevant unstructured content using a search engine. This unstructured content is then associated with the result of the given SQL query computed earlier, and returned to the application.

The use of SCORE is illustrated in the following example. Consider an investment information system that helps users to analyze stock market data. The system maintains the stock ticker archived over the past week, company profiles, information about the various institutional investors, mutual-fund portfolios, etc. (structured data). It also
maintains a searchable repository of the past week's news stories, advisories and recent analyst reports (unstructured data). Now, consider the scenario illustrated in Figure 2, wherein the application submits a query asking for the names of the three companies with maximum stock price variation in the past week. With this query as input, SCORE explores the query's result as well as neighboring tables containing related information, and identifies the keywords that are most relevant to the profiles of these stocks. These keywords form the context of the input query. In this example, these keywords are "IBM", "ORCL", "MSFT" (obtained from the stocks table) and "Database", "Software" (obtained from the company profile tables). These keywords are then used to retrieve relevant news stories and related advisories and reports using the search engine, which are then returned to the application along with the SQL query result.

Fig. 2. SCORE Overview

The second technique, EROCS [12], addresses the problem of linking a document with related structured data in an external relational database. EROCS views the structured data in the relational database as a set of predefined "entities" and identifies the entities from this set that best match the given document. EROCS further finds embeddings of the identified entities in the document; these embeddings are essentially linkages that interrelate relevant structured data with segments within the given document. As an example, consider a retail organization where the structured data consists of all information about sales transactions, customers and products. The organization, with a network of multiple stores, has a steady inflow of complaints into a centralized complaint repository; these complaints are accepted through alternative means, such as a web-form, email, fax and voice-mail (which is then transcribed).
Each such complaint is typically a free-flow narrative text about one or more sales transactions, and is not guaranteed to contain the respective transaction identifiers; instead, it might divulge, by way of context, limited information such as the store name, a partial list of items bought, the purchase dates, etc. Using this limited information, EROCS discovers the potential matches with the transactions present in the sales transactions database and links the given complaint with the matching transactions. Such linkage provides actionable context to a typically fuzzy, free-flow narrative, which can be profitably exploited in a variety of ways.

o In the above example, we can build an automated complaint routing system. Given the transaction automatically linked with the complaint, this system retrieves from the relational database additional information about the transaction (such as the type and value of the items purchased, specific promotions availed and the customer's loyalty level), and routes the complaint to an appropriate department or customer service representative based on the same.
o Consider a collection of complaints that have been linked to the respective transactions in the relational database. This association can be exploited in OLAP analytics to derive useful information such as regions or product categories that have shown a recent upsurge in complaints.
In addition to the database-centric uses mentioned above, the additional information provided by these linkages can be effectively exploited in entity-based search [15], question answering, document understanding and a host of other related problems in information retrieval [14] and extraction [13].
Paper Organization: We outline the SCORE approach in Sections 3-5 and then provide the details of EROCS in Sections 6-7. We then show the practicality of these two approaches by outlining various use cases in Section 8. Section 9 concludes the paper.
3 SCORE Approach

The core idea behind SCORE is to compute the context of a SQL query using its result. We first provide a working definition of the context of a query. Based on this definition, we present an algorithm to compute the context from the result of the input query.

3.1 Context of Query Result

Consider a query Q on a single table R, and let Q(R) denote the result of the query Q. We model the context of Q as the set of terms that (a) are popular in Q(R), and (b) are rare in R − Q(R). Intuitively, these terms differentiate the query result Q(R) from R − Q(R), the remaining rows in the table. Let $N_Q(A, t)$ denote the number of rows in which the term t appears in the column $A \in \mathrm{cols}(Q)$ (formally, $N_Q(A, t) = |\sigma_{A=t}(Q(R))|$). Further, let $N_R(A, t)$ denote the number of times the term t appears in the column $A \in \mathrm{cols}(R)$ (formally, $N_R(A, t) = |\sigma_{A=t}(R)|$). Then, for a term $t \in A$ where $A \in \mathrm{cols}(Q)$, SCORE measures the term t's relevance to the query Q as:

\[ TW(A, t) = N_Q(A, t) \times \log\left( \frac{1 + |R| - |Q(R)|}{1 + N_R(A, t) - N_Q(A, t)} \right) \]

In the above, the first factor measures the popularity of t in Q(R) while the second factor measures the rarity of t in R − Q(R). A straightforward algorithm to compute the context of a query Q is to rank the terms in the query result on the basis of TW(A, t) and pick the top N terms, where N is a user-defined parameter. In the next section, we extend this algorithm to look for the context beyond the given query result.

3.2 Exploring beyond the Query Result

The algorithm presented in the previous section computes the context of a query from its result. In this section we extend the algorithm to look for the context beyond the given query. A simple way to include more terms of relevance to a particular row in the query result is to look at columns that are present in the underlying tables in the database, but do not appear in the query result because they are projected out for convenience. This is easily incorporated by analyzing the input SQL query and removing the projection constraints upfront. However, the algorithm so far is still limited to the boundaries of tables that, in effect, are decided by how normalized the database schema is. Normalization results in distribution of related information across several tables, connected by foreign-key relationships. For instance, consider a bio-classification
database with two tables: Species, with one row per distinct species, and Genus, with one row per distinct genus. The information about the genus of a particular species in the Species table is encoded as a foreign key "pointing" to the corresponding row in the Genus table. In this section, we show how the algorithm presented in the previous section can be extended in order to exploit these relationships among the tables.

3.2.1 Alternative Relationship Directions: Forward vs. Backward

Before we get into the details of this extension, it is important to realize that there are two ways in which the foreign-key relationships can be exploited. The first, as illustrated above, is by following the foreign-key "pointers" in the forward direction; this takes us from sub-concepts to encompassing super-concepts (for instance, from species to their genus in the bio-classification database). In relational terms, this involves joining with tables referenced by foreign-key columns present in tables included in the query. The second way is by following the foreign keys in the backward direction; this takes us from super-concepts to all encompassed sub-concepts (for instance, from a genus to all the species in that genus in the bio-classification database). In relational terms, this involves joining with tables that contain foreign-key columns that reference tables included in the query. Currently SCORE exploits the foreign-key relationships in the forward direction only. This is justified since exploration in the forward direction works well in most target applications. For instance, the Star/Snowflake schema, commonplace in OLAP/BI environments, consists of a "fact" table with foreign keys to multiple "dimension" tables, and queries on this schema typically involve the fact table. Exploiting backward relationships is interesting future work.
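The two directions can be illustrated with a small sketch in Python using the standard `sqlite3` module. The table layout, column names and rows below are hypothetical, chosen only to mirror the Species/Genus example; the use of a left outer join is an assumption made here so that no row of the starting table is dropped.

```python
import sqlite3

# Hypothetical bio-classification schema and data (not from the paper).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genus (genus_id INTEGER PRIMARY KEY, name TEXT, family TEXT);
    CREATE TABLE species (
        species_id INTEGER PRIMARY KEY,
        name TEXT,
        genus_id INTEGER REFERENCES genus(genus_id)
    );
    INSERT INTO genus VALUES (1, 'Panthera', 'Felidae'), (2, 'Canis', 'Canidae');
    INSERT INTO species VALUES
        (10, 'leo', 1), (11, 'tigris', 1), (12, 'lupus', 2), (13, 'unknown', NULL);
""")

# Original query: touches the Species table only.
original = conn.execute(
    "SELECT species_id, name, genus_id FROM species ORDER BY species_id"
).fetchall()

# Forward direction: follow species.genus_id to the referenced Genus row.
# Each species row gains extra columns; no row is added or removed.
forward = conn.execute("""
    SELECT s.species_id, s.name, s.genus_id, g.name, g.family
    FROM species AS s LEFT OUTER JOIN genus AS g ON s.genus_id = g.genus_id
    ORDER BY s.species_id
""").fetchall()

# Backward direction: start from a genus and collect every species whose
# foreign key references it. One genus row fans out into several rows.
backward = conn.execute("""
    SELECT g.genus_id, g.name, s.name
    FROM genus AS g LEFT OUTER JOIN species AS s ON s.genus_id = g.genus_id
    WHERE g.name = 'Panthera'
    ORDER BY s.species_id
""").fetchall()

# Projecting the forward result onto the original columns recovers the
# original result exactly.
recovered = [row[:3] for row in forward]
```

In this sketch the forward join leaves the starting rows intact and only appends columns, whereas the backward join multiplies the single genus row, which is one way to see why the forward direction is the simpler one to support.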
3.2.2 Exploring Forward Relationships

Moving back to the details of the algorithm, recall that for each row in the query result, we want to follow the foreign-key pointers and gather more terms, beyond those present in the query result. In relational terms, this amounts to augmenting the query by adding a foreign-key (left-outer) join with the referenced table. Note that, with this perspective, following foreign keys in the forward direction also has the desirable effect that the extra information is just an appendage to the original query result, which remains untouched. In other words, the original query result can be extracted from the augmented query result by simply reintroducing the projection constraints; formally, if Q is the original query and AQ is the augmented query then $Q = \pi_{\mathrm{cols}(Q)}(AQ)$. In essence, thus, the algorithm aims to extend the input query (with its projection constraints already removed) by augmenting it with other tables in the database reachable through foreign-key relationships. To achieve this, two issues need to be addressed. First, joining with all possible tables reachable through foreign-key relationships might be too expensive, and unnecessary, so we need to select a subset. In the rest of this section, we show how SCORE selects the optimal subset from the possible alternatives. Since we need to limit the computation involved, we define a parameter M as the maximum number of tables that can be augmented to the query. Ideally, we would
like to select a subset of M tables on which the query has maximum focus; but before we can do that, we need to quantify the focus of the query on a table. Let $F \in \mathrm{cols}(Q)$ be a foreign-key column in the query, and let R be the table referenced by F. Since each term t in the column F references a single row in R, we can interpret $TW(F, t)$, our measure of the query's focus on the term t (Section 3.1), as a measure of the query's focus on the corresponding row in R as well. Now, given two foreign-key columns $F_A, F_B \in \mathrm{cols}(Q)$ referencing tables $R_A$ and $R_B$ respectively, we say that the query is more focused on $R_A$ than on $R_B$ if there exists a term $t \in F_A$ such that the query's focus on t is more than its focus on any term in $F_B$. The focus of a query Q on the table referenced by a foreign-key column $F \in \mathrm{cols}(Q)$ is accordingly measured using a column weight function CW defined as:
\[ CW(F) = \max_{t \in F} TW(F, t) \]

A simple-minded algorithm to select the optimal subset of tables could thus be to find, by traversing the schema graph, all the tables reachable from the tables already present in the query by foreign-key relationships and pick the M tables with the maximum focus. However, this algorithm is not practical, not only because it is exponential-time in the number of tables, but also because the query's focus on a table is known only if the corresponding foreign key $F \in \mathrm{cols}(Q)$ (in other words, we only know the query's focus on the tables directly referenced by the query, but not on the tables referenced in turn by the foreign keys in these tables). In SCORE, thus, we follow a greedy strategy wherein we iteratively build up the set of tables to augment with. The algorithm maintains the set S of candidate foreign-key columns in the query as augmented so far (call this the augmented query AQ); since $S \subseteq \mathrm{cols}(AQ)$, we know $CW(F)$ for each foreign-key column $F \in S$. In each iteration, the algorithm picks $F \in S$ with the maximum $CW(F)$ and augments AQ with the table referenced by F; as it does so, it computes the weights of the foreign-key columns added as a result and replaces F in S by these foreign keys. The algorithm is presented formally in Section 3.3.

3.2.3 Scaling the Weights of Added Terms

Since the terms in the columns added as a result of the query expansions discussed above are not part of the original query result, it can be argued that the weights of these terms should be scaled down. Since this is subjective, depending on the way the data is structured and on the specific queries that form part of the workload, we leave the decision to the user (or the DBA) and define a tunable scaling parameter $\beta \in [0, 1]$ that scales down the weights of these terms from what they would have been if these columns formed part of the original query. For all the cases we came across, however, it seemed reasonable to take $\beta = 1$ (no scaling).
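The term weight TW of Section 3.1, the column weight CW and the scaling parameter β can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are plain dictionaries, the query result is a subset of the table's rows, the logarithm is taken to be the natural logarithm (the base is not specified in the text), and the function names, the `beta` argument and the sample stock data are hypothetical.

```python
import math
from collections import Counter

def term_weights(table_rows, result_rows, column, beta=1.0):
    """TW(A, t) for every term t of column A that occurs in the query result.

    beta is the scaling factor for columns added by augmentation
    (1.0 means no scaling). Assumes result_rows is a subset of table_rows.
    """
    n_r = Counter(row[column] for row in table_rows)    # N_R(A, t)
    n_q = Counter(row[column] for row in result_rows)   # N_Q(A, t)
    outside = len(table_rows) - len(result_rows)        # |R| - |Q(R)|
    return {
        t: beta * n * math.log((1 + outside) / (1 + n_r[t] - n))
        for t, n in n_q.items()
    }

def column_weight(table_rows, result_rows, column, beta=1.0):
    """CW(F): the largest term weight found in the column."""
    weights = term_weights(table_rows, result_rows, column, beta)
    return max(weights.values(), default=0.0)

# Hypothetical table: two software stocks and six energy stocks.
stocks = [{"name": "IBM", "sector": "Software"},
          {"name": "ORCL", "sector": "Software"}]
stocks += [{"name": f"E{i}", "sector": "Energy"} for i in range(6)]

# Hypothetical query result: both software stocks and one energy stock.
result = stocks[:3]

weights = term_weights(stocks, result, "sector")
focus = column_weight(stocks, result, "sector")
```

Here "Software" occurs twice in the result and never outside it, so it receives a positive weight, while "Energy" is as common outside the result as the whole remainder of the table and its weight is zero. This matches the intuition that context terms should be popular in Q(R) and rare in R − Q(R).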
3.3 The Algorithm

The discussion so far on how to compute the context of a given query leads to the algorithm in Figure 3. In this section, we describe this algorithm in detail. The procedure QueryContext takes as input the query Q. The first step (line 1) is to remove the projection constraints in Q (ref. Section 3.2). Next, the set S of the initial candidate
foreign-key columns is constructed (line 2) and CW is computed for these columns (lines 3-4). The procedure next enters a loop, wherein in each iteration (lines 6-14) it picks the most "promising" candidate foreign key, joins with the referenced table and computes CW for the foreign-key columns added to AQ as a result, closely following the discussion in Section 3.2.2. The loop exits when no more candidate foreign keys exist, or when M augmentations have already been performed. The terms in AQ's result are then, as discussed in Section 3.1, ranked according to their term weights (TW), and the top N terms, along with the corresponding column names and their overall weights, are returned as the context of the query Q (lines 15-16).

Procedure: QueryContext(Q)
Input: Query Q
Output: Context of the query Q
Begin
 1. Let AQ = Q without the projection constraints
 2. Let S = {A | A is a FK ∈ cols(AQ)}
 3. For each A ∈ S
 4.     Let CW(A) = max t∈A TW(A, t)
 5. Let n = 0
 6. While S ≠ ∅ and n < M
 7.     Let n = n + 1
 8.     Let F = argmax A∈S CW(A)
 9.     Let S = S − {F}
10.     Let R be the table referenced by F
11.     Let AQ = AQ join F R
12.     For each FK A ∈ cols(R)
13.         Let CW(A) = max t∈A TW(A, t)
14.         Let S = S ∪ {A}
15. Let K = {[t, A, w] | t ∈ A, A ∈ cols(AQ), w = TW(A, t)}
16. Return top N elements in K based on w
End
M is the maximum number of augments allowed
N is the maximum number of terms allowed in the context
Fig. 3. Context Computation Algorithm
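The procedure in Figure 3 can be sketched over small in-memory tables as follows. This is a hedged illustration, not the SCORE implementation: the dictionary-based table format, the helper names and the sample tracks/records/bands data are hypothetical, the base query is modelled as a row predicate on a single table, and term weights on augmented columns are computed against the same joins applied to the unfiltered base table, which is an assumption the text does not spell out. The scaling parameter β is taken to be 1.

```python
import math
from collections import Counter

def tw(all_rows, hit_rows, col):
    """Term weights TW(col, t) for the terms of `col` seen in the query result."""
    n_r = Counter(r[col] for r in all_rows if r[col] is not None)  # N_R
    n_q = Counter(r[col] for r in hit_rows if r[col] is not None)  # N_Q
    outside = len(all_rows) - len(hit_rows)                        # |R| - |Q(R)|
    return {t: n * math.log((1 + outside) / (1 + n_r[t] - n)) for t, n in n_q.items()}

def cw(all_rows, hit_rows, col):
    """Column weight CW(col): the largest term weight in the column."""
    return max(tw(all_rows, hit_rows, col).values(), default=0.0)

def augment(rows, fk_col, ref_name, ref):
    """Left outer join of `rows` with table `ref` along foreign key `fk_col`."""
    by_pk = {r[ref["pk"]]: r for r in ref["rows"]}
    cols = list(ref["rows"][0]) if ref["rows"] else []
    joined = []
    for r in rows:
        match = by_pk.get(r[fk_col])
        extra = {f"{ref_name}.{c}": (match[c] if match else None) for c in cols}
        joined.append({**r, **extra})
    return joined

def query_context(db, base, predicate, M=3, N=10):
    table = db[base]
    all_rows = [{f"{base}.{c}": v for c, v in r.items()} for r in table["rows"]]
    hit_rows = [r for r in all_rows if predicate(r)]              # line 1
    fk_target = {f"{base}.{c}": t for c, t in table["fks"].items()}
    S = {col: cw(all_rows, hit_rows, col) for col in fk_target}   # lines 2-4
    n = 0                                                         # line 5
    while S and n < M:                                            # line 6
        n += 1                                                    # line 7
        F = max(S, key=S.get)                                     # line 8
        del S[F]                                                  # line 9
        ref_name = fk_target[F]
        ref = db[ref_name]                                        # line 10
        all_rows = augment(all_rows, F, ref_name, ref)            # line 11
        hit_rows = augment(hit_rows, F, ref_name, ref)
        for c, t in ref["fks"].items():                           # lines 12-14
            col = f"{ref_name}.{c}"
            fk_target[col] = t
            S[col] = cw(all_rows, hit_rows, col)
    cols = list(hit_rows[0]) if hit_rows else []
    K = [(t, col, w) for col in cols                              # line 15
         for t, w in tw(all_rows, hit_rows, col).items()]
    K.sort(key=lambda item: item[2], reverse=True)                # line 16
    return K[:N]

# Hypothetical data: tracks reference records, records reference bands.
db = {
    "tracks": {"pk": "id", "fks": {"record_id": "records"}, "rows": [
        {"id": 100, "record_id": 10}, {"id": 101, "record_id": 10},
        {"id": 102, "record_id": 11}, {"id": 103, "record_id": 12},
        {"id": 104, "record_id": 12}, {"id": 105, "record_id": 12}]},
    "records": {"pk": "id", "fks": {"band_id": "bands"}, "rows": [
        {"id": 10, "band_id": 1}, {"id": 11, "band_id": 1}, {"id": 12, "band_id": 2}]},
    "bands": {"pk": "id", "fks": {}, "rows": [
        {"id": 1, "name": "Alpha"}, {"id": 2, "name": "Beta"}]},
}

# A "band query": all tracks on the two records of band Alpha.
band_query = lambda row: row["tracks.record_id"] in (10, 11)
context = query_context(db, "tracks", band_query, M=2, N=3)
shallow = query_context(db, "tracks", band_query, M=1, N=50)
```

With two augmentations the band name becomes reachable and ranks among the top terms; with only one augmentation it cannot appear at all. The sketch ranks numeric key values alongside text, and it does not guard against the same table being joined twice; a practical version would presumably restrict the context to textual columns and disambiguate repeated joins.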
4 SCORE Implementation

In the previous section, we formalized the context computation process in terms of an algorithm. The challenge in efficiently implementing this algorithm (cf. Figure 3) arises primarily due to the interplay between the term weight computations and the query augmentations in the algorithm. Recall that in every iteration, in order to decide which foreign-key column to follow, the algorithm needs the column weights $CW(F)$ of the foreign-key columns $F \in S$ at that point; however, to compute the same, the algorithm
needs to know all the terms $t \in F$ and their distribution in the query result, so that it can compute $TW(F, t)$. We considered three alternative implementation approaches.

The Brute-Force Approach (BFA) solves this problem by actually executing the augmented query so far before each iteration. Furthermore, after all the augmentations are done, the final query is executed as well and the result is used to compute the context. Advantages: Simplicity; follows the algorithm exactly. Disadvantage: Does not scale.

The Simple Histogram-based Approach (SHBA) refines the Brute-Force approach by exploiting histogram-based estimation techniques [23] to avoid expensive query executions. Only the initial query is actually executed; the term weights and column weights on the augmented columns in each step are estimated using histograms. Since the quality of the context relies on value correlations across columns, one-dimensional histograms are not appropriate for these estimations; we need two-dimensional histograms. Unfortunately, since DB2 does not support two-dimensional histograms, these histograms are computed in a preprocessing step and stored locally. However, note that the augmentations are essentially foreign-key joins. Thus, in order to estimate the distribution of a column added as a result, we only need the two-dimensional histograms for the column against the underlying table's primary key; the extra space needed to store these histograms is thus not significant. Further, note that since primary keys are involved, the histograms necessarily need to bucket the primary key values. Advantages: Efficient, scales well. Disadvantages: Term and column weight estimations using histograms might result in different augmentation choices as compared to the exact brute-force approach. Thus, the returned context may not include some keywords present in the brute-force implementation's context ("false-negatives"). Furthermore, since the bucketing of the primary keys in the histograms makes relevant values in a bucket indistinguishable from the irrelevant values, the context might include several irrelevant keywords ("false-positives").

The Modified Histogram-based Approach (MHBA) further refines the Simple Histogram-based approach to remove the false-positives problem. The idea is to augment the queries based on estimated term and column weights, just like the Simple Histogram-based approach; but once the final augmented query has been formed, it is executed on the database. The context is then computed based on the exact result of this query, eliminating the possibility of false-positives. Advantages: No false-positives. Efficient, scales well. Note that since histograms are only used for augmentation, for a given table, only the two-dimensional histograms for the foreign keys in the table against the table's primary key are needed. Disadvantages: False-negatives, i.e. the returned context may miss some relevant keywords; however, as we demonstrate in Section 5, the difference is marginal on our benchmark.
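The false-positive effect of bucketing primary keys can be made concrete with a small sketch. It is not the histogram structure used by SCORE: the fixed-width bucketing, the uniform-within-bucket estimate, the function names and the sample band rows are all assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict

def build_histogram(rows, pk, column, bucket_width):
    """Two-dimensional histogram: primary-key bucket -> column value -> row count."""
    hist = defaultdict(Counter)
    for row in rows:
        hist[row[pk] // bucket_width][row[column]] += 1
    return hist

def estimate_terms(hist, fk_values, bucket_width):
    """Estimate term counts in the joined column without executing the join.

    Assumes every primary key inside a bucket is equally likely, so each
    foreign-key value spreads its count over all values in its bucket.
    """
    estimate = Counter()
    for fk in fk_values:
        bucket = hist[fk // bucket_width]
        size = sum(bucket.values())
        for value, count in bucket.items():
            estimate[value] += count / size
    return estimate

def exact_terms(rows, pk, column, fk_values):
    """Exact term counts obtained by actually following each foreign key."""
    by_pk = {row[pk]: row[column] for row in rows}
    return Counter(by_pk[fk] for fk in fk_values if fk in by_pk)

# Hypothetical referenced table and the foreign-key values of a query result.
bands = [{"id": 1, "name": "Alpha"}, {"id": 2, "name": "Beta"},
         {"id": 3, "name": "Gamma"}, {"id": 4, "name": "Delta"}]
fk_values = [2, 2, 2]   # every result row references band 2

hist = build_histogram(bands, "id", "name", bucket_width=2)
estimated = estimate_terms(hist, fk_values, bucket_width=2)
exact = exact_terms(bands, "id", "name", fk_values)
```

Bands 2 and 3 share a bucket, so the estimate splits the three references evenly between "Beta" and "Gamma" even though no result row refers to "Gamma". Computing the final context from the exact join result, as MHBA does, removes such spurious terms.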
5 SCORE Experimental Study

In this section, we present a preliminary experimental study to evaluate the techniques and design decisions proposed above.

System Parameters and Configuration. For the purpose of this experimental study, the parameters M, N and β (ref. Section 3) were assigned values 3, 10 and 1.0 respectively. Each histogram consisted of only the top 1000 buckets; the remaining buckets
were assumed to be of equal size. The implementations were done in Java with J2SE v1.4.2 (approx. 3000 lines of code), and executed on a Pentium-M 1.5 GHz IBM Thinkpad T40 with 768 MB RAM running Windows XP SP1. The relational database used was IBM DB2 v8.1.5 co-located on the same machine. The page size was 4KB and the associated total bufferpool size was 250 pages (1000KB). The communication between SCORE and DB2 was through JDBC.

Dataset. The evaluation was performed on a subset of the Open Music Directory data (http://www.musicmoz.org) containing information about $10^6$ music tracks, $10^5$ records, $10^5$ members and $3 \times 10^4$ bands, each stored in a separate table and linked using foreign keys. To make the schema more complex, we introduced a dummy table and added foreign keys to this table from the other tables. The total size of the dataset was 116 MB.
Workloads. The generated workloads consist of selection queries resulting in a specified number NTRACKS of tracks; for each query, the specific NTRACKS tracks are chosen such that they have a common band (called band queries), or a common record (called record queries). These workloads are generated randomly from the data given NTRACKS as a parameter and have an equal mix of band and record queries. Each generated query is also associated with a target context term, that is, the term of maximum relevance to the query among all the terms in the database. Specifically, each band/record query has the band/record's name (as mentioned in the corresponding row in the bands/records table) as its target context term.

Evaluation Metrics. This study uses the following metrics for measuring context quality and computational overheads.

Context Quality Metric. Recall that the target context term $t_Q$ is, by design, the term in the entire database most relevant to the query Q. Thus, we expect $t_Q$ to be the highest ranked keyword in the context returned by SCORE for each query. We choose a quality metric that quantifies this expectation. Given the query Q, let $t_Q$ be its target context term and let $C(Q) = [t_1, t_2, \ldots, t_N]$ be the list of N keywords retrieved by SCORE as its context, ranked in decreasing order of their term weights. We define the Reciprocal Rank (RR) of SCORE for the query Q as:

\[ RR(Q) = \begin{cases} 1/i & \text{if } t_Q = t_i \text{ for some } 1 \le i \le N \\ 0 & \text{otherwise} \end{cases} \]

The Mean Reciprocal Rank (MRR) of SCORE on a query workload S is then defined as

\[ MRR(S) = \frac{1}{|S|} \sum_{Q \in S} RR(Q) \]

i.e. the mean RR of SCORE over the queries $Q \in S$.

We use this MRR as our quality metric in the experiments in this section. MRR is not a novel metric; in fact, it is the standard metric for the quality of Question-Answering systems in IR [16], a closely related problem (ref. Section 2).
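A minimal sketch of the quality metric in Python, assuming the usual reciprocal-rank convention (1/i when the target term is at rank i in the returned context, and 0 when it is absent). The function names and the three-query workload are hypothetical.

```python
def reciprocal_rank(context, target):
    """RR for one query: 1/rank of the target term in the ranked context, else 0."""
    for rank, term in enumerate(context, start=1):
        if term == target:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(workload):
    """MRR over a workload given as (ranked context, target term) pairs."""
    workload = list(workload)
    if not workload:
        return 0.0
    return sum(reciprocal_rank(c, t) for c, t in workload) / len(workload)

# Hypothetical workload: target ranked first, ranked second, and missing.
workload = [
    (["Alpha", "First"], "Alpha"),
    (["First", "Beta"], "Beta"),
    (["First", "Second"], "Gamma"),
]
mrr = mean_reciprocal_rank(workload)
```

For this workload the three reciprocal ranks are 1, 0.5 and 0, so the mean is 0.5. An MRR of 1.00 therefore means that the target term was ranked first for every query, and an MRR of 0.00 that it was never retrieved.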
Computational Overhead Metric. We measure the computational overheads in terms of the additional time spent on a query when using SCORE as compared to when not using SCORE.

EXPERIMENT 1

Purpose. To evaluate the efficacy of the query augmentation algorithm and study how augmenting the queries impacts the context quality and computational overheads for BFA, SHBA and MHBA respectively.

Methodology. We generated a workload with fixed result size NTRACKS = 32, and for increasing number of augmentations M = 0 (no augmentation), 1, 2, 4 and 8 invoked the BFA, SHBA and MHBA implementations for the workload and computed the respective MRR and the overheads.

Result. The quality (MRR) and overhead results are summarized in the following table.

# aug M                0      1      2      4      8
BFA Quality (MRR)      0.00   0.47   0.98   1.00   1.00
BFA Overhead (ms)      5      28     67     171    502
SHBA Quality (MRR)     0.00   0.10   0.15   0.16   0.16
SHBA Overhead (ms)     7      12     30     87     167
MHBA Quality (MRR)     0.00   0.47   0.61   1.00   1.00
MHBA Overhead (ms)     5      20     31     54     92
Discussion. Recall that the target context term for a record query is reached after augmentation with the records table, while that for a band query is reached after augmentation with both the records and the bands tables. The MRR for BFA improves from 0.00 with no augmentation (expected, since the target context term is not present in the TRACKS table) to 0.47 for M = 1 (most record queries find their target context term) and 0.98 for M = 2 (most record and band queries find their target context term). This implies that for almost all queries in the workload, the augmentations occur in the optimal order, validating the choice of TW and CW and attesting to the efficacy of the context computation algorithm. However, BFA’s accuracy comes at the price of a high overhead. Comparing with the results for SHBA, we see that using in-memory histogram-based estimations instead of actually querying the data reduces the overheads drastically (167 ms versus 502 ms at M = 8), but the MRR falls drastically as well (never exceeding 0.16). In contrast, the MRR for MHBA matches that of BFA at M = 1 and for M ≥ 4, lagging only at M = 2 (0.61 versus 0.98), and, surprisingly, is achieved with even less overhead than SHBA for M > 2. This happens because, as the number of augmentations increases, so does the number of augmented columns; eventually, for M > 2, the overhead of estimating the distributions for all the augmented columns exceeds the cost of actually evaluating the final augmented query. Overall, for all the approaches, the overheads increase rapidly with the number of augmentations; the increase is highest for BFA and lowest for MHBA. For MHBA, the overhead remains modest, at 92 ms even for M = 8.
Context Oriented Information Integration
We thus conclude that MHBA is clearly the implementation of choice, achieving a quality comparable to BFA with a very reasonable overhead even in the presence of a large number of augmentations.
EXPERIMENT 2
Purpose. To study how increasing the query size impacts the context quality and computational overheads for BFA, SHBA and MHBA respectively.
Methodology. We generated workloads with result size NTRACKS = 12, 20, 28, 36, 44 and 52 respectively. For each workload, we invoked each SCORE implementation and computed the respective MRRs and overheads. (The number of augmentations was fixed at M = 4.)
Result. The results are summarized in the following table.
NTRACKS  BFA Quality (MRR)  BFA Overhead (ms)  SHBA Quality (MRR)  SHBA Overhead (ms)  MHBA Quality (MRR)  MHBA Overhead (ms)
12       1.00               52                 0.22                86                  1.00                27
20       1.00               99                 0.21                77                  1.00                40
28       1.00               158                0.18                82                  1.00                53
36       1.00               171                0.20                82                  1.00                63
44       1.00               183                0.21                99                  1.00                58
52       1.00               196                0.30                97                  1.00                68
Discussion. The results further corroborate our observations in Experiment 1 about the relative context quality achieved by the respective approaches and their relative overheads. We further observe that the overhead for BFA increases rapidly with increasing query size, from 52 ms at NTRACKS = 12 to 196 ms at NTRACKS = 52. The overhead for MHBA, in sharp contrast, shows a slow and roughly linear increase. It is again evident that MHBA is consistently more efficient than SHBA; the overhead for MHBA even for a query size of 52 rows is 68 ms, which is again very reasonable and roughly one-third of the corresponding overhead for BFA. These observations validate the design choices made in formulating the MHBA approach.
6 EROCS Approach
The second approach to Context Oriented Information Integration is that of EROCS. As mentioned in Section 2, EROCS addresses the problem of linking a document with related structured data in an external relational database. EROCS views the structured data in the relational database as a set of predefined “entities” and identifies the entities from this set that best match the given document. EROCS further finds embeddings of the identified entities in the document; these embeddings are essentially linkages that inter-relate relevant structured data with segments within the given document.
6.1 Framework
In this section, we give details of the models that make up the underlying framework of EROCS.
6.1.1 Entity Model
An entity is a “thing” of significance, either real or conceptual, about which the relational database holds information [24]. An entity template specifies (a) the entities to be matched in the document and (b) for each entity, the context information that can be exploited to perform the match. Formally, an entity template is a rooted tree. Each node in this tree is labeled with a table in the given relational database schema, and there exists an edge in the tree only if the tables labeling the nodes at the two ends of the edge have a foreign-key relationship in the database schema. The table that labels the root node is called the pivot table of the entity template, and the tables that label the other nodes are called the context tables. Each row e in the pivot table is identified as an entity belonging to the template, with the associated context information consisting of the rows in the context tables that have a path to row e in the pivot table through one or more foreign keys covered by the edges in the entity template.
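To make the definition concrete, an entity template can be represented as a small labeled tree. The sketch below is illustrative only: the class and function names are hypothetical, the table names are borrowed from the sales schema of Figure 4, and the exact parent/child layout of the context tables is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TemplateNode:
    """A node of an entity template, labeled with a database table name."""
    table: str
    children: List["TemplateNode"] = field(default_factory=list)

    def add(self, table: str) -> "TemplateNode":
        """Attach a child node; the caller must ensure a foreign key links the tables."""
        child = TemplateNode(table)
        self.children.append(child)
        return child


def context_tables(root: TemplateNode) -> List[str]:
    """Return the labels of all non-root nodes in preorder (the context tables)."""
    labels: List[str] = []
    stack = list(reversed(root.children))
    while stack:
        node = stack.pop()
        labels.append(node.table)
        stack.extend(reversed(node.children))
    return labels


# Assumed layout for the sales transaction template: the root is the pivot table.
pivot = TemplateNode("TRANSACTION")
pivot.add("CUSTOMER")
pivot.add("STORE")
transprod = pivot.add("TRANSPROD")
product = transprod.add("PRODUCT")
product.add("MANUFACTURER")
```

Because labels live on nodes rather than in a set, the same table can label several nodes, which is what allows one table to play more than one role in a template.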
Fig. 4. Example of a Relational Database Schema and an associated Entity Template
For instance, the sales transactions entity template, shown in Figure 4, has the root node labeled by the TRANSACTION table (the pivot table), and the non-root nodes labeled by the CUSTOMER, STORE, TRANSPROD, PRODUCT and MANUFACTURER tables (the context tables) that provide the context for each transaction in the TRANSACTION table. Note that the template definition also encodes the fact that the SUPPLIER table, though reachable from the TRANSACTION table via both the PRODUCT and STORE tables, carries no contextual information about a given transaction; this is valuable domain knowledge that is hard to infer automatically. Multiple nodes in the template can be labeled with the same table. This is needed to differentiate the different roles a table might play in the context of the entity. Suppose the document mentions product names not only to identify a transaction, but also
to identify the store in which the transaction occurred; further suppose it mentions the manufacturer in the former case, but not in the latter. Then, the template in Figure 4 would extend the TRANSACTION→STORE path to TRANSACTION→STORE→INVENTORY→PRODUCT. Now there exist two nodes in the template labeled with the same table PRODUCT, representing the two roles the table plays; one has a child labeled with the table MANUFACTURER, while the other does not.
Currently, we assume that the entity templates are specified by a domain expert; this is a one-time, low-overhead activity that can be part of the initial customization and configuration. For instance, reverse engineering of existing relational databases to derive the entity-relationship model is a much-studied problem [20], and several commercial data modeling tools support this feature (e.g. Microsoft Office Visio [17]); specification of entity templates can be readily integrated with this reverse engineering effort. The entity templates are closely related to view objects defined earlier by Barsalou et al. [18], who also propose information-theoretic techniques to automatically identify what tables to include in the view object for a given pivot table [19]. We plan to explore similar techniques to automate the specification of entity templates as future work.
Further discussion in this paper assumes that only a single entity template is defined. This is only for ease of exposition; the techniques can be readily generalized to a collection of entity templates.
6.1.2 Document Model
EROCS views a document as a sequence of sentences, where a sentence is a bag of terms. Some terms in a sentence are potentially useful since they occur in the database as well, and thus may occur in the context of a candidate entity; other terms are not useful and are filtered out.
EROCS uses a part-of-speech parser to identify noun phrases in a sentence, and filters out the rest; the assumption, which usually holds, is that only nouns appear as values in the database. Further, each noun thus identified is looked up in the database and annotated with the database columns it occurs in. This filtering and annotation pre-processing reduces the amount of work to be performed in the matching step. It could be further enhanced by (a) incorporating NER techniques [25] that can identify potential matches with the database terms without the database lookups currently needed, and (b) incorporating semantic integration in text [17] to match the terms in the document against each other and identify whether they belong to the same entity; this dependency information can potentially reduce the number of database queries needed in the current implementation.
6.1.3 Entity-Document Matching Model
EROCS defines the weight of a term t as:
$$w(t) = \begin{cases} \log\left(\dfrac{N}{n(t)}\right) & \text{if } n(t) > 0 \\ 0 & \text{otherwise} \end{cases}$$
where N is the total number of distinct entities in the relational database, and n(t) is the number of distinct entities that contain t in their context.
A segment is a sequence of one or more consecutive sentences in the document. We now discuss how a segment d is scored with respect to an entity e. Let T(d) denote the set of terms that appear in the segment d, and let T(e, d) ⊆ T(d) denote the set of such terms that appear in the context of e as well. Then, the score of the entity e with respect to the segment d is defined as:
$$score(e, d) = \sum_{t \in T(e, d)} tf(t, d) \cdot w(t)$$
where tf(t, d) is the number of times the term t appears in the segment d, and w(t) is the weight of the term t as defined above. This definition of score(e, d) is in the spirit of the “tf-idf” scores commonly used in the Information Retrieval literature [26].
6.2 Identifying Best Matching Entities and Their Embeddings
In this section, we first formulate the entity identification and embedding problem and then provide an efficient algorithm for solving it.
6.2.1 Problem Formulation
We are given as input (a) a document D, (b) a relational database, and (c) an entity template that interprets the database as a set of entities E. In this paper, we make the assumption that each sentence in the document D relates to at most one entity; this is a stylistic assumption about the document, and seems to be reasonable in practice. Given this assumption, we can formally model an annotation for D as follows: an annotation for the document D is a pair (S, F), where S is a set of non-overlapping segments of D and F is a mapping that maps each segment d ∈ S to an entity F(d) ∈ E. EROCS defines the score of an annotation (S, F) as:
$$score(S, F) = \sum_{d \in S} \left( score(F(d), d) - \lambda \right)$$
where the entity-document matching score is as defined in Section 6.1.3 and λ ≥ 0 is a tunable parameter; a justification for this parameter will appear in a moment. The problem being addressed can now be stated as follows.
Problem Statement. Find an annotation with the maximum score among all annotations of the document D.
Before moving on to the solution of this problem in the next section, let us reflect upon the utility of the parameter λ in the above formulation. For a given annotation (S, F), we can think of score(F(d), d) as the support for a given segment d ∈ S. Introducing λ in the annotation scoring function guarantees that in the annotation with the maximum score, no segment will have a support less than λ. In Section 7, we show that setting λ > 0 eliminates a significant number of irrelevant annotations, leading to improved accuracy.
The naive algorithm to solve the proposed problem is to enumerate all annotations and pick the one with the maximum score. This is clearly impractical since the number of possible annotations is $|E|^{|D|}$, where |E| is the number of entities and |D| is the number of sentences in the document. EROCS solves this problem by first
effectively pruning the search space, and then searching the reduced space using an efficient algorithm; the next two sections give the details of the solution.
6.2.2 Pruning the Search Space
In this section, we present some properties of the best annotation of a given document D. These properties help in pruning the search space, which will be useful in developing an efficient algorithm for finding the best annotation in the next section. We call an annotation (S, F) canonical iff (a) S is a partition of D, and (b) F maps each d ∈ S to its best matching entity. Now, consider an annotation (S, F) that is not canonical. If S does not cover some sentences of D, then we can transform the annotation by adding these non-covered sentences to adjacent segments in S. Further, if S contains a segment d such that the associated entity F(d) is not its best matching entity, then we can transform the annotation by taking F(d) to be d’s best matching entity instead. Notice that neither of these transformations can decrease the annotation’s score. Thus, any annotation can be converted into a canonical annotation without decreasing its score. It follows that there exists a canonical annotation that achieves the maximum score. We formalize the discussion in the following claim.
Claim 1. For any document D, there exists a canonical annotation (S, F) such that (S, F) is an optimal annotation for D.
We can thus restrict the search space to only canonical annotations without any loss of generality. The problem being addressed can now be revised as follows.
Revised Problem Statement. Find a canonical annotation with the maximum score among all canonical annotations of the document D.
In the next section, we present an efficient algorithm to solve this problem.
6.2.3 Best Annotation Computation
In this section, we present an efficient dynamic programming algorithm to find the best canonical annotation for a given document. As argued in the previous section, the restriction to canonical annotations does not result in any loss of generality, and the best canonical annotation computed is guaranteed to be the best annotation overall. Accordingly, we consider only canonical annotations in the rest of this paper. Moreover, since a canonical annotation (S, F) is completely specified by its segment set S, we shall occasionally identify a canonical annotation (S, F) by S alone, and refer to F as the canonical mapping for S.
For 1 ≤ i ≤ j ≤ |D|, let $D_{i,j}$ denote the segment in D that starts at the ith sentence and ends at the jth sentence (both inclusive). The segments $D_{1,1}, D_{1,2}, \ldots, D_{1,|D|}$, where $D_{1,|D|}$ is the document D itself, are termed the prefixes of the document D. Now, let $S_k$ be the best annotation for the prefix $D_{1,k}$ and let $r_k$ be its score. Further, let $e_{i,j}$ be the best matching entity for the segment $D_{i,j}$ and let $s_{i,j}$ be its score. The following claim gives a recurrence relation for $r_k$ in terms of $r_j$, for j ≤ k − 1; as the base case for the recurrence, we define $r_0$ to be zero.
Claim 2. For each 1 ≤ k ≤ |D|, the score $r_k$ can be recursively expressed as $r_k = \max_{0 \le j \le k-1} (r_j + s_{j+1,k} - \lambda)$.
Proof. The claim is easily proved via induction on k. We make use of the facts that (a) $r_k$ is the maximum annotation score possible for $D_{1,k}$, and (b) for any 1 ≤ k ≤ |D|, there must exist a 1 ≤ j ≤ k such that $D_{j,k}$ appears in the best annotation of the prefix $D_{1,k}$.
Procedure: BestAnnot(D)
Input: Document D
Output: (best annotation, score)
Begin
A01 For i = 1 to |D|
A02   For j = i to |D|
A03     Let $e_{i,j} = \arg\max_{e \in E} score(e, D_{i,j})$
A04     Let $s_{i,j} = score(e_{i,j}, D_{i,j})$
A05 Let $S_0 = \emptyset$
A06 Let $r_0 = 0$
A07 For k = 1 to |D|
A08   Let $j = \arg\max_{0 \le j \le k-1} (r_j + s_{j+1,k} - \lambda)$
A09   Let $S_k = S_j \cup \{D_{j+1,k}\}$
A10   Let $r_k = r_j + s_{j+1,k} - \lambda$
A11 For each $d \in S_{|D|}$
A12   Let $F_{|D|}(d) = \arg\max_{e \in E} score(e, d)$
A13 Return $((S_{|D|}, F_{|D|}), r_{|D|})$
End
Fig. 5. Best Annotation Computation Algorithm
This recurrence relation forms the basis of the dynamic programming algorithm presented in Figure 5. The algorithm first computes the best matching entities for all the segments in the given document (Lines A01-A04). Then, in accordance with the recurrence relation of Claim 2, it iteratively computes the best annotation for increasingly larger prefixes of the document, making use of these best matching entities and the best annotations of strictly smaller prefixes computed in previous iterations (Lines A05-A10). The algorithm then constructs the canonical mapping $F_{|D|}$ for the computed best annotation $S_{|D|}$ (Lines A11-A12) and returns the pair along with its score $r_{|D|}$ (Line A13).
Discussion. The time complexity of the proposed algorithm is quadratic in the number of sentences in the document; this can be reduced to linear if we limit the size of the segments considered to at most L sentences. However, this efficient algorithm is not enough to make the solution scalable. Finding the entity in E that best matches a given segment (Line A03) involves a search (rather than a simple lookup) on the database; this is an expensive operation for nontrivial database sizes, and performing it for every segment in the document is clearly a performance bottleneck. In the next section, we describe how EROCS resolves this critical scalability issue.
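The procedure of Figure 5 can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the EROCS implementation: entity contexts are held in an in-memory dictionary (`contexts`, hypothetical) instead of a relational database, term weights are supplied directly in `w`, sentences are lists of terms, and indices are 0-based, so the paper's segment $D_{j+1,k}$ corresponds to `doc[j:k]`.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def score(entity_terms: Set[str], segment: List[List[str]], w: Dict[str, float]) -> float:
    """Sum tf(t, d) * w(t) over the segment terms that occur in the entity's context."""
    tf = Counter(t for sentence in segment for t in sentence)
    return sum(n * w.get(t, 0.0) for t, n in tf.items() if t in entity_terms)


def best_annot(doc: List[List[str]], contexts: Dict[str, Set[str]],
               w: Dict[str, float], lam: float):
    """Return ([((first, last), entity), ...], score) for the best canonical annotation."""
    n = len(doc)
    # Lines A01-A04: best entity and its score for every segment doc[i..j].
    best: Dict[Tuple[int, int], Tuple[str, float]] = {}
    for i in range(n):
        for j in range(i, n):
            seg = doc[i:j + 1]
            e = max(contexts, key=lambda c: score(contexts[c], seg, w))
            best[(i, j)] = (e, score(contexts[e], seg, w))
    # Lines A05-A10: r[k] is the best score for the prefix of k sentences.
    r = [0.0] * (n + 1)
    back = [0] * (n + 1)  # back[k] = start of the last segment in the best prefix annotation
    for k in range(1, n + 1):
        j = max(range(k), key=lambda j: r[j] + best[(j, k - 1)][1] - lam)
        r[k] = r[j] + best[(j, k - 1)][1] - lam
        back[k] = j
    # Lines A11-A13: recover the segments and their best matching entities.
    annotation = []
    k = n
    while k > 0:
        j = back[k]
        annotation.append(((j, k - 1), best[(j, k - 1)][0]))
        k = j
    annotation.reverse()
    return annotation, r[n]


# Hypothetical three-sentence document and two entities.
doc = [["alice", "laptop"], ["alice"], ["bob", "phone"]]
contexts = {"t1": {"alice", "laptop"}, "t2": {"bob", "phone"}}
w = {"alice": 1.0, "laptop": 2.0, "bob": 1.0, "phone": 2.0}
result = best_annot(doc, contexts, w, lam=0.5)
```

In this sketch the brute-force `max` over `contexts` stands in for the database search of Line A03, which is exactly the step the remainder of the section sets out to make cheaper.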
6.3 The Context Cache
As discussed in the previous section, finding the entity with the best score for a given document segment is an expensive operation. This operation needs to be performed for every segment in the document; if done naively, it is likely to be a severe performance bottleneck. Since a document is not likely to have repeated segments, caching the result of the operation is not effective. Moreover, the large number of candidate entities, and the size of the context information associated with each entity, make it impractical to materialize and index the entire context of each candidate entity a priori; this materialization would involve expensive joins across multiple tables, and thus would have high computation, maintenance and storage overheads. In this section and the next, we show how EROCS resolves this critical issue.
EROCS uses an entity-term association cache, formally referred to as the context cache, to reduce the database access overhead. This cache can be visualized as a collection of relationships of the form (e, t), meaning that the term t is contained in the context of the entity e. The cache is indexed on both entities and terms. In Section 6.3.1, we describe how this cache is populated. A naive, conventional use of the cache would be to eliminate the overheads of repeated database accesses. The crux of EROCS’s algorithms lies in a more sophisticated use of the contents of this cache to reduce the number of database accesses. At the core of these optimizations, which will be discussed in the next section, lie techniques that exploit the contents of the cache to bound the entity-segment matching scores as well as the annotation scores. These techniques are formally described in Section 6.3.2.
6.3.1 Context Cache Population
The algorithms in this paper access the relational database via the following two operations.
- GetEntities(t): Given a term t appearing in D, this operation queries the database and returns the set of all entities that contain the term t in their context.
- GetTerms(e): Given an entity e, this operation queries the database and returns the set of all terms from D that are contained in the context of e.
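A cache of (e, t) pairs indexed on both entities and terms, fronting these two operations, might look as follows. This is a hedged sketch: the class name, attribute names and the `query_db` callables are hypothetical stand-ins for the real database operations, which the paper implements with SQL and a text index.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


class ContextCache:
    """Stores (entity, term) pairs, indexed on both entities and terms."""

    def __init__(self) -> None:
        self.terms_of: Dict[str, Set[str]] = defaultdict(set)     # entity -> terms
        self.entities_of: Dict[str, Set[str]] = defaultdict(set)  # term -> entities
        self.probed_terms: Set[str] = set()       # terms on which GetEntities was invoked
        self.expanded_entities: Set[str] = set()  # entities on which GetTerms was invoked

    def add(self, entity: str, term: str) -> None:
        """Record that `term` occurs in the context of `entity`."""
        self.terms_of[entity].add(term)
        self.entities_of[term].add(entity)

    def get_entities(self, term: str, query_db: Callable[[str], Iterable[str]]) -> Set[str]:
        """GetEntities(t): query the database only on the first request for `term`."""
        if term not in self.probed_terms:
            for entity in query_db(term):
                self.add(entity, term)
            self.probed_terms.add(term)
        return set(self.entities_of.get(term, ()))

    def get_terms(self, entity: str, query_db: Callable[[str], Iterable[str]]) -> Set[str]:
        """GetTerms(e): query the database only on the first request for `entity`."""
        if entity not in self.expanded_entities:
            for term in query_db(entity):
                self.add(entity, term)
            self.expanded_entities.add(entity)
        return set(self.terms_of.get(entity, ()))
```

Tracking which terms and entities have already been sent to the database is what later allows partial cache contents to be turned into score bounds, since the cache can tell a term that matches no entity apart from a term that has not been looked up yet.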
GetEntities involves (a) identifying the rows containing the term t across all tables labeling the nodes in the entity template, and (b) identifying the rows in the pivot table that have a join path (along the edges in the entity template) to any of the identified rows. Step (a) is performed using a text index over the tables in the database, while step (b) involves a union of multiple join queries, one for each node whose labeling table contains a row that contains the term t. Our current implementation exploits DB2 Net Search Extender [27] for combined execution of both steps in a single query. Computing the context of an entity in GetTerms, on the other hand, involves a join query based on the entity template. However, in the presence of nested substructure, it is sometimes more efficient to retrieve the context using an outer-union query; such issues are well known in the XML literature [25]. Clearly, both these operations are expensive, and caching their results makes sense. The result of GetEntities(t) for a term t is cached by inserting the pair (e, t) for each
entity e returned in the result. Similarly, the result of GetTerms(e) for an entity e is cached by inserting the pair (e, t) for each term t returned in the result.
6.3.2 Context Cache-Based Score Bounds
In this section, we show how the contents of the cache at any given point can be used to obtain tight upper and lower bounds on the entity-segment matching scores as well as the annotation scores. Consider a document segment d and let T(d) be the set of terms in d. Further, let $T_C(d) \subseteq T(d)$ denote the set of terms in d for which GetEntities has been invoked so far. Now, consider an entity e ∈ E and, as in Section 6.1.3, let T(e, d) ⊆ T(d) denote the set of terms in the segment that appear in the context of e. To compute score(e, d) exactly, we need to know T(e, d). There are two cases, based on whether or not GetTerms has been invoked on e earlier.
If GetTerms has not been invoked on e earlier, then our knowledge of the context of e is limited to the set of terms which were found to be in the context of the entity e by virtue of having invoked GetEntities on them in the past; this set, $T(e, d) \cap T_C(d)$, can be obtained using an index lookup on the cache using e. We use this information to get a lower bound on score(e, d) as:
$$score^-_C(e, d) = \sum_{t \in T(e, d) \cap T_C(d)} tf(t, d) \cdot w(t)$$
Further, defining $W_C(d) = \sum_{t \in T(d) - T_C(d)} tf(t, d) \cdot w(t)$ and using the fact that $T(e, d) - T_C(d) \subseteq T(d) - T_C(d)$, we also have the following upper bound on score(e, d):
$$score^+_C(e, d) = score^-_C(e, d) + W_C(d)$$
On the other hand, if GetTerms has been invoked on e earlier, then T(e, d) (which is the set obtained as a result of that invocation) is available in the cache; in this case, we can compute score(e, d) exactly and thus have:
$$score^-_C(e, d) = score^+_C(e, d) = score(e, d) = \sum_{t \in T(e, d)} tf(t, d) \cdot w(t)$$
The bounds on score(e, d) for entity e ∈ E and segment d in document D derived above can be used to derive a lower bound $score^-_C(S, F)$ and an upper bound $score^+_C(S, F)$ for a given annotation (S, F) of D. These bounds follow trivially from the definition of score(S, F) in Section 6.2.1 and are as follows.
$$score^-_C(S, F) = \sum_{d \in S} \left( score^-_C(F(d), d) - \lambda \right)$$
$$score^+_C(S, F) = \sum_{d \in S} \left( score^+_C(F(d), d) - \lambda \right)$$
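The entity-segment bounds of this section can be computed directly from the cache contents. The following sketch is illustrative only; the function name and the plain dictionaries and sets standing in for the context cache (`cache_terms_of`, `probed_terms`, `expanded_entities`) are assumptions, not part of EROCS.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def score_bounds(entity: str, segment_terms: List[str], w: Dict[str, float],
                 cache_terms_of: Dict[str, Set[str]], probed_terms: Set[str],
                 expanded_entities: Set[str]) -> Tuple[float, float]:
    """Return (lower, upper) bounds on score(entity, segment) from cache contents.

    cache_terms_of:    entity -> terms known (from the cache) to be in its context
    probed_terms:      terms on which GetEntities has been invoked, i.e. T_C
    expanded_entities: entities on which GetTerms has been invoked
    """
    tf = Counter(segment_terms)
    known = cache_terms_of.get(entity, set())
    # Lower bound: weight of the segment terms already known to be in the context.
    lower = sum(n * w.get(t, 0.0) for t, n in tf.items() if t in known)
    if entity in expanded_entities:
        return lower, lower  # full context is cached, so the score is exact
    # W_C(d): weight of the segment terms that have not been probed yet.
    w_c = sum(n * w.get(t, 0.0) for t, n in tf.items() if t not in probed_terms)
    return lower, lower + w_c


# Hypothetical cache state: only "alice" has been probed, and it matched e1.
w = {"alice": 1.0, "laptop": 2.0, "visa": 3.0}
segment = ["alice", "laptop", "laptop", "visa"]
before = score_bounds("e1", segment, w, {"e1": {"alice"}}, {"alice"}, set())
# After GetTerms(e1) reveals that the context of e1 is {"alice", "visa"}.
after = score_bounds("e1", segment, w, {"e1": {"alice", "visa"}}, {"alice"}, {"e1"})
```

The annotation-level bounds are then obtained by summing these per-segment bounds and subtracting λ once per segment.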
In the next section, we show how these bounds can be used to reduce the number of database access operations that need to be invoked in the course of best annotation computation.
6.4 Context Cache-Based Best Annotation Computation
In this section, we present algorithms to compute the best annotation for a given document D that make effective use of the context cache contents to reduce database access operations. The algorithm AllTerms, presented in Section 6.4.1, uses the cache in a conventional manner to eliminate the overheads of repeated database accesses; this algorithm forms our baseline. Next, in Section 6.4.2, we present the algorithm AllSegments, which reduces the number of GetEntities invocations while computing the best entity for a segment. Finally, in Section 6.4.3, we present the algorithm that additionally uses a greedy cache refinement strategy to rapidly converge to the best annotation; this algorithm is actually used in EROCS and is therefore called the EROCS algorithm.
6.4.1 Eliminating Repeated Database Access
The most straightforward use of this cache is to eliminate repeated invocations of GetEntities(t) or GetTerms(e) for the same term t or the same entity e respectively. This suggests an algorithm that first invokes GetEntities(t) for each term t in the document, and populates the cache with the pairs (e, t) for each e in the result. After this cache population step, the index on entities is used to determine, for each entity in the cache, the terms in the document that belong to that entity. Using this information, the algorithm readily computes the best matching entity for each document segment; once this is done, BestAnnot is invoked to compute the best annotation for the document. We call this algorithm AllTerms. Despite eliminating repeated invocations, AllTerms does not scale well (cf. Section 7). The reason is that there exist several terms in the document that appear in the context of a large number of entities.
Invoking GetEntities on such terms is excessively expensive. Moreover, being low-weight, these terms do not contribute much to the score of any entity if they appear only a few times in the document. In the next section, we develop an optimization that can potentially avoid invocations of GetEntities on such terms.
6.4.2 Reducing GetEntities Invocations
Consider any algorithm for finding the best matching entity for the given segment d. Starting with no prior information about the best entity, the search space for this algorithm is the entire entity set E. The purpose of invoking GetEntities for terms in d is to build up a set of entities, hopefully much smaller in size than E, that would form a reduced search space in the algorithm’s quest for the best matching entity; for the algorithm to be correct, however, it needs to ensure that this reduced search space contains the entity being sought. The algorithm AllTerms follows a naive, conservative approach and invokes GetEntities for all terms in d; it thus builds up the set of all entities that are potentially relevant to d, and the best matching entity is clearly guaranteed to be in this set. Next, we present an optimization that can be used to build the search space in a
manner that does not require GetEntities to be invoked on all terms in d, but at the same time ensures that the best matching entity is included in this search space.
6.4.2.1 Term Pruning Strategy. Let $E_C(d)$ be the set of entities such that each $e \in E_C(d)$ has appeared in the result of at least one GetEntities invocation made so far for terms in the segment d. We call the cache complete with respect to the segment d iff the best matching entity for d is guaranteed to be present in $E_C(d)$. The idea is to populate the cache by invoking GetEntities on the terms t ∈ T(d) in decreasing order of tf(t, d).w(t), stopping as soon as the cache becomes complete with respect to d. The challenge in implementing this optimization lies in efficiently checking for completeness of the cache at any point. Next, we present an efficient criterion for the same.
Consider any $e \in E - E_C(d)$. By definition, we have $T(e, d) \cap T_C(d) = \emptyset$, which implies $score^-_C(e, d) = 0$ and, thus, $score^+_C(e, d) = W_C(d)$ based on the bounding analysis of Section 6.3.2. Now, if there exists an entity $e' \in E_C(d)$ such that $score^-_C(e', d) > W_C(d)$, then $score(e, d) \le score^+_C(e, d) = W_C(d) < score^-_C(e', d) \le score(e', d)$, and therefore e cannot be the best matching entity for d. We have thus proved the following:
Claim 3. The context cache is complete with respect to the segment d if there exists an entity $e' \in E_C(d)$ such that $score^-_C(e', d) > W_C(d)$.
Since both $E_C(d)$ and $W_C(d)$ can be progressively maintained as the cache is being populated, this criterion can be checked efficiently.
6.4.2.2 The AllSegments Algorithm. We now present an algorithm, called AllSegments, that applies the optimization discussed above while computing the best matching entities for each segment in the document. For each segment d in the document, the algorithm successively invokes GetEntities on the terms t ∈ T(d) in decreasing order of tf(t, d).w(t). The algorithm progressively maintains the entities in $E_C(d)$ in a heap, max-ordered on $score^-_C(e, d)$, and also maintains $W_C(d)$ during the course of these invocations. As soon as the top entity in the heap has a $score^-_C$ value greater than $W_C(d)$, the cache is flagged as complete with respect to d. When this happens, the algorithm stops invoking GetEntities on any more terms, and moves to the task of finding the best matching entity from the set $E_C(d)$.
At this point, we already have the entities $e \in E_C(d)$ in a heap, max-ordered on $score^-_C(e, d)$. The algorithm proceeds by removing the topmost entity e from this heap, invoking GetTerms(e), and using the result to find its exact score score(e, d). The algorithm repeats this process, maintaining the entity e that has the maximum score so far, and stops when the topmost entity e’ is such that $score^+_C(e', d) < score(e, d)$. It is easy to see that, at this point, e is the best matching entity for d.
The algorithm starts with an empty cache, but carries the cache over while computing the best matching entity across different segments. When the best matching entities for every segment in the document have been computed, the algorithm invokes BestAnnot to compute the best annotation for the document.
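For a single segment, the two phases described above (term pruning, then exact scoring of the most promising entities) can be sketched as follows. This is a simplified illustration with assumptions: `get_entities` and `get_terms` are hypothetical callables standing in for the database operations, the cache is not shared across segments, and the first phase recomputes the maximum lower bound with `max` instead of maintaining a heap incrementally.

```python
import heapq
from collections import Counter
from typing import Callable, Dict, Iterable, List


def best_entity_for_segment(segment_terms: List[str], w: Dict[str, float],
                            get_entities: Callable[[str], Iterable[str]],
                            get_terms: Callable[[str], Iterable[str]]):
    """Return (best entity, exact score, number of GetEntities invocations)."""
    tf = Counter(segment_terms)
    contrib = {t: n * w.get(t, 0.0) for t, n in tf.items()}  # tf(t, d) * w(t)
    order = sorted(contrib, key=contrib.get, reverse=True)
    lower: Dict[str, float] = {}   # lower-bound score for each entity seen so far
    probed = 0                     # number of GetEntities invocations
    w_c = sum(contrib.values())    # W_C(d): weight of the terms not yet probed
    # Phase 1: probe terms by decreasing weight until the cache is complete (Claim 3).
    while probed < len(order) and not (lower and max(lower.values()) > w_c):
        t = order[probed]
        for e in get_entities(t):
            lower[e] = lower.get(e, 0.0) + contrib[t]
        probed += 1
        w_c = sum(contrib[u] for u in order[probed:])
    # Phase 2: expand entities by decreasing lower bound until none can beat the best.
    heap = [(-lo, e) for e, lo in lower.items()]
    heapq.heapify(heap)
    best, best_score = None, float("-inf")
    while heap and -heap[0][0] + w_c >= best_score:  # upper bound of the top entity
        _, e = heapq.heappop(heap)
        exact = sum(contrib[t] for t in get_terms(e) if t in contrib)
        if exact > best_score:
            best, best_score = e, exact
    return best, best_score, probed


# Hypothetical in-memory "database" of entity contexts.
contexts = {"e1": {"visa", "alice"}, "e2": {"alice"}, "e3": {"laptop"}}
w = {"visa": 8.0, "laptop": 2.0, "alice": 1.0}
found = best_entity_for_segment(
    ["visa", "laptop", "alice"], w,
    get_entities=lambda t: {e for e, ts in contexts.items() if t in ts},
    get_terms=lambda e: contexts[e],
)
```

In this example the single high-weight term is enough to make the cache complete, so the low-weight terms, which are typically the ones matching many entities, are never sent to the database.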
6.4.3 Greedy Iterative Cache Refinement
In this section, we present the EROCS algorithm that uses a greedy iterative cache refinement strategy to converge to the best annotation for the given document without computing the best entities for all segments. The greedy iterative strategy for computing the best annotations is presented in Section 6.4.3.1. A cache refinement heuristic that attempts to refine the cache in a way that this greedy iterative strategy converges to the best annotation rapidly is presented in Section 6.4.3.2.
6.4.3.1 Greedy Iterative Strategy. Let us revisit the procedure BestAnnot outlined in Figure 5 that computes the best annotation (S*, F*) of a given document D. We modify the procedure BestAnnot so that Lines A03, A04 and A12 invoke the score upper-bound function $score^+_C(e, d)$ instead of the exact score(e, d). Let us call this modified procedure BestAnnotC.
Procedure: BestAnnotErocs(D)
Input: Document D
Output: (best annotation, score)
Begin
B01 Initialize the context cache as empty
B02 Let $((\bar{S}, \bar{F}), s)$ = BestAnnotC(D)
B03 While $score^-_C(\bar{S}, \bar{F}) < score^+_C(\bar{S}, \bar{F})$
B04   Call UpdateCache$(\bar{S}, \bar{F})$
B05   Let $((\bar{S}, \bar{F}), s)$ = BestAnnotC(D)
B06 Return $((\bar{S}, \bar{F}), s)$
End
Fig. 6. The EROCS Algorithm
Let $(\bar{S}, \bar{F})$ be the annotation returned by BestAnnotC. We make the following claim.
Claim 4. $score^-_C(\bar{S}, \bar{F}) \le score(S^*, F^*) \le score^+_C(\bar{S}, \bar{F})$
Proof. The first inequality follows from $score^-_C(\bar{S}, \bar{F}) \le score(\bar{S}, \bar{F})$ and $score(\bar{S}, \bar{F}) \le score(S^*, F^*)$. Further, since BestAnnotC overestimates the score of every segment in the document, the best score it computes must be an overestimate of the actual best score; this gives us the second inequality.
This result suggests a greedy strategy that iteratively improves the cache contents so that the slack between the upper and lower bound scores of the successive best annotations $(\bar{S}, \bar{F})$ computed by BestAnnotC decreases with every iteration. The resulting procedure, called BestAnnotErocs, appears in Figure 6. Starting with an empty cache, it repeatedly calls BestAnnotC, which computes a best matching annotation $(\bar{S}, \bar{F})$ based on the current cache, and then calls the subroutine UpdateCache (discussed later in Section 6.4.3.2), which updates the cache using $(\bar{S}, \bar{F})$. The procedure terminates whenever we find that the best annotation returned by
BestAnnotC has $\mathrm{score}^{+}_{C}(\bar{S}, \bar{F}) = \mathrm{score}^{-}_{C}(\bar{S}, \bar{F})$; at this point, by Claim 4, we know that $(\bar{S}, \bar{F})$ is the best annotation. Since $\mathrm{score}^{-}_{C}(e, d)$ and $\mathrm{score}^{+}_{C}(e, d)$ are computed based on the contents of the context-cache, each invocation of BestAnnotC can be executed efficiently. In fact, since $\mathrm{score}^{+}_{C}(e, d)$ for most segments d and entities e remains the same across successive invocations of BestAnnotC, EROCS actually uses lazy, incremental techniques in BestAnnotC to compute the successive best annotations efficiently.

6.4.3.2 Cache Refinement. For a given annotation (S, F), let us define $\mathrm{slack}_C(S, F) = \mathrm{score}^{+}_{C}(S, F) - \mathrm{score}^{-}_{C}(S, F)$. Let $(\bar{S}_1, \bar{F}_1)$ and $(\bar{S}_2, \bar{F}_2)$ be the best annotations returned by BestAnnotC on two successive invocations in the course of the algorithm. The goal of the cache refinement strategy, to be developed in this section, is to choose a cache update at the intervening cache refinement step such that the decrease in slack, $\mathrm{slack}_C(\bar{S}_1, \bar{F}_1) - \mathrm{slack}_C(\bar{S}_2, \bar{F}_2)$, is maximized.

To start with, let us assume that the only cache update operation allowed is GetEntities. Then, the task reduces to finding the term $t^* \in T(D) - T_C(D)$ for which GetEntities should be invoked next; here $T(D)$ is the set of terms in the document D, and $T_C(D)$ is the set of terms in D for which GetEntities has already been invoked. Now, given our constraint, the state of the cache at each iteration would be such that it has been populated using only invocations of GetEntities. It is easy to show that in such a state, for all annotations (S, F) of the document D, we have

$$\mathrm{slack}_C(S, F) = W_C(D) = \sum_{t \in T(D) - T_C(D)} \mathrm{tf}(t, D) \cdot w(t)$$

(cf. Section 6.3.2).
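As a rough numerical illustration of this formula, the sketch below computes $W_C(D)$ over the terms not yet passed to GetEntities and reports how much the slack would drop for each candidate term. The term frequencies, weights, and the helper name `remaining_weight` are invented for the example and are not taken from EROCS.

```python
# Toy illustration of slack_C(S, F) = W_C(D) under the GetEntities-only assumption.
# All numbers and helper names are hypothetical.

def remaining_weight(tf, w, invoked):
    """W_C(D): sum of tf(t, D) * w(t) over terms t in T(D) - T_C(D)."""
    return sum(tf[t] * w[t] for t in tf if t not in invoked)

# tf(t, D): term frequencies in document D; w(t): term weights (both made up).
tf = {"corleone": 3, "godfather": 2, "movie": 5}
w = {"corleone": 5.0, "godfather": 4.0, "movie": 0.5}

invoked = set()                                  # T_C(D): nothing queried yet
slack_before = remaining_weight(tf, w, invoked)  # 15.0 + 8.0 + 2.5

# Drop in slack if GetEntities were invoked on each candidate term in turn.
drops = {t: slack_before - remaining_weight(tf, w, invoked | {t}) for t in tf}

print(slack_before)
print(drops)
```

In this toy setting the drop for each term equals its own tf(t, D) · w(t) contribution, independent of the annotation.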
Clearly, the difference in slack between $(\bar{S}_1, \bar{F}_1)$ and $(\bar{S}_2, \bar{F}_2)$ is then always equal to $\mathrm{tf}(t^*, D) \cdot w(t^*)$, where $t^*$ is the term picked by our cache refinement strategy. This gives us an optimal cache refinement strategy: at each cache refinement step, invoke GetEntities on the term $t^* \in T(D) - T_C(D)$ with the maximum $\mathrm{tf}(t^*, D) \cdot w(t^*)$.

Alternatively, let us assume that the cache is complete with respect to each segment in the document D, and the only cache update operation allowed is GetTerms. Then, the task reduces to finding the entity $e^* \in E - E(D)$ for which GetTerms should be invoked next; here, $E(D)$ is the set of entities for which GetTerms has already been invoked. However, now no obvious global property of the kind seen above holds, and we need to devise heuristics, taking a cue from the current best annotation $(\bar{S}_1, \bar{F}_1)$. Instead of trying to maximize the slack decrease across two successive optimal annotations, we make the assumption that the current annotation $(\bar{S}_1, \bar{F}_1)$ would remain optimal even after the update. Under this assumption, the goal of the cache refinement strategy is to find the entity $e^*$ such that invoking GetTerms on this entity maximizes the slack decrease for the current best annotation $(\bar{S}_1, \bar{F}_1)$. For a given segment d and entity e, define $\mathrm{slack}_C(e, d) = \mathrm{score}^{+}_{C}(e, d) - \mathrm{score}^{-}_{C}(e, d)$. Then, we have

$$\mathrm{slack}_C(\bar{S}_1, \bar{F}_1) = \sum_{d \in \bar{S}_1} \mathrm{slack}_C(\bar{F}_1(d), d).$$

Clearly, a principled way to maximize the reduction of $\mathrm{slack}_C(\bar{S}_1, \bar{F}_1)$ is to maximize the reduction of the largest $\mathrm{slack}_C(\bar{F}_1(d), d)$ in the summation. This leads to the following cache refinement strategy: at each cache refinement step, invoke GetTerms on the best matching entity for the segment in $\bar{S}_1$ with the maximum slack.

In EROCS, we follow a cache refinement strategy that is a hybrid of the two. Specifically, at each cache refinement step, it checks whether the cache is complete with
respect to all the segments in the current best annotation $(\bar{S}, \bar{F})$. If this is false, it invokes GetEntities on the term $t^* \in T(D) - T_C(D)$ with the maximum $\mathrm{tf}(t^*, D) \cdot w(t^*)$; if this is true, it invokes GetTerms on the best matching entity for the segment in $\bar{S}$ with the maximum slack. This strategy appears formally as the procedure UpdateCache in Figure 7.
Procedure: UpdateCache($\bar{S}$, $\bar{F}$)
Input: current best annotation $(\bar{S}, \bar{F})$
Begin
C01 If the current cache is complete with respect to each $d \in \bar{S}$
C02     Let $d^* = \arg\max_{d \in \bar{S}} \mathrm{slack}_C(\bar{F}(d), d)$
C03     Call GetTerms($\bar{F}(d^*)$)
C04 Else
C05     Let $t^* = \arg\max_{t \in T(D) - T_C(D)} \mathrm{tf}(t, D) \cdot w(t)$
C06     Call GetEntities($t^*$)
End
Fig. 7. Cache Refinement Algorithm
Until the point the cache becomes complete with respect to at least one segment in the document, this strategy enforces the GetEntities-only constraint and makes decisions that are the best with respect to that constraint. After the point when the cache is complete with respect to all the segments in the document, it enforces the GetTerms-only constraint and makes decisions that are (heuristically) the best with respect to that constraint. In general, the strategy favors GetEntities initially and GetTerms later, which is justified since the initial terms contribute more towards decreasing the slack than the later terms. In Section 7, we show that this hybrid strategy works well in practice. As a part of our future work, we plan to address the problem of finding a provably good strategy for cache refinement.
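The hybrid step of Figure 7 can be sketched in a few lines. This is only an illustrative reading of the pseudocode, not the authors' Java implementation: the function signature, the callbacks (`slack`, `is_complete`, `get_terms`, `get_entities`), and the toy data are all assumptions introduced here.

```python
# Illustrative sketch of one UpdateCache step (Fig. 7); all names are hypothetical.

def update_cache(segments, best_entity, slack, is_complete,
                 tf, w, invoked_terms, get_terms, get_entities):
    """segments: the segments of the current best annotation;
    best_entity[d]: the entity matched to segment d by that annotation."""
    if all(is_complete(d) for d in segments):
        # C02-C03: refine the matched entity of the segment with the largest slack.
        d_star = max(segments, key=lambda d: slack(best_entity[d], d))
        e_star = best_entity[d_star]
        get_terms(e_star)
        return ("GetTerms", e_star)
    # C05-C06: query the not-yet-invoked term with the largest tf * w.
    remaining = [t for t in tf if t not in invoked_terms]
    if not remaining:          # assumption: nothing left to refine
        return None
    t_star = max(remaining, key=lambda t: tf[t] * w[t])
    get_entities(t_star)
    invoked_terms.add(t_star)
    return ("GetEntities", t_star)

# Toy setup: two segments, made-up slacks, and stub database calls that log.
calls = []
tf = {"corleone": 3, "movie": 5}
w = {"corleone": 5.0, "movie": 0.5}
segments = ["d1", "d2"]
best_entity = {"d1": "e1", "d2": "e2"}
slack_values = {("e1", "d1"): 1.0, ("e2", "d2"): 4.0}
complete, invoked = set(), set()

def step():
    return update_cache(segments, best_entity,
                        lambda e, d: slack_values[(e, d)],
                        lambda d: d in complete,
                        tf, w, invoked,
                        lambda e: calls.append(("GetTerms", e)),
                        lambda t: calls.append(("GetEntities", t)))

first = step()               # cache incomplete: GetEntities branch
complete.update(segments)    # pretend the cache is now complete for all segments
second = step()              # cache complete: GetTerms branch
print(first, second)
```

Passing the database operations in as callbacks keeps the sketch independent of how GetEntities and GetTerms are actually issued against the DBMS.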
7 EROCS Experimental Study

The techniques proposed in Section 6 are based on two main contributions. The first contribution is the proposition that entity identification should be done at a fine-grained level, i.e. by matching entities with segments within the document, rather than simply matching with the entire document. The second contribution is the efficient implementation of this proposition as a greedy iterative cache refinement strategy. In this section, after discussing the experimental setup, we evaluate the efficacy of fine-grained matching (Section 7.1) and of greedy iterative cache refinement (Section 7.2) against less sophisticated alternatives. Next, we study the effect of varying the value of the parameter λ on the accuracy of the techniques (Section 7.3). Finally, we study the improvement in the accuracy of the current best annotation as the greedy iterative cache refinement progresses (Section 7.4).
Platform. The implementation was done in Java with J2SE v1.4.2 (approx. 2000 lines of code) and executed on a 2.4 GHz machine with 4 GB RAM running Windows XP SP1. The relational DBMS used was IBM DB2 UDB v8.1.5, co-located on the same machine. Communication between EROCS and DB2 was through JDBC. We used the IBM DB2 UDB Net Search Extender v8.1 as our text index over the database, for reasons of convenience and easy availability. A more specialized text index on the database, we believe, would achieve even better performance than that reported in this section.

Structured Dataset. This study used a subset of the Internet Movie Database, with movies as the entities of interest. The database contains roughly 4 million records and has size 2 GB across eight tables. The Movies table (401660 rows) contains names of movies and the Persons table (287398 rows) contains a list of persons along with their names. The remaining six tables relate rows in these two tables. The Actors (1619647 rows) and Actresses (776396 rows) tables relate the movies with their cast along with the corresponding character names. The Directors (244394 rows), Producers (150838 rows), Writers (232420 rows) and Editors (46795 rows) tables relate the movies with their designated crew. The entity template is defined so that the Movies table is the pivot table (the total number of entities is therefore 401660) and the other tables are the context tables.

Document Dataset. The documents used in this study were assorted movie reviews downloaded from the Greatest Films website. These reviews were processed using a part-of-speech tagger, and the noun phrases were identified as the relevant terms. We removed the names of the movies in the review text, but retained this information separately for evaluation purposes.
To control the quality and have multiple entities in the documents, we decomposed these base documents into segments of approximately 8 sentences each, and classified each segment as good or bad based on the average weight of terms contained in the segment. This gave us a set of good and a set of bad segments for each movie. Now, given the number of entities per document (K) and the fraction of good segments (α) as parameters, we generated a random document by picking a random sequence of K distinct movies, and for each movie in the sequence, including a good segment with probability α and a bad segment with probability (1 − α). Note that the length of the document is a multiple of K, and is not considered as a separate parameter. Our final repository of documents included 50 documents for each combination of K = 1, 2, . . . , 10 and α = 0.0, 0.1, . . . , 1.0.

Parameter Settings. EROCS has only one parameter, λ; unless otherwise stated, its value is fixed at 4. In Section 7.3, we justify this choice, and also study the effect of varying the value of λ.

Accuracy Metric. We use an annotation accuracy measure that not only considers the accuracy of the set of entities identified, but also the accuracy of the embeddings of these entities in the document as specified by the annotation. Let e be the entity a given sentence actually belongs to, and let E′ be the set of entities that best match the segment containing this sentence according to the given annotation. Then, precision and recall for this sentence are computed as |E′ ∩ {e}| / |E′| and |E′ ∩ {e}| respectively.
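The per-sentence precision and recall just defined, combined at the document level through the harmonic mean of their averages over sentences, can be sketched as follows. The function names, the input layout, and the treatment of an empty E′ (precision taken as 0) are assumptions made for this illustration only.

```python
# Illustrative sketch of the annotation accuracy measure; names are hypothetical.

def sentence_precision_recall(true_entity, matched):
    """matched is E': the entities best matching the sentence's segment."""
    hit = 1 if true_entity in matched else 0           # |E' ∩ {e}|
    precision = hit / len(matched) if matched else 0.0  # assumption for empty E'
    recall = float(hit)
    return precision, recall

def annotation_accuracy(sentences):
    """sentences: list of (true_entity, matched_entity_set), one per sentence."""
    scores = [sentence_precision_recall(e, m) for e, m in sentences]
    avg_p = sum(p for p, _ in scores) / len(scores)
    avg_r = sum(r for _, r in scores) / len(scores)
    if avg_p + avg_r == 0:
        return 0.0
    return 2 * avg_p * avg_r / (avg_p + avg_r)          # harmonic mean

# Four sentences of a made-up two-movie document.
sentences = [("m1", {"m1"}), ("m1", {"m1", "m2"}), ("m2", {"m3"}), ("m2", {"m2"})]
accuracy = annotation_accuracy(sentences)
print(round(accuracy, 4))
```

Here the average precision is 0.625 and the average recall is 0.75, so the accuracy is 15/22.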
The accuracy of the annotation for the document is then computed as the harmonic mean of the average precision and recall over the sentences in the document.

7.1 Efficacy of Fine-Grained Entity Matching

In this experiment, we show that EROCS’s strategy for identifying entities that best match the document at a fine-grained, individual segment level leads to greater accuracy than a technique that simply identifies the entities that best match the entire document. For the sake of this experiment, assume that the number of entities (K) in the given document is known a priori. Given this information, our alternative algorithm (called TopK) picks the entities with the top K matching scores with respect to the entire document considered as a single segment, and then uses BestAnnotC to compute the best annotation considering only these entities.

We compared the accuracy of EROCS and TopK for documents of varying quality (α), as well as a varying number of entities contained within (K). We first fixed α = 0.8 and varied K = 1, 2, . . . , 10, and then fixed K = 10 and varied α = 0.0, 0.1, . . . , 1.0. For each combination, we computed the average accuracy of each algorithm for the 50 documents in the repository for that combination. The results are plotted in Figure 8 and Figure 9 respectively.
Fig. 8. Accuracy of EROCS, TopK for varying K
Fig. 9. Accuracy of EROCS, TopK for varying α
Discussion. We discuss the results in Figure 8 first. When the document has exactly one entity, the two algorithms are identical, and both are able to identify the best entity perfectly. However, for documents with multiple entities, TopK’s accuracy falls drastically. This is because TopK considers the entire document as a single segment. As a result, unrelated terms in different, well separated parts of the document interfere, leading to irrelevant entities being scored high. EROCS, in contrast, compares the best score for the entire document as a single segment with the best score for each possible partitioning of the document; this allows it to exploit the substructure within the document to filter out such irrelevant entities. From the figure, we see that EROCS is able to maintain an accuracy close to 0.8 with increasing K, whereas the accuracy of TopK deteriorates to about 0.3 for the same documents.

Next, consider the results in Figure 9. The increase in accuracy for both the algorithms with increasing α is because documents with higher α have better clues in terms of higher weight terms. However, the significant gap between the accuracy of EROCS and TopK persists irrespective of the quality of the documents. Even for the lowest-quality documents considered, i.e. α = 0, EROCS was able to achieve an accuracy of 0.45; in contrast, for the highest-quality documents considered, i.e. α = 1, TopK could merely achieve an accuracy of 0.36.

Overall, these experimental results clearly illustrate that the fine-grained, segment-level entity matching strategy proposed in this paper has a significant advantage in terms of accuracy over the simpler-minded alternative.
7.2 Efficacy of Greedy Iterative Cache Refinement

In the previous section, we established the need for fine-grained, segment-level entity matching. The algorithm proposed in Section 6.2.3 is clearly an efficient algorithm in terms of complexity for performing the same; in this section, we validate the efficacy of greedy iterative cache refinement (Section 6.4.3) as an implementation strategy for the proposed algorithm. The validation is against the following alternatives that were used to motivate the strategy in Section 6.4.

o AllTerms, which does not exploit term-pruning (Section 6.4.1).
o AllSegments, which exploits term-pruning, but does not exploit greedy cache-refinement (Section 6.4.2).
We compare EROCS, which uses greedy iterative cache refinement, with these two alternatives with respect to execution efficiency and space overheads. The execution efficiency was measured in terms of clock time while the space overheads were measured in terms of the number of (entity, term) pairs computed and maintained in the cache. As in the previous experiment, we first fixed α = 0.8 and varied K = 1, 2, . . . , 10, and then fixed K = 10 and varied α = 0.0, 0.1, . . . , 1.0. For each combination, we computed the average time and space overheads of each algorithm over the 50 documents in the repository for that combination. The execution efficiency and space overhead results are plotted for varying K in Figure 10 and Figure 11, and for varying α in Figure 12 and Figure 13 respectively.
Fig. 10. Execution Efficiency of AllTerms, AllSegments, EROCS for varying K
Fig. 11. Space Overhead of AllTerms, AllSegments, EROCS for varying K
Fig. 12. Execution Efficiency of AllTerms, AllSegments, EROCS for varying α
Fig. 13. Space Overhead of AllTerms, AllSegments, EROCS for varying α
Discussion. Figure 10 shows that for all values of K considered, EROCS achieves a reduction of at least 60% in execution time over its nearest competitor, AllSegments; this large gap between EROCS and AllSegments shows that the greedy cache-refinement strategy is indeed effective in reducing the number of segments for which the scores were computed exactly. Notice that EROCS takes 0.55s for K = 1 and 8.6s for K = 10, scaling linearly with increasing K; this implies a sub-second execution time for each additional entity, which is encouraging. The gap between AllSegments and AllTerms is relatively small, and decreases as K increases. This is because an increase in K implies an increase in document size, which leads to an increase in the number of segments for which AllSegments needs to compute the exact score; as a result, any gains AllSegments has over AllTerms due to the term-pruning optimization become negligible as K increases.

Figure 11 compares the space overheads of the three algorithms for varying K. It is interesting to see the rapid rate at which the space overhead for AllTerms grows with increasing K, from 9111 at K = 1 to 142211 at K = 10. This is because AllTerms gets the entity set for each distinct term in the document, and maintains this information in the cache. This includes even low weight terms that occur in a large fraction of the entities in the database; moreover, many of these entities have only this one word in the document. AllTerms thus results in a large number of (entity, term) relationships in the cache, a majority of them unnecessary. AllSegments and EROCS, in contrast, exploit the term-pruning optimization for each segment, avoiding terms with lower weight. For AllSegments, this optimization results in a highly reduced space overhead of 3398 at K = 1 and 37992 at K = 10. EROCS further exploits the greedy
cache-refinement strategy to avoid computing exact scores for several segments, resulting in a space overhead of a mere 505 at K = 1 (a reduction of almost 95% over AllTerms and 85% over AllSegments), and 10181 at K = 10 (a reduction of almost 93% over AllTerms and 73% over AllSegments). Moreover, for EROCS, these overheads scale up linearly, with a slope of about 1000 per additional entity, which is reasonable.

In Figure 12, we notice that with increasing document quality (α), the execution time for EROCS and AllSegments decreases, while the execution time for AllTerms increases. Figure 13 shows a similar trend for space overheads as well. This can be explained as follows. As the number of higher weight terms in the document increases, the distribution of weights in most segments becomes more skewed; this can be effectively exploited by the term-pruning optimization. On the other hand, an increase in higher weight terms in the document also implies an increase in the number of distinct terms in the document (recall that a higher weight term is present in fewer entities); since AllTerms necessarily queries all distinct terms, this leads to an increase in its execution time and space overhead.

The results in Figure 12 show that AllSegments performs worse than AllTerms at smaller values of α. This happens because our implementation of AllTerms batches multiple invocations of GetEntities together. At smaller values of α, the weights of the terms in the document are low, and thus the term-pruning strategy is not effective. As a result, both AllTerms and AllSegments invoke GetEntities for almost the same number of terms. However, because of the batched implementation, AllTerms is able to perform these invocations more efficiently than AllSegments. For larger α, as discussed above, the term-pruning strategy becomes effective, resulting in improved performance of AllSegments as compared to AllTerms.
Overall, the experimental results in this section clearly establish the efficacy of the greedy cache-refinement strategy used in EROCS.
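The percentages and slopes quoted in the discussion above follow from the reported endpoint figures; the short check below reproduces that arithmetic. It uses only the numbers stated in the text (K = 1 and K = 10), so the slopes are averages between those two endpoints rather than fitted values, and the helper name `reduction` is introduced here for illustration.

```python
# Arithmetic check on the reported cache sizes and timings (endpoints only).

def reduction(new, old):
    """Percentage reduction of `new` relative to `old`."""
    return 100.0 * (1 - new / old)

# Number of (entity, term) pairs in the cache, as reported for K = 1 and K = 10.
space = {
    1: {"AllTerms": 9111, "AllSegments": 3398, "EROCS": 505},
    10: {"AllTerms": 142211, "AllSegments": 37992, "EROCS": 10181},
}

for k, s in space.items():
    print(k,
          round(reduction(s["EROCS"], s["AllTerms"]), 1),
          round(reduction(s["EROCS"], s["AllSegments"]), 1))

# Average growth per additional entity between K = 1 and K = 10 for EROCS.
space_slope = (space[10]["EROCS"] - space[1]["EROCS"]) / 9   # cache pairs per entity
time_slope = (8.6 - 0.55) / 9                                # seconds per entity
print(round(space_slope), round(time_slope, 2))
```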
8 Use Cases

In this section we provide an overview of the different scenarios in which context oriented information integration can be used in practice.

8.1 Document Sanitization

EROCS can be used for sanitization of unstructured documents. Sanitization (syn. redaction) of a document involves removing sensitive information from the document, in order to reduce the document’s classification level, possibly yielding an unclassified document [33]. A document may need to be sanitized for a variety of reasons. Government departments usually need to declassify documents before making them public, for instance, in response to Freedom of Information requests. In hospitals, medical records are sanitized to remove sensitive patient information (patient identity information, diagnoses of deadly diseases, etc.). Document sanitization is also critical to companies that need to prevent mala fide or inadvertent disclosure of proprietary information while sharing data with outsourced operations.

Traditionally, documents are sanitized manually by qualified reviewers. However, manual sanitization does not scale as the volume of data increases. The US Department of Energy’s OpenNet initiative [34], for instance, needs to sanitize millions of
documents each year. Given the amount of effort involved and the limited supply of qualified reviewers, this is a tall order.

We now present an overview of ERASE (Efficient RedAction for Securing Entities), a system based on EROCS for performing document sanitization automatically. ERASE assumes public knowledge in the form of a database of entities (persons, products, diseases, etc.). Each entity in this database is associated with a set of terms related to the entity; this set is termed the context of the entity. For instance, the context of a person entity could include the first name, the last name, the day, month and year of birth, the street and city of residence, the employer’s name, spouse’s name, etc. Some of the entities in the database are considered protected; these are the entities that need to be protected against identity disclosure. For instance, in a database of diseases, certain diseases (such as AIDS) can be marked as protected: we are interested in protecting the disclosure of these diseases, and it does not matter if any other disease (such as Influenza) is revealed. The set of protected entities is derived according to the access privileges of the adversary. The entities that need to be hidden from the adversary are declared protected.

ERASE assumes an adversary that knows nothing about an entity apart from what appears in the entity’s context, and has bounded inference capabilities. Specifically, given a document, the adversary can match the terms present in the document with the terms present in the context of each of the protected entities. If the document contains a group of terms that appear together only in the context of a particular entity, then the adversary gets an indication that the entity is being mentioned in the given document. We term this a disclosure.
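One simplified way to read this adversary model is sketched below: a protected entity counts as disclosed when the document terms that fall in its context are not all contained in the context of any other entity. This reading, the function name `disclosed_entities`, and the toy contexts (which are invented and are not medical facts or ERASE data) are assumptions made for illustration.

```python
# Simplified, hypothetical reading of the disclosure test; not the ERASE algorithm.

def disclosed_entities(doc_terms, contexts, protected):
    """Return the protected entities that the document gives away."""
    disclosed = set()
    for entity in protected:
        shared = doc_terms & contexts[entity]   # terms the adversary can match
        if not shared:
            continue
        # The group `shared` appears together only in this entity's context
        # if no other entity's context contains all of it.
        covered = any(shared <= ctx
                      for other, ctx in contexts.items() if other != entity)
        if not covered:
            disclosed.add(entity)
    return disclosed

# Invented toy contexts, loosely echoing the symptom example in this section.
contexts = {
    "HIV": {"weight loss", "sweating", "fatigue", "fever"},
    "Influenza": {"fever", "fatigue", "headache", "sweating"},
    "OtherDisease": {"weight loss", "fatigue", "thirst"},
}
protected = {"HIV"}

doc = {"weight loss", "sweating", "fatigue"}
before = disclosed_entities(doc, contexts, protected)
after = disclosed_entities(doc - {"sweating"}, contexts, protected)
print(before, after)
```

In this toy data each single term is shared with an unprotected entity, yet the three terms together occur only in the protected entity's context; dropping one of them removes the disclosure.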
ERASE attempts to prevent disclosure of protected entities by removing certain terms from the document; these terms need to be selected such that no protected entity can be inferred as being mentioned in the document by matching the remaining terms with the entity database. A simplistic approach is to locate “give-away” phrases in the document and delete them all. To the best of our knowledge, most prior work on document sanitization has followed this approach, and has focused on developing more accurate ways of locating such phrases in the document’s text [38, 36, 40]. We believe this is overkill. For instance, in an intelligence report, removing all names, locations, etc. would probably leave the report with no useful content at all. In contrast, ERASE makes an effort to sanitize a document while causing the least distortion to the contents of the document; indeed, this is considered one of the principal requirements of document sanitization [40]. Towards this goal, ERASE uses EROCS to link the document to the database and then identifies the minimum number of terms in the document that need to be removed in order for the document to be sanitized.

“Let’s look at the immediate facts. You have a number of symptoms, namely weight loss, insomnia, sweating, fatigue, digestive problems and headaches. These may or may not be related to sexually transmitted diseases, but you know you have been exposed to gonorrhoea and you know you may have been exposed to hepatitis B and HIV. Your symptoms are significant and need full investigation in the near future.”

Fig. 14. Illustrative Example of ERASE
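The goal of removing as few terms as possible can be illustrated with an exhaustive search over candidate removals. This brute-force sketch is not the algorithm used by ERASE, which the text does not spell out; it assumes a simplified disclosure test (the document terms shared with a protected entity's context are not all contained in any other entity's context), and the contexts are invented toy data.

```python
# Brute-force illustration of minimum-size redaction; not the ERASE algorithm.
from itertools import combinations

def is_disclosed(doc_terms, entity, contexts):
    """Simplified test: shared terms are not all covered by any other context."""
    shared = doc_terms & contexts[entity]
    return bool(shared) and not any(
        shared <= ctx for other, ctx in contexts.items() if other != entity)

def minimal_redaction(doc_terms, contexts, protected):
    """Smallest set of terms whose removal leaves no protected entity disclosed."""
    terms = sorted(doc_terms)                 # sorted for a deterministic answer
    for size in range(len(terms) + 1):        # try smaller removals first
        for removed in combinations(terms, size):
            kept = doc_terms - set(removed)
            if not any(is_disclosed(kept, e, contexts) for e in protected):
                return set(removed)
    return set(doc_terms)                     # not reached: removing all is safe

# Invented toy contexts (not medical facts).
contexts = {
    "HIV": {"weight loss", "sweating", "fatigue", "fever"},
    "Influenza": {"fever", "fatigue", "headache", "sweating"},
    "OtherDisease": {"weight loss", "fatigue", "thirst"},
}
doc = {"weight loss", "insomnia", "sweating", "fatigue"}

removed = minimal_redaction(doc, contexts, {"HIV"})
print(removed)
```

The search is exponential in the number of document terms, so it is only meant to make the objective concrete on a handful of terms.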
Illustrative Example: We illustrate our approach using an anecdotal real-life example. We created a database of 2645 diseases, obtaining information from the Wrong Diagnosis website. Each disease is an entity with the associated context consisting of symptoms, tests, treatments and risk factors. The website offers a classification of the diseases. We declared as protected entities the diseases under the following categories: sexual conditions, serious conditions (conditions related to heart, thyroid, kidney, liver, ovary), cancer conditions, and mental conditions. The number of protected entities was 550. We created a document by taking a paragraph out of a communication between a doctor and a patient from another website. The document is shown in Figure 14, where the relevant terms found in the entity contexts are shown in bold. The document was sanitized using ERASE; the terms deleted as a result are shown underlined in Figure 14. Observe that the sanitization has differentiated between the generic symptoms and symptoms specific to the kind of diseases that appear in the protected entity set, and deleted the latter. Furthermore, though sweating is a common symptom associated with hundreds of diseases, the combination of sweating, weight loss and fatigue reveals a protected entity (in this case, HIV) and so one of these terms must be deleted to sanitize the document; consequently, the system removed the term “sweating”. We note that a different protected entity set might lead to a different set of symptoms being removed. Thus ERASE uses EROCS to link the document to the entity database and then removes parts of the document so as to sanitize it.

8.2 Semantic Search and Business Intelligence

EROCS can be used to improve the search capability over structured and unstructured data. Consider a scenario where a bank wants to provide a search interface over the set of emails received from its customers.
Such an interface can be useful for finding the pain points of customers and for business intelligence. In the absence of context oriented information integration, consider a scenario where the bank wants to find the mails sent to it by its high value customers. If we do a regular keyword search using “high value customer”, no email will be returned, as a customer will not know that he/she is a high value customer. However, using EROCS, each of these emails can be linked with the customer profile present in the data warehouse. Once the linking is done, we can jointly search over the unstructured data (emails) as well as the structured data present in the data warehouse. Over such integrated data, if we give the keyword query “high value customer”, we can easily retrieve all the emails sent by high value customers. This is an example of semantic search. Hence EROCS can be used for enabling semantic search over unstructured data.

The linked data can also be used for improved business intelligence over structured and unstructured data. Consider a sample BI query: “Show me the top 4 pain points of my most privileged customers from North region who have reduced their balance by more than 50% in the last quarter”. Such a business intelligence query cannot be answered using only structured or only unstructured data. However, after applying EROCS, we can easily answer such queries.
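The effect of the linking step on search can be sketched with a small in-memory database. The schema below (tables `customers`, `emails`, `links`, and the `segment` column) is entirely hypothetical; it only stands in for the email-to-profile links that EROCS would produce, and uses SQLite purely for convenience.

```python
# Hypothetical schema illustrating search over linked emails and customer profiles.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT);
-- One row per (email, customer) link, standing in for the output of the linking step.
CREATE TABLE links (email_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Customer A', 'high value'), (2, 'Customer B', 'regular');
INSERT INTO emails VALUES (10, 'My card was charged twice.'), (11, 'Please update my address.');
INSERT INTO links VALUES (10, 1), (11, 2);
""")

# A plain keyword search over the email text finds nothing ...
keyword_hits = conn.execute(
    "SELECT id FROM emails WHERE body LIKE ?", ("%high value%",)).fetchall()

# ... whereas a search over the linked data finds the email through the profile.
linked_hits = conn.execute("""
    SELECT e.id, e.body
    FROM emails e
    JOIN links l ON l.email_id = e.id
    JOIN customers c ON c.id = l.customer_id
    WHERE c.segment = ?
    ORDER BY e.id
""", ("high value",)).fetchall()

print(keyword_hits, linked_hits)
```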
8.3 Churn Prediction

Churn prediction is one of the biggest goals of the business intelligence teams in an enterprise. Identifying a customer early in the churn cycle helps the enterprise to provide focused offers to the customer so as to avoid customer churn. Churn prediction algorithms that are used by enterprises today only use the structured data present in the enterprise. This huge drawback of focusing only on the structured data can be addressed by SCORE. Consider an example where a bank wants to find the credit card customers who are likely to cancel their cards. In order to do this, SCORE can be used to deduce the common features from the set of customers who have already cancelled their cards. This can be done by first running EROCS on the structured and unstructured data, such as customer emails, transcribed phone calls etc., and storing the linked information in a data warehouse. We can then run SCORE on the result of the following query (which is executed on the linked data): “List all customers who have cancelled their credit card in the last 3 months”. SCORE might return the following keywords as the context of the query result:

Context 1. CardType: Gold, Category: High Interest, State: CA, No_of_Complaints: >2, Sentiments: Unhappy, Band: 5

Context 2. CardType: Silver, Category: Late Payment, State: CA, No_of_Complaints: >2, Sentiments: Unhappy, Band: 4
[Figure 15 diagram, not recoverable from the extracted text. Components shown: RDBMS, SCORE, EROCS, Content Repository, GUI. Labels shown: the two contexts (Context 1 and Context 2), sample result rows with columns C1, C2, C3, and “Other customers having the same context”.]
Fig. 15. Use of SCORE in Churn prediction
Notice that the context spans both structured and unstructured data. This context tells us that people from CA who have a gold credit card and have sent multiple unhappy emails are the ones who cancelled their credit cards. Thus people who show similar attributes are more likely to cancel their cards. The enterprise can then find customers who have similar attributes and ensure that their complaints are addressed so as to reduce customer churn.
9 Conclusions

In this paper we have presented the concept of context oriented information integration. The key differentiator of context oriented information integration from the approaches presented in the literature is the fact that the user does not need to provide details about the schema for doing the integration. This is especially useful when dealing with unstructured data. We presented two techniques for doing context oriented information integration: SCORE and EROCS. SCORE associates unstructured content with structured database query results. It uses an efficient algorithm to compute the context of a SQL query by analyzing the query result as well as related information elsewhere in the database. The computed context is used to retrieve relevant unstructured content, which is then associated with the result of the given SQL query. The second technique, EROCS, on the other hand, inter-links information across structured databases and documents. EROCS uses efficient techniques for identification and embedding of entities relevant to a given document. This involved principled development of an effective successive approximation algorithm that tries to keep the amount of information retrieved from the database in the course of the computation as small as possible. The linkages discovered by EROCS can be profitably exploited in several applications in the database-centric as well as IR-centric domains, and may serve the purpose of bridging the two. We showed the practical use of the two techniques in areas such as document sanitization, semantic search, business intelligence and churn prediction.
References

1. Doan, A., Naughton, J.F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F., Chen, T., Chu, E., DeRose, P., Gao, B., Gokhale, C., Huang, J., Shen, W., Vuong, B.: Information extraction challenges in managing unstructured data. SIGMOD Rec. 37(4), 14–20 (2009)
2. Bruce, H., Halevy, A., Jones, W., Pratt, W., Shapiro, L., Suciu, D.: Information retrieval and databases: Synergies and syntheses (2003), http://www2.cs.washington.edu/nsf2003
3. Hamilton, J., Nayak, T.: Microsoft SQL Server Full-Text Search. IEEE Data Engg. Bull. 24(4) (2001)
4. Jhingran, A., Mattos, N., Pirahesh, H.: Information integration: A research agenda. IBM Sys. J. 41(4) (2002)
5. Dixon, P.: Basics of Oracle Text Retrieval. IEEE Data Engg. Bull. 24(4) (2001)
6. Maier, A., Simmen, D.: DB2 Optimization in Support of Full Text Search. IEEE Data Engg. Bull. 24(4) (2001)
7. Somani, A., Choy, D., Kleewein, J.C.: Bringing together content and data management: Challenges and opportunities. IBM Sys. J. 41(4) (2002)
8. Raghavan, P.: Structured and unstructured search in enterprises. IEEE Data Engg. Bull. 24(4) (2001)
9. Goldman, R., Widom, J.: WSQ/DSQ: A Practical Approach for Combined Querying of Databases and the Web. In: SIGMOD (2000)
10. Maier, A., Simmen, D.: DB2 Optimization in Support of Full Text Search. IEEE Data Engg. Bull. 24(4) (2001)
11. Roy, P., Mohania, M.K., Bamba, B., Raman, S.: Towards automatic association of relevant unstructured content with structured query results. In: CIKM 2005 (2005)
12. Chakaravarthy, V.T., Gupta, H., Roy, P., Mohania, M.K.: Efficiently Linking Text Documents with Relevant Structured Information. In: VLDB 2006 (2006)
13. Sarawagi, S.: Automation in information extraction and integration (tutorial). In: VLDB (2002)
14. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley/ACM (1999)
15. Chakrabarti, S.: Breaking through the syntax barrier: Searching with entities and relations. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS, vol. 3202, pp. 9–16. Springer, Heidelberg (2004)
16. Voorhees, E., Tice, D.: The TREC-8 question answering track evaluation. In: Proc. Eighth Text Retrieval Conference, TREC-8 (1999)
17. Walker, M.H., Eaton, N.J.: Microsoft Office Visio 2003 Inside Out. Microsoft Press (2003)
18. Barsalou, T., Keller, A.M., Siambela, N., Wiederhold, G.: Updating relational databases through object-based views. In: SIGMOD (1991)
19. Barsalou, T.: View objects for relational databases. Tech. Rep. STAN-CS-90-1310, CS Dept., Stanford University, Ph.D. thesis (1990)
20. Premerlani, W.J., Blaha, M.R.: An Approach for Reverse Engineering of Relational Databases. CACM 37(5) (1994)
21. Agichtein, E., Ganti, V.: Mining reference tables for automatic text segmentation. In: SIGKDD (2004)
22. Li, X., Morie, P., Roth, D.: Semantic Integration in Text: From Ambiguous Names to Identifiable Entities. AI Magazine: Special Issue on Semantic Integration (2005)
23. Poosala, V.: Histogram-based estimation techniques in database systems. PhD thesis, University of Wisconsin, Madison, WI, USA (1997)
24. Chen, P.P.-S.: The Entity-Relationship Model–Toward a Unified View of Data. ACM TODS 1(1) (1976)
25. Shanmugasundaram, J., Tufte, K., He, G., Zhang, C., DeWitt, D., Naughton, J.: Relational databases for querying XML documents: Limitations and opportunities. In: VLDB (1999)
26. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley/ACM (1999)
27. IBM. IBM DB2 UDB Net Search Extender: Administration and User Guide (version 8.1) (2003)
28. Business case for content scorecarding, http://www.analyticstrategy.com/research/Content%20Scorecarding%20Business%20Case.pdf
29. Aditya, B., Bhalotia, G., Chakrabarti, S., Hulgeri, A., Nakhe, C., Parag, S.S.: BANKS: Browsing and Keyword Searching in Relational Databases. In: VLDB 2002, pp. 1083–1086 (2002)
30. Roy, S.B., Wang, H., Das, G., Nambiar, U., Mohania, M.K.: Minimum-effort driven dynamic faceted search in structured databases. In: CIKM 2008, pp. 13–22 (2008)
31. Call Center use Survey, http://www.incoming.com/statistics/performance.aspx
32. Soltau, H., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Zweig, G.: The IBM 2004 Conversational Telephony System for Rich Transcription. In: IEEE ICASSP (March 2005)
33. Wikipedia. Sanitization (classified information). Wikipedia, the free encyclopedia (2006)
326
M. Mohania et al.
34. U.S. Department of Energy. Department of energy researches use of advanced computing for document declassification, http://www.osti.gov/opennet 35. Agichtein, E., Gravano, L., Pavel, J., Sokolova, V., Voskoboynik, A.: Snowball: A prototype system for extracting relations from large text collections. In: SIGMOD (2001) 36. Douglass, M.M., Clifford, G.D., Reisner, A., Long, W.J., Moody, G.B., Mark, R.G.: Deidentification algorithm for free-text nursing notes. Computers in Cardiology (2005) 37. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: Efficient full domain k-anonymity. In: SIGMOD (2005) 38. Sweeney, L.: Replacing personally-identifying information in medical records, the srub system. Journal of the Americal Medical Informatics Association (1996) 39. Sweeney, L.: K-anonymity: A model for protecting privacy. Intl. Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5) (2002) 40. Tveit, A.: Anonymization of general practitioner medical records. In: HelsIT 2004, Trondheim, Norway (2004)
Data Sharing in DHT Based P2P Systems
Claudia Roncancio¹, María del Pilar Villamil², Cyril Labbé¹, and Patricia Serrano-Alvarado³
¹ University of Grenoble, France [email protected]
² University of Los Andes, Bogotá, Colombia [email protected]
³ University of Nantes, France [email protected]
Abstract. The evolution of peer-to-peer (P2P) systems triggered the building of large scale distributed applications. The main application domain is data sharing across a very large number of highly autonomous participants. Building such data sharing systems is particularly challenging because of the “extreme” characteristics of P2P infrastructures: massive distribution, high churn rate, no global control, potentially untrusted participants, and so on. This article focuses on declarative querying support, query optimization and data privacy in a major class of P2P systems, those based on a Distributed Hash Table (P2P DHT). The usual approaches and algorithms used by classic distributed systems and databases to provide data privacy and querying services are not well suited to P2P DHT systems. A considerable amount of work was required to adapt them to the new challenges such systems present. This paper describes the most important solutions found. It also identifies important future research trends in data management in P2P DHT systems.
Keywords: DHT, P2P Systems, Data sharing, Querying in P2P systems, Data privacy.
1 Introduction
Peer-to-peer (P2P) systems take advantage of advances in networking and communication to provide environments where heterogeneous peers with high autonomy compose a system with fully distributed control. P2P systems are the platform of choice for a new style of applications where distributed data can be shared massively, e.g., social networks [62], geo-collaboration systems [42], and professional communities (medical, research, open-source software [2]). The development of massively distributed data sharing systems raises new and challenging issues. This results from the intrinsic characteristics of P2P systems
This work is supported by the ECOS C07M02 action.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 327–352, 2009. © Springer-Verlag Berlin Heidelberg 2009
(distribution among a huge number of peers, dynamic system configuration, heterogeneity of data and peers, autonomy of data sources, very large volume of shared data), which prevent the direct use of distributed algorithms inherited from the more classical distributed systems and database worlds. Consequently, new approaches are being proposed, and existing algorithms are being revisited and adapted to provide high level data management facilities in the P2P context.

The P2P context includes a large variety of systems, ranging from overlay networks and distributed lookup services to high level data management services. Both structured and unstructured overlay P2P networks exist, and they lead to systems with different characteristics. This paper concerns data sharing in structured P2P systems based on a Distributed Hash Table (DHT). It provides a synthesis of the main proposals for efficient data querying and data privacy support. These two aspects are essential but challenging to implement in massively distributed data sharing systems.

The absence of a global view and of global control in a system composed of a large set of volatile participants, which can be both data providers and requesters, implies that new querying mechanisms are needed. Query processors rely on the underlying overlay network and are expected to follow the P2P approach without introducing centralization points. This paper gives an overview of the main querying solutions proposed for systems sharing semi-structured or relational data, and also for data type independent systems where queries are based on meta-data (attributes, keywords, etc.). It also briefly discusses rich information retrieval approaches. Query languages, index structures and other optimization solutions, such as caches, are analyzed.

As P2P systems are very attractive to some communities (e.g., professional ones) willing to share sensitive or confidential data in a controlled way, data privacy support is an important issue.
Nevertheless, the open and autonomous nature of P2P systems makes it hard to guarantee the privacy of peers and data. The P2P environment can be considered hostile because peers are potentially untrusted. Data can be accessed by everyone and anyone, used for everything and anything¹, and a peer’s behavior and identity can easily be revealed. This paper discusses recent results on privacy support in P2P DHT systems. It analyzes proposals to ensure a peer’s anonymity, to improve data access control and to use or adapt trust techniques.

The paper is organized as follows. Section 2 introduces P2P DHT systems and a functional architecture for them. Section 3 analyzes several representative proposals providing declarative queries on top of DHT systems. It discusses the main design choices concerning data, meta-data, language expressiveness and indexes. More specific optimization aspects are discussed in Section 4. Section 5 concentrates on privacy issues. Section 6 concludes this paper and gives research perspectives on P2P DHT data management.

¹ Profiling, illegal competition, cheating, marketing or simply activities against the owner’s preferences or ethics.
2 System Support for P2P DHT Systems
This section introduces a functional architecture for P2P DHT systems and the main services it provides. It focuses on the specific services used by the high level querying solutions that will be analyzed in Section 3. P2P DHT systems have been very successful, mainly because of their high scalability in organizing large sets of volatile peers. They use a self-organizing overlay network on top of the physical network. Their key characteristics are the efficient lookup service they provide and, very important for data sharing, their excellent support for providing comprehensive answers to queries. Several proposals exist [63,73,66,49]. They rely on a distributed hash table to index participant peers and shared objects, using hash keys as identifiers. A DHT system splits a key space into zones and assigns each zone to a peer. A key is a point in this space, and the object corresponding to this key is stored at the peer whose zone contains this point. Locating an object is thus reduced to routing to the peer hosting the object. One of the main differences between DHT solutions is their routing geometry. For instance, CAN [63] routes along a d-dimensional Cartesian space, Chord [73] and Pastry [66] along a ring, and Viceroy [49] uses a butterfly network. The techniques used in the routing process are the foundation of P2P DHT systems, but several other services have been proposed to build more complete infrastructures. One of the reasons is the semantics-free access provided at this level: to find an object in such a system, its key has to be known. This key is then efficiently located among the peers in the system. Such a query, called a location query, retrieves the physical identifier of the peer where the object is stored. Several proposals extending P2P DHT systems exist, but there is no standard yet [21]. Let’s consider the three layer functional architecture presented in Figure 1.
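Before turning to that architecture, the zone-based placement described above can be illustrated with a minimal sketch. It assumes a ring-shaped key space in which each peer owns the zone that ends at its own key; this is a simplification loosely inspired by ring geometries and not the routing algorithm of any of the cited systems. The names `ToyRingDHT`, `hash_key` and `KEY_BITS` are hypothetical, and all peers live in one process, so no real routing takes place.

```python
import hashlib
from bisect import bisect_left

KEY_BITS = 32  # assumption: a small identifier space keeps the example readable


def hash_key(name: str) -> int:
    """Map a name to a point of the key space with a known hash function."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** KEY_BITS)


class ToyRingDHT:
    """Ring-shaped key space; each peer owns the zone that ends at its key."""

    def __init__(self, peer_names):
        # Sorted (peer key, peer name) pairs delimit the zones.
        self.peers = sorted((hash_key(name), name) for name in peer_names)
        # One local store per peer (hypothetical stand-in for remote storage).
        self.storage = {name: {} for name in peer_names}

    def lookup(self, key: int) -> str:
        """Return the peer in charge of `key`: first peer key >= key, wrapping."""
        peer_keys = [peer_key for peer_key, _ in self.peers]
        index = bisect_left(peer_keys, key) % len(self.peers)
        return self.peers[index][1]

    def put(self, name: str, obj) -> str:
        """Store `obj` at the peer whose zone contains the key of `name`."""
        peer = self.lookup(hash_key(name))
        self.storage[peer][name] = obj
        return peer

    def get(self, name: str):
        """Fetch the object from the peer in charge of its key, if present."""
        peer = self.lookup(hash_key(name))
        return self.storage[peer].get(name)
```

Note that an object can only be found here if its exact name, and hence its key, is known, which is the semantics-free access discussed above.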
The lower level, labeled Distributed Lookup Service, provides the overlay network support; the second layer, Distributed Storage Service, provides data persistence services; and the third layer, Distributed Data Services, manages data as semantic resources.

The Distributed Lookup Service provides efficient mechanisms to find peers (identified by keys) in a distributed system by managing the routing information (the set of neighbors of a peer). Its main functions are:
– lookup(key): returns the physical identifier of the peer in charge of a key.
– join(key): registers a new peer in the distributed system and updates routing information.
– leave(key): handles the departure of a peer and updates routing information.
– neighbors(key): returns a list of keys identifying the peers in charge of the neighbor zones.

This basic layer is used to build systems that give semantics to keys and provide data storage. Distributed Storage Services are responsible for the administration of stored objects (insertion, migration, etc.). Objects migrate from a peer that leaves the system to one of its neighbors to ensure the object’s durability. Object migration and replication (in neighbors of their storage peers) work to ensure that stored values never disappear from the system. Load balancing and caching strategies are also proposed. Examples of systems at this level are PAST [67], CFS [20] and DHash [29]. The main external functions of this layer are:
– get(key): returns the object (or collection) identified by key. It uses the lookup function to locate the peers storing such objects. Answers are comprehensive, i.e., all answers present in the system are retrieved².
– put(key, Object): inserts into the system an object identified by its key. It uses the lookup function to identify the peer where the object has to be stored.

It is worth remarking that an object stored in such systems becomes a shared object. In practice, it is not easy to remove objects, as temporarily absent peers may hold a copy of a deleted object. Coherency problems may arise when such peers come back.

[Figure 1: three stacked layers. Distributed Data Management Service (Insert, Query; e.g., PIER, KadoP, PinS, MAAN, KSS, MLP) on top of Distributed Storage Service (Put(key, object), Get(key); data persistence; e.g., PAST, CFS, DHash) on top of Distributed Lookup Service (Lookup(key); routing, overlay network support; e.g., CAN, Chord, Viceroy, Pastry).]
Fig. 1. Functional architecture of P2P DHT systems

The higher level of the proposed functional architecture offers services to allow semantic management of shared objects. This layer is important because the underlying layers provide neither keyword search functions, nor multi-attribute searches, nor comparative query evaluation. They only offer access to objects given their key³, and more powerful queries are hard to handle since the object distribution criterion is based on semantics-free keys. Several important proposals to improve querying capabilities have been made [33,26,29,15,74,30,75,4]. They rely on distributed lookup and storage services. Chord is adopted by KSS [29], MAAN [15] and FCQ [74]. PAST has been chosen by PinS [75] and KadoP [4], whereas CAN is exploited by PIER [33]. Such high level querying services are presented in Section 3.

² This characteristic cannot easily be provided by unstructured P2P systems.
³ Obtained through a known hash function.

To finish this section, it is worth mentioning some efforts [37,11,19] to optimize specific aspects of distributed lookup services. Proposals such as Baton [37] and Mercury [11] modify the structure of the overlay network to improve some kinds of searches on keys. They mainly concern the optimization of range queries, i.e., queries concerning intervals of keys. Baton proposes a balanced tree structure to index peers, whereas Mercury organizes peers in groups. Each group of peers is in charge of the index of one attribute. Peers in a group are organized in a ring, and the attribute domain is divided so that each peer is responsible for an interval of values. Mercury proposes an integrated solution without distinguishing the three layers.
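The per-attribute organization described for Mercury can be sketched as follows. This is only an illustration under strong assumptions: intervals have equal width, the group is a plain Python list rather than a routed ring, and `AttributeGroup` with its methods is a hypothetical name, not part of Mercury. The point is that a range query only needs to contact the peers whose intervals overlap the requested range.

```python
from bisect import bisect_right


class AttributeGroup:
    """Peers in charge of one attribute; each owns a contiguous value interval."""

    def __init__(self, attribute, low, high, peers):
        self.attribute = attribute
        self.low, self.high = low, high
        self.peers = list(peers)
        width = (high - low) / len(self.peers)
        # starts[i] is the lower bound of the interval owned by peers[i].
        self.starts = [low + i * width for i in range(len(self.peers))]
        self.index = {peer: [] for peer in self.peers}

    def owner(self, value):
        """Return the peer whose interval contains `value`."""
        if not self.low <= value <= self.high:
            raise ValueError(f"{value} is outside the domain of {self.attribute}")
        return self.peers[bisect_right(self.starts, value) - 1]

    def insert(self, value, object_id):
        """Index an object under its attribute value at the responsible peer."""
        self.index[self.owner(value)].append((value, object_id))

    def range_query(self, lo, hi):
        """Return (sorted matching object ids, peers contacted) for [lo, hi]."""
        lo, hi = max(lo, self.low), min(hi, self.high)
        if lo > hi:
            return [], []
        first = self.peers.index(self.owner(lo))
        last = self.peers.index(self.owner(hi))
        contacted = self.peers[first:last + 1]
        hits = [oid for peer in contacted
                for value, oid in self.index[peer] if lo <= value <= hi]
        return sorted(hits), contacted
```

With four peers over the domain [0, 100], a query on [26, 60] contacts only the two peers owning [25, 50) and [50, 75), whereas a hash-based placement would scatter those values over arbitrary peers.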
3 Declarative Queries
The purpose of this section is to present an overview of a representative sample of works dealing with the evaluation of declarative high level queries in P2P DHT systems. As seen in the previous section, the basic configuration of P2P DHT systems remains limited in terms of search. Multi-attribute searches based on equality and inequality criteria, or including join and aggregate operations, are not possible. Overcoming this lack of semantics is crucial for the future of P2P systems. Significant progress has been achieved thanks to a large body of work on improving declarative querying support.

3.1 Global View
The evaluation of declarative queries is performed at the data management layer, which relies on the underlying layers (storage management and overlay network). The services provided by these layers are very important because they limit or allow some flexibility in data management services. The querying facilities offered by data management services are closely related to the nature of the shared data and of the meta-data used. One can find services dedicated to a single type of data, for example [28], and rather generic systems that allow sharing various types of resources, such as [70]. The choice of meta-data used to describe and index resources determines the criteria that can be used to formulate queries. Some approaches use keywords [29,64,70] whereas others are based on attributes [34,27,76]. In most cases [15,29,74,34,27], meta-data are stored using the storage management layer. Data and meta-data are identified by a key obtained with a hash function. Such an identifier will be associated with a keyword [70,29], an attribute [29,28], a couple (attribute, value) [76], or a path in XML [27]. The equality operator is proposed by a wide range of systems [29,70]. More complex operators, like inequality, are much more difficult to implement because
of the “blind” storage of shared resources. The storage management layer deals with the placement of shared resources by using hash functions to decide which peer will ensure the storage. This approach, which does not consider any semantics, makes it impossible to determine the most useful peers to help resolve a query. Interesting solutions for queries containing inequality operators can be found in [27,33,76]. Even more complex operators, like join or sort, can be found in [74,34,60]. Massively distributed query processing systems come with new problems, including query optimization. To improve response time, proposals are based on indexes, caches [28,75], duplication [27] and materialized queries [29,75]. Evolving in a very large scale environment also leads to Information Retrieval (IR) techniques. As a huge number of objects are shared, returning the set of all objects satisfying a query may not be relevant, which is why research trends dealing with top-k operators are important. High scale also brings high semantic heterogeneity in data description, and finding ways to bridge those semantics is crucial. In the following we give an overview of four classes of systems that are representative of the data management layer. The first three classes are distinguished by the structure adopted for the data and/or the meta-data, whereas the fourth one is more related to IR trends.
– Systems using meta-data composed of keywords or (attribute, value) couples: KSS [29], MLP [70], PinS [77,75] and MAAN [15].
– Systems dealing with the relational model: PIER [33] and FCQ [74].
– Systems using the semi-structured or navigational model: DIP [28], KadoP [4], LDS [27].
– IR approaches: SPRITE [46], DHTop [8] and the work presented in [81].

3.2 Attribute Based Model
This first class of systems adopts simple meta-data composed of (attribute, value) couples. Queries expressed on these attributes allow semantic search of data. KSS and MLP propose query languages with equality conditions on attribute values, whereas PinS and MAAN also allow queries including inequality conditions. In addition to indexing techniques, other optimization techniques are also targeted.

A Keyword-Set Search System for Peer-to-Peer Networks (KSS) relies on DHash and Chord. The storage of objects and related meta-data is delegated to the storage management layer. Shared data have an identifier and can be described using meta-data that can be either keywords or (attribute, value) couples. Meta-data are used, one by one, to index the related objects in the DHT. Basically, the DHT contains entries that associate a keyword (e.g., keyword1) or a couple (e.g., attribute1 = value1) with a set of object identifiers. These entries
give respectively all objects registered in the system that satisfy keyword1 or have attribute1 = value1. Queries are conjunctions of meta-data and answers are comprehensive. KSS is one of the first works allowing the evaluation of equality queries over P2P DHT systems. It also proposes a systematic query materialization strategy that speeds up query processing at the expense of storage usage.

Multi-level Partitioning (MLP) is also a keyword indexing system, but it is built over the P2P SkipNet [30] system. Peers are structured in a hierarchy of groups with Nmax levels. Communication between groups is done using a broadcast process. This structure allows parallel evaluation of queries and the ability to return partial answers. A query is broadcast to all groups on the next level of the hierarchy. At each level, several groups continue the broadcast of the query. The propagation of the query stops at the last level. The answers from each group form a partial answer that can be returned on the fly to the peer that originally sent the query. Data are stored in the peer to which they belong and are not stored in the storage management layer. Only meta-data (keywords) are used to index data through the storage management layer. So data are published (indexed) but not stored in the DHT. MLP is a specific solution relying on the particularities of the underlying lookup service, SkipNet. Indeed, the hierarchical structure and broadcasting are the bases of query evaluation. This allows intensive use of parallelism during query evaluation. As a drawback, the number of contacted peers increases compared to a solution that does not rely on broadcasting. Because data are not stored in the DHT, the comprehensiveness of the answer cannot be guaranteed: data may have left with a peer while still being indexed in the system.

Peer to Peer interrogation and indexing (PinS) relies on PAST and Pastry.
Queries can be composed of conjunctions and disjunctions of equality and inequality criteria and of join-like operators [60]. Data can be either private or public. Data that are stored and indexed in the DHT are public, whereas data that are indexed by the DHT but stored in a non-shared storage system (not in the DHT) are called private. Several kinds of data indexing strategies in the DHT are proposed. The first one allows equality search and is similar to KSS. Three others are proposed to deal with queries composed of equality, inequality, and/or join terms. Those strategies are based on distributed indexes giving the values related to one attribute. For a single query, several evaluation strategies may exist, and one can be selected according to the execution context to enhance query processing (see Section 4). PinS uses traditional DHT functionalities. For some of those indexes, the problem of peer saturation may arise due to the size of the index. To overcome this, PinS proposes dynamic index fragmentation and distribution.
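The equality indexing used by KSS, and by the first PinS strategy, can be sketched on top of plain put and get operations. The sketch below is an assumption-laden illustration, not the implementation of either system: `ToyDHT` is an in-memory stand-in for the storage layer whose `put` adds to a collection, and `meta_key`, `publish` and `query` are hypothetical helper names. Each meta-data item (a keyword or an (attribute, value) couple) is hashed to a key, and a conjunctive query intersects the sets of object identifiers found under each key.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for the storage layer (assumption: put appends to a set)."""

    def __init__(self):
        self._store = {}

    def put(self, key, object_id):
        self._store.setdefault(key, set()).add(object_id)

    def get(self, key):
        return set(self._store.get(key, ()))


def meta_key(term) -> str:
    """Hash one meta-data item: a keyword or an (attribute, value) couple."""
    return hashlib.sha1(repr(term).encode("utf-8")).hexdigest()


def publish(dht, object_id, metadata):
    """Index the object once per meta-data item, as in one-by-one indexing."""
    for term in metadata:
        dht.put(meta_key(term), object_id)


def query(dht, terms):
    """Conjunctive equality query: intersect the entries of all terms."""
    result = None
    for term in terms:
        postings = dht.get(meta_key(term))
        result = postings if result is None else result & postings
        if not result:
            break  # an empty intersection cannot grow again
    return result or set()
```

For example, after publishing `doc1` with `["p2p", ("year", 2009)]`, the query `["p2p", ("year", 2009)]` retrieves it with one `get` per term. Inequality conditions cannot be answered this way, because hashing scatters neighboring values over unrelated keys.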
Multi-Attribute Addressable Network (MAAN) adopts the P2P approach for the discovery of shared resources in grids. Like PinS, queries are composed of equality and inequality criteria on meta-data. Shared resources are described using system oriented meta-data (e.g., name, operating system, CPU). Like other solutions, query evaluation is based on distributed indexes. This system uses a hashing function with a special property, namely a uniform locality preserving hashing function: it preserves order and ensures a uniform distribution. MAAN evaluates inequality terms with a strategy based on this ordered hashing function. Solutions based on an ordered hashing function have several drawbacks. Building the hashing function requires knowledge of the distribution of each attribute. For attributes with a large domain of values of which only a few are used, the evaluation process will require useless processing in many peers. The evaluation process relies on the successor function, which may not be a standard function in P2P systems and may not be simple to provide in some overlay networks.

3.3 Relational Model
The second group is composed of works which consider the P2P system as a distributed relational database (PIER, DHTop and FCQ). They allow the evaluation of queries using selection, projection, join and aggregation operators. Such systems combine traditional database algorithms and hashing functions.

Peer-to-Peer Infrastructure for Information Exchange and Retrieval (PIER) assumes that the underlying layers (storage and lookup services) include special functionalities. In particular, they must notify data migration caused by configuration modifications (join/leave of peers), and moreover they must allow deletion of data and group communication. This assumption allows PIER to offer new functionalities. For example, the management of temporary resources, where data are stored only for a predefined duration, can be used in complex query processing. PIER evaluates SQL queries on tuples stored in the P2P system. It defines 14 logical operators, 26 physical operators and three types of indexes to evaluate queries. A set of optimized algorithms based on symmetric hash join is proposed, and query evaluation may include re-hashing and/or temporary storage of intermediate results. PIER allows the use of inequality terms, and several indexes are proposed to enhance query processing; the Prefix Hash Tree (PHT) [61] is one example. Underlying DHT systems must provide the group concept, which represents the relation concept of the relational model, and temporary storage of data.

Framework for complex query processing (FCQ) is a framework which offers evaluation of relational operators like selection, projection, join and group by. Tuples of relational tables are stored and indexed in the underlying distributed storage service. The architecture of the system is based on a hierarchy of peers.
A super-peer layer, named range guards, is used to store copies of tuples; a simple peer only stores pointers to super-peers. FCQ uses a single index structure for the evaluation of all types of queries. It imposes special conditions on the underlying systems and on the hashing function, as it uses a super-peer layer and an order preserving hashing function. As a consequence, the evaluation of inequality terms is simple. FCQ, like MAAN, uses the successor function of the distributed storage service for query propagation. This has the same drawbacks as those already mentioned for MAAN.

3.4 Semi-structured or Navigational Model
This section deals with works on XML. Many recent proposals concern this kind of data. Among them are DIP, KadoP and LDS, presented in the following. Query processing optimizations, for example using Bloom filters [38,3], have been proposed, as well as some hints to reduce the size of indexing structures [13,24].

Data Indexing in Peer-to-Peer DHT Networks (DIP). Data are shared in the DHT system and are described through a descriptor. This descriptor is semi-structured XML data and has the form of an XPath query. It is used to generate a set of so-called “interesting queries” that index the data. The query language is a subset of XPath queries which are “complete paths” in the XPath terminology [35]. As optimizations, DIP proposes the creation of indexed queries and the use of caches (see Section 4). These optimizations allow, respectively, an iterative evaluation of queries and an improvement of the response time. Nevertheless, the mechanism used to choose the queries to be materialized may affect the performance and the coherency of the solution. DIP may be used on top of most DHTs, as it only uses simple functions (put and get).

KadoP [3,4] has been built over FreePastry⁴ [65] and uses Active XML [1]. It focuses on sharing XML, HTML or PDF documents as well as web services. These resources are described using DTD files (conforming to XML schemas [35]) or WSDL descriptions [78] for web services. Meta-data are structured using relations such as partOf and isA. Meta-data contain semantic concepts that enrich queries on data. Concepts and values are linked through the relatedTo relation. KadoP also uses namespaces to structure spaces of definitions. Meta-data are shared and stored in the P2P system. So data are indexed/published, but the original data are stored in the owning peer and not in the DHT. Published data are identified by URIs or URLs which give localization information. Different kinds of indexes are introduced. These indexes, registered in the DHT, allow the publication of namespaces, concepts, relations between concepts, sets of types in a namespace and, finally, the semantic links between two concepts. Together, these allow the evaluation of semantically rich queries. Indexing and evaluation are closely related to LDS, as shown in the next paragraph.

⁴ A free implementation of Pastry.
The main drawback of this solution is that the quantity of information used for indexing concepts and relations between concepts is huge and requires a very large amount of storage resources. Techniques to overcome this problem may be found in [3].

Locating Data Sources in Large Distributed Systems (LDS) is an indexing service for XML data that uses XPath as its query language. LDS relies on traditional functions, so it is independent of the underlying P2P layers. Data stay in the owning peer and are published and described using “resumes” (summaries). The first element of a resume is a set of paths to an attribute, the second is the set of values for this attribute and the third is the set of identifiers of the peers containing the data. Resumes may be compressed using techniques like histograms [52] and Bloom filters [43]. The scalability of this solution may be affected by the frequent use of specific meta-data: the peer in charge of storing such meta-data may become overloaded. Therefore, a fragmentation strategy is proposed to deal with this issue. LDS also proposes the duplication of resumes linked to frequently used meta-data, but this solution is kept as a last resort because coherency problems may arise. The proposed query processing reduces the volume of transferred data, as intermediate answers do not contain all the objects satisfying a query but only the identifiers of the peers storing such objects. As a drawback, this introduces an additional evaluation step to identify the relevant objects in these peers. The proposed indexing allows all types of queries to be evaluated in a homogeneous way, but the indexes contain a huge quantity of data. XPath for P2P (XP2P), presented in [13], deals with these problems and proposes to index only fragments of XML documents. In addition, LDS indexes are affected by the arrival or departure of peers, which may be a problem in very dynamic systems.

3.5 Information Retrieval Trends
Infrastructure to support exact query evaluation is an important step forward, but important challenges remain. One of them is dealing with the semantic heterogeneity coming from a large number of different communities brought together in a P2P system. The work in [81] deals with rich information retrieval in this context. This system does not try to offer exact answers. Instead, it seeks approximate answers, as in information retrieval systems. Users are looking for files “semantically close” to a file they already have. The proposed solution is based on vectors of terms for indexing and querying, and on an order preserving hashing function for evaluation. The classical recall and precision measures are used to assess the quality of an answer. SPRITE focuses on reducing the amount of storage space required to index documents. The idea is to ignore the indexing terms that are never used in queries. Another important trend concerns the huge number of shared objects and the potentially too large size of the answers returned for a simple query. To deal with this problem, DHTop [8] focuses on evaluating queries with a top-k [51] operator.
Like PIER, DHTop deals with relational data. Special indexes are introduced to deal with inequality and top-k answers. Since the top-k operator is important in large centralized databases, its study in P2P DHT systems appears to be very important as well.
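To make the role of a top-k operator concrete, here is a generic sketch. It is not the DHTop algorithm, whose indexes are not detailed here; it only assumes that each object lives on exactly one peer and carries a single numeric score. Under that assumption, each peer can send its k best objects, and the requester merges these short lists instead of receiving every matching object. The function names are hypothetical.

```python
import heapq
from itertools import chain


def local_top_k(scored_objects, k):
    """Best k (score, object_id) pairs held by one peer."""
    return heapq.nlargest(k, scored_objects)


def global_top_k(per_peer_objects, k):
    """Merge the local top-k lists of all peers into the global top-k.

    Correct under the stated assumption: any object in the global top-k
    is necessarily in the top-k of the peer that stores it.
    """
    candidates = chain.from_iterable(
        local_top_k(objects, k) for objects in per_peer_objects.values()
    )
    return heapq.nlargest(k, candidates)
```

With this scheme at most k pairs per peer travel over the network, whatever the number of objects satisfying the query.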
4 Query Processing Optimization
Sections 2 and 3 present core functionalities enabling P2P DHT systems to be used in applications such as web applications (search engines [55], P2P streaming [56]) as well as in the security domain (generation of digital certificates and intrusion detection [45]). In this style of applications, query processing optimization is essential. On the one hand, the systems in Section 2 improve maintenance processes and storage management to reduce the complexity of routing processes and resource usage. On the other hand, the systems in Section 3, to enhance query language capabilities, provide several kinds of evaluation strategies that take into account the query style and information about the system state. The use of optimizers to improve query performance is typical practice in data management systems. These optimizers use statistical information to support their decisions about the available structures, such as indexes and caches. This section focuses on elements supporting query processing optimization and describes them in a top-down way according to abstraction levels. Section 4.1 presents optimizers based on static optimization, Section 4.2 analyzes elements such as distributed caches and Section 4.3 is about statistics.

4.1 Optimizers
Research on optimizers for P2P systems faces several challenges. The major ones are the impossibility of having global catalogs, the large number of peers and their dynamicity, and the difficulty of building and maintaining statistics in these systems. The first part of this section briefly describes some works on P2P optimizers, and the second part highlights the optimization decisions made in some of the works described in Section 3.

P2P DHT Optimizers. Several works have analyzed and proposed different styles of P2P optimizers, ranging from classic optimizers to optimizers based on user descriptions.

Classic optimizers. [14] proposes a P2P optimizer that relies on the existence of super-peers and on the concept of classic optimizers, using distributed knowledge about the schemas in the network as well as distributed indexes. This work proposes several execution plans considered as optimal and characterized by costs calculated from statistical information. Indexes provide super-peers with the information they need to decide where to execute a sub query. Two places are possible: locally in the contacted super-peer or in another super-peer. The cost associated with each plan enables the selection of the best execution plan. This work does not consider the dynamicity of peers.
C. Roncancio et al.
Optimization defined by the user. Correlated Query Process (CQP) [17] and PIER [32] take the opposite approach to the first one. In these cases, the user bears all the responsibility for the query optimization process. CQP defines a template with information about the peers to contact and about the flows of data and processes. A query is executed using that template to run steps in parallel, with messaging used for synchronization. PIER, on the other hand, provides a language for explicitly defining a physical execution plan. This language, named UFL (Unnamed Flow Language), contains logical and physical operators used by an expert user to submit a query. These operators do not include information about the peer responsible for their execution. This decision is taken by PIER according to index information. In these proposals the user is the key to query optimization. PIER proposes a user interface to support the building of physical plans. Still, the assumption of user responsibility is very strong for inexpert users.

Range and join queries optimization. This part presents some conclusions about the optimization features included in the works of Section 3. Works are presented according to the type of query to be optimized.

Range queries. An interesting conclusion about range query optimization, based on the MAAN and Mercury works, is that term selectivity is important for deciding the order in which terms are executed. In fact, this order can affect the number of peers contacted. MAAN concludes that a term selectivity greater than 20% increases the number of contacted peers (which could reach 50% of all peers); the number of messages used by the strategy increases as well and could approach the number of messages used by a flooding strategy. In the same way, Mercury shows that selecting the term with the lowest selectivity reduces by 25% or 30% the number of contacted peers (and the number of messages) compared to a random term selection.
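The role of term selectivity in range queries can be illustrated with a minimal sketch. It assumes that the selectivity of a term is the fraction of the attribute domain covered by its range, that values are uniformly distributed, and that peers split each attribute domain evenly. The names RangeTerm, order_terms and estimated_peers are hypothetical and do not come from MAAN or Mercury.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeTerm:
    """One range condition, e.g. 0 <= price <= 50 (hypothetical structure)."""
    attribute: str
    low: float
    high: float
    domain_low: float
    domain_high: float

    def selectivity(self) -> float:
        # Fraction of the attribute domain covered by the range,
        # assuming values are uniformly distributed over the domain.
        return (self.high - self.low) / (self.domain_high - self.domain_low)


def order_terms(terms):
    """Return the terms sorted from lowest to highest selectivity."""
    return sorted(terms, key=lambda term: term.selectivity())


def estimated_peers(term, total_peers):
    """Rough number of peers covering the range if peers split the domain evenly."""
    return max(1, math.ceil(term.selectivity() * total_peers))


terms = [
    RangeTerm("year", 1975, 2000, 1900, 2000),  # covers 25% of its domain
    RangeTerm("price", 0, 50, 0, 400),          # covers 12.5% of its domain
]
first = order_terms(terms)[0]
print(first.attribute, estimated_peers(first, 1000))
```

Under these assumptions, routing the query on the term with the lowest selectivity contacts the fewest peers; the remaining terms can then be checked locally on the candidate objects.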
PierSearch [47] (cf. Section 4.2) explores the possibility of combining a DHT system with an unstructured P2P system to obtain a more complete answer and to improve the execution time. It shows that the answer recall increases according to the threshold used to identify rare objects. In particular, when the number of objects in a query result is 3, the answer recall is 95%, and with 10 objects it is 99%. In other cases, queries can be executed more efficiently using an unstructured P2P system.

Join queries. Works such as PIER and PinS highlight interesting results about the use of network resources in queries containing join terms. PIER [33] proposes four join algorithms: Symmetric Hash Join (SHJ), Fetch Matches Join (FMJ), Symmetric Hash Semi Join and Bloom Filter - Symmetric Hash Join. It concludes that SHJ is the most expensive algorithm in terms of network resources, that FMJ consumes an average of 20%, and that the optimizations of SHJ use fewer resources, although this depends on the selectivity of the conditions on the smaller relation. PinS [60] complements the PIER work by proposing three join query evaluation strategies: Index Based Join (IBJ), Nested Loop Join (NLJ) and NLJ with reduction in the search space.
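As an illustration of the first algorithm named above, the following is a minimal single-process sketch of the textbook symmetric hash join. It is not PIER's distributed implementation: how tuples reach the joining peer (for example by rehashing both relations on the join key) is left out, and the function name and stream format are assumptions made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(stream, left_key, right_key):
    """Join two interleaved tuple streams on a key.

    `stream` yields (side, row) pairs, where side is "L" or "R".
    Each arriving row is stored in the hash table of its own side and
    immediately probed against the table of the other side, so results
    are produced as soon as both matching rows have been seen.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    key_of = {"L": left_key, "R": right_key}
    for side, row in stream:
        other = "R" if side == "L" else "L"
        key = key_of[side](row)
        tables[side][key].append(row)
        for match in tables[other].get(key, ()):
            # Always emit pairs in (left row, right row) order.
            yield (row, match) if side == "L" else (match, row)


arrivals = [
    ("L", (1, "a")),
    ("R", (1, "x")),
    ("R", (2, "y")),
    ("L", (2, "b")),
    ("L", (3, "c")),
]
joined = list(symmetric_hash_join(arrivals, lambda r: r[0], lambda r: r[0]))
```

Because both inputs are hashed in full, every tuple of both relations has to be shipped to a joining peer in a distributed setting, which is consistent with SHJ being reported as the most network-intensive variant.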
PinS concludes that IBJ needs, in most cases, fewer messages. When the join cardinality is less than the size of the search space, NLJ is more efficient.

4.2 Cache
The literature reports specific proposals on caching for P2P DHT systems [36,47,72,10] as well as adaptable and context-aware caching services [23,41,53] that can be used in such systems. This section first presents cache solutions implemented in P2P DHT systems and then some proposals in which DHT systems themselves are used as a cache.

DHT Systems using cache. At the distributed storage service level, works such as DHash [16] and PAST [67] (see Section 2) propose a cache to speed up queries on popular objects. In those works the load distribution on peers is improved to reduce answer times. In the distributed data management service layer, works such as DCT [72], PinS [75,59] and CRQ [68] improve the answer time by reducing the processing cost of previously executed queries. These proposals reduce traffic consumption and enable the use of a semantic cache for queries composed of equality and inequality terms. Although there are cache proposals in both the distributed storage service and the distributed data management service layers, it is difficult to reuse these solutions in new proposals: it is not clear how to make heterogeneous caches cooperate, and the proposals do not give information about the context in which they are used. The rest of this part briefly describes some representative works.

PAST and DHash are quite similar in their behavior. They cache popular objects using the space available on peers. PAST takes advantage of the peers contacted during a routing process to cache objects. An object in the PAST cache is a pair composed of a key and a set of identifiers related to this key. A peer contacted during routing first searches its local cache for an entry that answers the request. If no cache entry is found, the normal routing process continues. Cache entries are evicted using a GD-S [67] policy and can be discarded at any time.
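The eviction policy mentioned for PAST can be sketched as follows, assuming that GD-S denotes the GreedyDual-Size policy in its usual formulation: each entry receives a priority equal to a global inflation value plus cost divided by size, the entry with the lowest priority is evicted, and the inflation value is then raised to the priority of the evicted entry. The class and parameter names are hypothetical, and the uniform cost of 1 is an assumption of this example.

```python
class GreedyDualSizeCache:
    """Minimal GreedyDual-Size cache; sizes must be positive numbers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.inflation = 0.0   # global value raised at each eviction
        self.entries = {}      # key -> (value, size, cost, priority)

    def _priority(self, size, cost):
        return self.inflation + cost / size

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, size, cost, _ = entry
        # A hit refreshes the priority with the current inflation value.
        self.entries[key] = (value, size, cost, self._priority(size, cost))
        return value

    def put(self, key, value, size, cost=1.0):
        if size > self.capacity:
            return False       # the object can never fit
        if key in self.entries:
            self.used -= self.entries.pop(key)[1]
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][3])
            self.inflation = self.entries[victim][3]
            self.used -= self.entries.pop(victim)[1]
        self.entries[key] = (value, size, cost, self._priority(size, cost))
        self.used += size
        return True


cache = GreedyDualSizeCache(capacity=10)
cache.put("big", "large object", size=8)
cache.put("small", "small object", size=2)
cache.get("small")
cache.put("new", "another object", size=4)   # evicts "big", the lowest priority
```

With a uniform cost, large objects get low priorities and are evicted first, which favours keeping many small popular objects in the unused space of a peer.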
DIP (see Section 3) uses a query cache whose entries contain the query with its identifier and all the descriptors associated with the objects shared in the system. These entries are registered in the peer contacted to evaluate the submitted query or in all the peers visited during the query evaluation. In a search process, the local cache is visited before going to the peer responsible for the identifier of the submitted query. DIP proposes LRU as replacement policy.

CRQ provides a cache for range queries using CAN [63] as a distributed lookup service. CRQ has two types of exclusive entries. The first one holds the row identifiers of a specific query. The second one contains information about the peers that store answers to a query in their cache. Once a query is submitted to a
peer, it searches in the peers associated with the complete query or with a sub query. Consequently, either the complete answer to the submitted query or a partial answer, used to calculate the total answer, can be found in the distributed cache. CRQ guarantees strong coherence by updating the cache when new objects are inserted in the system. Moreover, CRQ proposes the use of LRU as a replacement policy.

PinS proposes two query caching solutions. The first one [75] covers queries that include only equality terms, whereas the second one [59] also covers range queries. Unlike other cache proposals, the first PinS cache is a prefetch cache that uses statistics about query frequency to identify the most popular queries. These queries are the candidates to be stored in the cache. When a new object arrives in the system, all entries containing queries that match the new object are updated, which provides strong coherence. A cache entry is indexed by a term identifier and contains a query together with all the objects shared in the system that satisfy its conditions. This tuple is stored in the peer responsible for the identifier of the first term of the query, in alphabetical order. When an equality query is submitted, PinS contacts all the peers responsible for the terms included in the query. These peers search their local caches for queries related to the requested term. When a peer finds a query closest to the one required by the user, it sends the cache entry as its answer, instead of the answer related to the term that it manages. The second cache type [59] includes the evaluation of range terms. In this case, a cache entry is indexed by the query identifier. The cache entry contains information about a query and a pair atr_a = val_j together with the identifiers of the objects associated with the attribute value. Queries are normalized to find the pair that represents the query. The cache entries make it possible to detect exact and partial matches between queries and the cache.
Each entry uses a TTL to determine its freshness, as well as a Time to Update. Both PinS proposals use LRFU as replacement policy.

DCT identifies frequent queries as candidates to be stored in the cache. These queries are stored as entries in a local peer only if storage space is available. Otherwise, the result set of these queries is shortened: the top-k answers are selected and stored in the cache. Finally, if the space is still not sufficient, a replacement policy is used to identify the cache entry to be deleted. There are two kinds of cache entries. The first one is indexed by a term contained in the requested query. This term is selected at random. The entry contains all or part of the document identifiers that answer the submitted query. The second one is indexed by the document identifier (URI) and contains a frequent query together with all the URIs satisfying the query. When a query is submitted to the system, it is decomposed into sub queries to identify peers whose cache holds information that can answer the query. When the cached information is not sufficient to answer a query, a broadcast is made to all the peers responsible for one of the query terms.

DHT Systems used as cache. Squirrel [36] and PierSearch [47] are examples of DHT systems used as cache. Squirrel tries to reduce the use of network
resources and PierSearch improves the answer recall. It is difficult to generalize a cache proposal because such proposals are coupled with the core systems and because some important issues, such as object placement, replacement policy and coherence models, are not described or studied. The rest of this part briefly describes some representative works.

Squirrel enhances searches on web pages. In the same way as CRQ, it proposes two storage techniques. Both of them use the URL of a web page to generate object identifiers. In the first technique, a cache entry is indexed by URL and contains the web page identified by that URL. In the second one, a cache entry is composed of information distributed over several peers. The peers that request a web page, named delegate peers, store cache entries of the first type. Additionally, the peer (Pr) responsible for the URL key stores information about the delegate peers that hold information about the requested web page in their caches. When a web page is requested, the browser searches for the page in its local cache. When the page is not found or the stored version is not adequate, Squirrel is contacted to search for the page. Squirrel searches Pr for the web page or for information about delegate peers. Finally, when a miss occurs, Squirrel fetches the web page from the server. Squirrel uses TTL as freshness technique and LRU as replacement policy.

PierSearch improves query evaluation on objects considered as rare for Gnutella, using a P2P DHT system. Several strategies are studied to classify an object as rare. For example, an object is considered rare if the frequency of use of one meta-data item (associated with this object) or the number of query answers (including this object) is lower than a threshold. Two types of cache entries are proposed. The first one, indexed by all the object information, such as name, size, IPAddress and PortNumber, contains all this information.
The second one is indexed by one meta-data item and contains all the meta-data associated with one object, together with its object identifier. PierSearch is used when a rare object is requested in Gnutella. In this case, the search is made using PierSearch and not the typical Gnutella search. TTLs are used as in Squirrel.

4.3 Statistics
Statistics enable optimizers to decide on the best execution plan, including decisions about the peers to contact and the physical algorithms to use, as well as about the creation of new indexes or the inclusion of new terms in an index. In a general context, a statistic is a piece of information about a system that includes a stored value and a time to live. Statistics management is a major difficulty. Consequently, a process must be provided to update statistic values and to identify the events that affect a statistic. Notification mechanisms such as publish/subscribe are typically used to communicate events to the peers involved in statistics management. Several works tackle issues about statistics in P2P systems. Still, the number of such works is low compared to the number of works on declarative queries on P2P DHT systems. The main topics analyzed in statistics research concern decisions about what statistics will
be gathered and how to provide statistics management. This section focuses on the second topic because it is closest to declarative queries, and presents some works with different characteristics to illustrate the research on statistics in the P2P context. Works on statistics in P2P DHT systems can be evaluated according to distinct quality attributes [54]. The most important attributes are efficiency, scalability, load balance and accuracy. Efficiency is related to the number of peers contacted during the calculation of a statistic value. Scalability depends on the way the calculation process is distributed; load balance concerns increasing the robustness of the statistics management process through storage and access policies; and accuracy affects the quality of the decisions made using statistic values.

Mercury [11] evaluates queries composed of several attributes, as described in Section 3. Mercury uses statistics to balance the storage load as well as to estimate the selectivity of queries. It gathers information about the data distribution and the number of peers classified by range values. Additionally, it uses random sampling as its statistics management strategy to maintain local histograms.

PISCES [79] provides data management based on BATON (see Section 2). Unlike other works, PISCES proposes a mechanism to dynamically identify the terms to be indexed in the system. The use of statistics enables this dynamic indexing. In fact, PISCES gathers information about the number of peers in the system, average churn rates, the number of all executed queries, and the query distribution. PISCES uses a gossip protocol [54] to maintain local histograms.

Smanest [58] is a P2P service that, unlike the previous proposals, provides statistics management on top of P2P DHT systems. It enables the creation, maintenance and deletion of different kinds of statistics, decoupled from the P2P DHT system.
At the same time, Smanest provides uniform access to statistics and guarantees a solution that is transparent with respect to statistics placement. An update process is proposed that lets the user choose the kind of update policy. Two types of update policies are provided: immediate and deferred. The immediate policy updates the statistic value as soon as an event changes it, whereas the deferred policy defines a time period or a number of events that triggers the update process. In these two deferred cases, the statistical accuracy can be affected. Smanest proposes an automatic mechanism for statistics management, based on event monitoring, and for efficient event notification.

In fact, collecting statistics in P2P systems is a crucial and hard task. It is crucial because statistics are needed to achieve good performance, and hard because the usual ways of dealing with them are not well suited to P2P systems. The same kinds of challenges appear for privacy concerns, which are the focus of the next section.
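The immediate and deferred update policies described for Smanest can be illustrated with a minimal sketch. The Statistic class, its parameters and the additive update rule are assumptions made for the example and do not reflect the actual Smanest interface.

```python
import time


class Statistic:
    """A counter-like statistic with an immediate or deferred update policy."""

    def __init__(self, name, policy="immediate", max_events=None,
                 period=None, clock=time.monotonic):
        self.name = name
        self.policy = policy            # "immediate" or "deferred"
        self.max_events = max_events    # deferred: flush after this many events
        self.period = period            # deferred: flush after this many seconds
        self.clock = clock
        self.value = 0
        self.pending = []               # events observed but not yet applied
        self.last_update = clock()

    def on_event(self, delta):
        """Record an event and apply pending events if the policy says so."""
        self.pending.append(delta)
        if self._should_update():
            self.value += sum(self.pending)
            self.pending.clear()
            self.last_update = self.clock()

    def _should_update(self):
        if self.policy == "immediate":
            return True
        if self.max_events is not None and len(self.pending) >= self.max_events:
            return True
        if self.period is not None:
            return self.clock() - self.last_update >= self.period
        return False


immediate = Statistic("objects", policy="immediate")
immediate.on_event(1)

deferred = Statistic("objects", policy="deferred", max_events=3)
for _ in range(2):
    deferred.on_event(1)
stale_value = deferred.value    # still 0: two events are pending
deferred.on_event(1)            # third event triggers the update
```

Between two deferred updates the stored value lags behind the real one, which is the loss of accuracy mentioned above; the injectable clock only serves to make the time-based policy easy to exercise.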
5 Data Privacy
This section analyses privacy issues addressed by P2P DHT systems. It first introduces the data privacy problem and then discusses access restriction (Section 5.2), anonymity (Section 5.3) and trust techniques (Section 5.4).

5.1 Rationale
Currently, the democratization of the use of information systems and the massive digitalization of data make it possible to identify all aspects of a person’s life: for instance, their professional performance (e.g., publish or perish software, dblp website), their client profile (e.g., thanks to fidelity smart cards), their user profile (e.g., thanks to their user identity, access localities, generated traffic) or their health level (e.g., thanks to the digitalization of medical records). These issues have given rise to serious data privacy concerns. P2P data sharing applications, due to their open and autonomous nature, highly compromise the privacy of data and peers. The P2P environment can be considered hostile because peers are potentially untrustworthy. Data can be accessed by everyone and used for everything (e.g., profiling, illegal competition, cheating, marketing or simply for activities against the owner’s preferences or ethics), and a peer’s behavior and identity can be easily revealed. It is therefore necessary, in addition to other security measures, to ensure peer anonymity, improve data access control and use or adapt trust techniques in P2P systems. To limit the pervasiveness of privacy intrusion, efforts are being made in data management systems [57]. Concerning P2P systems, several proposals take into account security issues (integrity, availability and secrecy) that are closely related to privacy. For instance, in [9], the authors discuss various P2P content distribution approaches addressing secure storage, secure routing and access control; [71] analyzes security issues in P2P DHT systems and describes the security considerations that arise when peers in the DHT system do not follow the protocol correctly, particularly the secure assignment of node IDs, the secure maintenance of routing tables and the secure forwarding of messages; and [48] looks into the security vulnerabilities of overlay networks.
Nevertheless, very few works propose solutions to preserve privacy. In [12] a comparison of several unstructured P2P systems is made. The comparison focuses on four criteria, namely, uploader anonymity, downloader anonymity, linkability (correlation between uploaders and downloaders) and content deniability (possibility of denying knowledge of the content transmitted/stored). [9] also presents an analysis of authentication and identity management. It is worth noting that most of these works concern unstructured P2P systems, which are outside the scope of this paper. This section discusses the three data sharing P2P DHT systems that, to our knowledge, address data privacy: PAST, OceanStore and PriServ. PAST [67], as mentioned in Section 2, is a distributed storage utility located in the storage management layer (see Figure 1). Besides providing persistence,
high availability, scalability and load balancing, it focuses on security issues. In PAST, peers are not trusted (except requester peers) and the proposal limits the potentially negative impact of malicious peers. OceanStore [44] is a utility infrastructure designed for global-scale persistent storage, which relies on Tapestry [80]. It was designed to provide security and high data availability. As in PAST, server peers are not trusted. PriServ [39,40] is the privacy service of the APPA infrastructure [7]. It was designed to prevent malicious data access by untrusted requesters. APPA has a network-independent architecture that can be implemented over various structured and super-peer P2P networks. The PriServ prototype uses Chord [73] as its overlay network, as do KSS, MAAN and FCQ (see Section 2). OceanStore and PriServ provide distributed data management services (see Figure 1). They were not analyzed in the preceding sections of this paper because they do not propose particular querying mechanisms: they use a basic hash key search. However, OceanStore and PriServ propose interesting solutions to enforce data privacy. The next sections analyze the three aforementioned systems with respect to access restrictions, anonymity support and trust management.

5.2 Access Restriction
In OceanStore, non-public data is encrypted and access control is based on two types of restrictions: reader and writer restrictions. For the reader restriction, data are encrypted (with symmetric keys) to prevent unauthorized reads. Encryption keys are distributed to users with read permission. To revoke the read permission, the data owner requests that the replicas be deleted or re-encrypted with a new key. A malicious reader is still able to read old data from cached copies or from misbehaving servers that fail to delete or re-key. This problem is not specific to OceanStore: even in conventional systems there is no way to force a reader to forget what has been read. To prevent unauthorized writes, writes must be signed so that well-behaved servers and clients can verify them against an access control list (ACL). The owner of an object can choose the ACL for the object by providing a signed certificate. ACLs are publicly readable so that server peers can check whether a write is allowed. Thus, servers restrict writes by ignoring unauthorized updates.

In PriServ, the access control approach is based on Hippocratic databases [6], where access purposes are defined to restrict data access. Additionally, the PriServ authors propose to take into account the operation (read, write, disclosure) that will be applied. The idea is that, in order to obtain data, peers specify the purpose and the operation of the data request. This explicit request commits clients to use data only for the specified purposes and operations. Legally, this commitment may be used against malicious clients if data is used for other purposes or operations. To enforce data access control, purposes and operations are included in the generation of data keys. For this, a publicly known hash function hashes the data
reference, the access purpose and the operation (these parameters are assumed to be known by peers). Thus, the same data with different access purposes and different operations have different keys. According to the authors, previous studies have shown that considering 10 purposes is enough to cover a large number of applications. In addition, only 3 operations are considered.

Server peers are untrusted in PriServ. Two functions to distribute data are proposed: publishReference() and publishData(). With the first one, owner and data references are distributed in the system but not the data themselves. When a peer asks for data, server peers return only the data owner reference. Requester peers must contact data owners to obtain the data. With the second function, encrypted data is distributed. When a peer requests data, server peers return the encrypted data and the reference of the data owner, who stores the decryption key. Owner peers use public-key cryptography to send decryption keys to requester peers. In PriServ a double access control is performed during data requesting. Servers perform an access control based on the information received during data distribution. A more sophisticated access control is performed by owners where, in particular, the trust levels of requester peers are verified. Malicious acts by servers are limited because, for each request, peers contact data owners to obtain data (if the data have been published with the publishReference() function) or encryption keys (if the publishData() function has been used).

In several systems, the lack of authentication is overcome by distributing the encryption keys necessary for accessing content to a subset of privileged users. As in OceanStore and PriServ, in PAST, users may encrypt their data before publishing. This feature leads to the deniability of stored content because nodes storing or routing encrypted data cannot know their content.
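The PriServ key generation described earlier in this section, where the data reference, the access purpose and the operation are hashed together, can be sketched as follows. The choice of SHA-1, the separator byte, the 160-bit key space and the example purpose names are assumptions made for this illustration and are not taken from the PriServ papers.

```python
import hashlib

# The three operations named in the text.
OPERATIONS = ("read", "write", "disclosure")


def data_key(data_ref: str, purpose: str, operation: str, bits: int = 160) -> int:
    """Derive a DHT key from (data reference, purpose, operation).

    Any peer that knows the three parameters can recompute the key with the
    public hash function; a different purpose or operation gives another key.
    """
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    # A separator avoids ambiguous concatenations such as "ab"+"c" vs "a"+"bc".
    material = "\x1f".join((data_ref, purpose, operation)).encode("utf-8")
    digest = hashlib.sha1(material).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


research_read = data_key("patients.records", "research", "read")
marketing_read = data_key("patients.records", "marketing", "read")
research_write = data_key("patients.records", "research", "write")
```

Under these assumptions, a requester that states a purpose or an operation for which the data was never published computes a key under which nothing is stored, so the lookup itself acts as a first access filter.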
Deniability protects router and server privacy because these peers are not responsible for the content they transfer or store.

5.3 Anonymity
In [22], the authors underline the need for anonymity in P2P systems. Anonymity can enable censorship resistance, freedom of speech without fear of persecution, and privacy protection. They define four types of anonymity: 1) author anonymity (which users created which documents?), 2) server anonymity (which nodes store a given document?), 3) reader anonymity (which users access which documents?) and 4) document anonymity (which documents are stored at a given node?). In PAST, each user holds an initially unlinkable pseudonym in the form of a public key. The pseudonym is not easily linked to the user’s actual identity. If desired, a user may have several pseudonyms to obscure the fact that certain operations were initiated by the same user. PAST users do not need to reveal their identity or the files they are retrieving, inserting or storing. PriServ and OceanStore do not use anonymity techniques. In systems where anonymity is not taken into account, it is possible to know which peer may store which data. This knowledge facilitates data censorship [25,31,69].
5.4 Trust Techniques
[50] discusses the complex problem of trust and reputation mechanisms and also analyzes several distributed systems and some unstructured overlay systems. In PAST, requester peers trust owner and server peers thanks to a smartcard held by each node that wants to publish data in the system. A private/public key pair is associated with each card. Each smartcard’s public key is signed with the smartcard issuer’s private key for certification purposes. The smartcards generate and verify various certificates used during insert and reclaim operations, and they maintain a secure storage quota system. A smartcard provides the node ID for an associated PAST node. The node ID is based on a cryptographic hash of the smartcard’s public key. The smartcard of a user wishing to insert a file into PAST issues a file certificate. The certificate contains, among other things, a cryptographic hash of the file’s contents (computed by the requester node) and the fileId (computed by the smartcard).

In PriServ, every peer has a trust level. An initial trust level is defined depending on the quality of peers; it then evolves depending on the peer’s reputation in the system. There is no global trust level for a peer: the perception of the trustworthiness of peer A may differ between peer B and peer C. Locally, peers have a trust table that contains the trust level of some peers in the system. If a server peer does not know the trust level of the requesting peer locally, it asks its friends (peers having a high trust level in its trust table), and if a friend does not have the requested trust level, it in turn asks its own friends, in a flooding limited by a TTL (time to live). If a peer does not have friends, it requests the trust level from all the peers in its finger table (log(N) peers).
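The PriServ trust lookup described above can be sketched as a TTL-limited search through friends. The sketch is a sequential, in-memory simulation with hypothetical names (Peer, trust_level_of, friend_threshold). It assumes that the first answer found is returned, since the text does not say how several answers would be combined, and it lets any peer without friends fall back to its finger table.

```python
from typing import Optional


class Peer:
    """A peer with a local trust table and a finger table (simplified)."""

    def __init__(self, name, friend_threshold=0.8):
        self.name = name
        self.friend_threshold = friend_threshold
        self.trust_table = {}   # Peer -> trust level, this peer's local view
        self.fingers = []       # peers of the finger table

    def friends(self):
        """Peers whose local trust level is considered high."""
        return [peer for peer, level in self.trust_table.items()
                if level >= self.friend_threshold]

    def trust_level_of(self, target, ttl=3, visited=None) -> Optional[float]:
        """Return a trust level for `target`, or None if nobody reachable knows it."""
        visited = set() if visited is None else visited
        visited.add(self)
        if target in self.trust_table:
            return self.trust_table[target]   # known locally
        if ttl == 0:
            return None                       # the request is not forwarded further
        # Ask friends; a peer without friends asks its finger table instead.
        contacts = self.friends() or self.fingers
        for peer in contacts:
            if peer in visited or peer is target:
                continue
            level = peer.trust_level_of(target, ttl - 1, visited)
            if level is not None:
                return level
        return None


a, b, c, d, e = (Peer(name) for name in "abcde")
a.trust_table[b] = 0.9    # b is a friend of a
b.trust_table[c] = 0.9    # c is a friend of b
c.trust_table[d] = 0.4    # only c knows the requester d
e.fingers = [c]           # e has no friends, only finger-table entries
```

In this example, a server peer a learns the trust level of d through b and c when the TTL allows two hops, and obtains no answer with a TTL of one.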
6 Conclusion and Perspectives
P2P DHT systems clearly provide a powerful infrastructure for massively distributed data sharing. This paper first introduced a global functional architecture that distinguishes the underlying distributed lookup services, which maintain the P2P overlay network, then a distributed storage service layer providing data persistency (involving data migration, replication and caching), and finally a layer consisting of high-level data management services such as declarative query processors. The paper then extensively discussed query and data privacy support. These discussions reveal some maturity in querying solutions but also deficiencies in providing data privacy. This paper considered many proposals on query processing on top of P2P DHT systems, covering a large variety of data models and query languages. Such proposals allow queries on shared data in a “P2P database” style as well as with less structured approaches, that is, either by providing querying support on meta-data associated with potentially any object (whatever its type), or by using approximate queries in a “P2P information retrieval” style. Optimization issues have been analyzed as they are crucial in developing efficient and scalable P2P query processors. As such query processors operate
in dynamic, massively distributed systems without global control or complete statistics, query optimization becomes very hard. For almost every operator of the query language it is necessary to find a specific optimization solution. Join operators, range queries and top-k queries have been given particular attention because their evaluation can be extremely time and resource consuming. Caching issues have also been presented, mainly as an approach to improving data management services. Several operational solutions exist, but the management of stale data is not yet optimal. The last topic discussed in this paper was data privacy support. This major issue has surprisingly received little attention until now in P2P DHT systems (most efforts concern unstructured P2P systems). Some access restriction techniques, anonymity support and trust management have been proposed, but more complete proposals are required. Providing appropriate privacy is certainly essential to allow more applications (particularly industrial ones) to rely on P2P data sharing. Thus this important issue has several interesting perspectives and challenging open problems. For example, more effort is needed to prevent and limit data privacy intrusion, but also to verify that data privacy has been preserved. Verification can be made with auditing mechanisms based on techniques such as secure logging [18] of data access or even watermarking [5]. An auditing system should detect violations of privacy preferences, and punishments (or rewards) for misbehaving (or honest) peers should be considered. It is important to note the existence of a tradeoff between anonymity and data access control: verification can hardly be implemented in environments where everyone is anonymous. Other privacy issues concern the distributed storage services, for instance, how to avoid storing data on peers that data owners consider untrustworthy.
Currently, data are stored on the live peers whose key is the closest one to the data key (e.g., the successor function of Chord). It could be interesting to define DHTs in which data owners can influence the distribution of their data. Further important research perspectives concern data consistency and system performance. Data consistency issues are outside the scope of this paper but are nevertheless important in data sharing systems. Data shared through P2P DHT systems have been considered as read-only by the majority of proposals. One of the main reasons is the impossibility of guaranteeing update propagation to all peers holding a copy of the data (disconnected peers could keep stale copies). The literature reports recent proposals on data consistency support in P2P DHT systems, but this aspect needs further research. System performance also reveals some topics for research. One problem is cross-layer optimization, which combines the optimization solutions implemented by the distributed storage services and the data management services. For example, such services may propose various caching approaches that should be “coordinated” to tune the system. Another important aspect is better support for peer volatility and dynamic system configuration by the data management services. Current query supports implicitly consider P2P systems with a low churn rate. Such supports could become inefficient or not work when there is a high churn
348
C. Roncancio et al.
rate in the system. Data management services should therefore be highly adaptable and context aware. This will be even more important if mobile devices participate in the system. Lastly, the issue of semantic heterogeneity of shared data is important in all data sharing systems and is even more accentuated in P2P environments where peers are extremely autonomous. There are certainly other important unsolved problems but the aforementioned ones already will lead to much exciting research. Acknowledgement. Many thanks to Solveig Albrand for her help with this paper.
References

1. Abiteboul, S., Benjelloun, O., Manolescu, I., Milo, T., Weber, R.: Active XML: A Data-Centric Perspective on Web Services. In: Demo Proc. of Int. Conf. on Very Large Databases (VLDB), Hong Kong, China (August 2002)
2. Abiteboul, S., Dar, I., Pop, R., Vasile, G., Vodislav, D.: EDOS Distribution System: a P2P Architecture for Open-Source Content Dissemination. In: IFIP Working Group on Open Source Software (OSS), Limerick, Ireland (June 2007)
3. Abiteboul, S., Manolescu, I., Polyzotis, N., Preda, N., Sun, C.: XML Processing in DHT Networks. In: Int. Conf. on Data Engineering (ICDE) (April 2008)
4. Abiteboul, S., Manolescu, I., Preda, N.: Sharing Content in Structured P2P Networks. In: Journées Bases de Données Avancées, Saint-Malo, France (October 2005)
5. Agrawal, R., Haas, P., Kiernan, J.: A System for Watermarking Relational Databases. In: Int. Conf. on Management of Data (SIGMOD), San Diego, California, USA (June 2003)
6. Agrawal, R., Kiernan, J., Srikant, R., Xu, Y.: Hippocratic Databases. In: Int. Conf. on Very Large Databases (VLDB), Hong Kong, China (August 2002)
7. Akbarinia, R., Martins, V., Pacitti, E., Valduriez, P.: Design and Implementation of APPA. In: Baldoni, R., Cortese, G., Davide, F. (eds.) Global Data Management. IOS Press, Amsterdam (2006)
8. Akbarinia, R., Pacitti, E., Valduriez, P.: Processing Top-k Queries in Distributed Hash Tables. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 489–502. Springer, Heidelberg (2007)
9. Androutsellis-Theotokis, S., Spinellis, D.: A Survey of Peer-to-Peer Content Distribution Technologies. ACM Computing Surveys 36(4) (2004)
10. Artigas, M.S., López, P.G., Gómez-Skarmeta, A.F.: Subrange Caching: Handling Popular Range Queries in DHTs. In: Hameurlain, A. (ed.) Globe 2008. LNCS, vol. 5187, pp. 22–33. Springer, Heidelberg (2008)
11. Bharambe, A., Agrawal, M., Seshan, S.: Mercury: Supporting Scalable Multi-Attribute Range Queries. In: Int. Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), Portland, Oregon, USA, August-September (2004)
12. Blanco, R., Ahmed, N., Sung, D.H.L., Li, H., Soliman, M.: A Survey of Data Management in Peer-to-Peer Systems. Technical Report CS-2006-18, University of Waterloo (2006)
13. Bonifati, A., Cuzzocrea, A.: Storing and Retrieving XPath Fragments in Structured P2P Networks. Data & Knowledge Engineering 59(2) (2006)
14. Brunkhorst, I., Dhraief, H., Kemper, A., Nejdl, W., Wiesner, C.: Distributed Queries and Query Optimization in Schema-Based P2P-Systems. In: Int. Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P), Berlin, Germany (September 2003)
15. Cai, M., Frank, M., Chen, J., Szekely, P.: MAAN: A Multi-Attribute Addressable Network for Grid Information Services. In: Int. Workshop on Grid Computing (GRID), Phoenix, Arizona (November 2003)
16. Cates, J.: Robust and Efficient Data Management for a Distributed Hash Table. Master thesis, Massachusetts Institute of Technology, USA (May 2003)
17. Chen, Q., Hsu, M.: Correlated Query Process and P2P Execution. In: Hameurlain, A. (ed.) Globe 2008. LNCS, vol. 5187, pp. 82–92. Springer, Heidelberg (2008)
18. Chong, C.N., Peng, Z., Hartel, P.H.: Secure Audit Logging with Tamper-Resistant Hardware. In: Int. Conf. on Information Security (SEC), Athens, Greece (May 2003)
19. Costa, G.D., Orlando, S., Dikaiakos, M.D.: Multi-set DHT for Range Queries on Dynamic Data for Grid Information Service. In: Hameurlain, A. (ed.) Globe 2008. LNCS, vol. 5187, pp. 93–104. Springer, Heidelberg (2008)
20. Dabek, F., Kaashoek, M., Karger, D., Morris, R., Stoica, I.: Wide-area Cooperative Storage with CFS. In: Int. Symposium on Operating Systems Principles (SOSP), Banff, Canada (October 2001)
21. Dabek, F., Zhao, B.Y., Druschel, P., Kubiatowicz, J., Stoica, I.: Towards a Common API for Structured Peer-to-Peer Overlays. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735. Springer, Heidelberg (2003)
22. Daswani, N., Garcia-Molina, H., Yang, B.: Open Problems in Data-Sharing Peer-to-Peer Systems. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 1–15. Springer, Heidelberg (2002)
23. d’Orazio, L., Jouanot, F., Labbé, C., Roncancio, C.: Building Adaptable Cache Services. In: Int. Workshop on Middleware for Grid Computing (MGC), Grenoble, France (November 2005)
24. Dragan, F., Gardarin, G., Nguyen, B., Yeh, L.: On Indexing Multidimensional Values in A P2P Architecture. In: French Conf. on Bases de Données Avancées (BDA), Lille, France (2006)
25. Endsuleit, R., Mie, T.: Censorship-Resistant and Anonymous P2P Filesharing. In: Int. Conf. on Availability, Reliability and Security (ARES), Vienna, Austria (April 2006)
26. Furtado, P.: Schemas and Queries over P2P. In: Andersen, K.V., Debenham, J., Wagner, R. (eds.) DEXA 2005. LNCS, vol. 3588, pp. 808–817. Springer, Heidelberg (2005)
27. Galanis, L., Wang, Y., Jeffery, S., DeWitt, D.: Locating Data Sources in Large Distributed Systems. In: Int. Conf. on Very Large Databases (VLDB), Berlin, Germany (September 2003)
28. Garcés-Erice, L., Felber, P., Biersack, E., Urvoy-Keller, G.: Data Indexing in Peer-to-Peer DHT Networks. In: Int. Conf. on Distributed Computing Systems (ICDCS), Columbus, Ohio, USA (June 2004)
29. Gnawali, O.: A Keyword-Set Search System for Peer-to-Peer Networks. Master thesis, Massachusetts Institute of Technology, Massachusetts, USA (June 2002)
30. Harvey, N., Jones, M., Saroiu, S., Theimer, M., Wolman, A.: SkipNet: A Scalable Overlay Network with Practical Locality Properties. In: Int. Symposium on Internet Technologies and Systems (USITS), Washington, USA (March 2003)
31. Hazel, S., Wiley, B., Wiley, O.: Achord: A Variant of the Chord Lookup Service for Use in Censorship Resistant Peer-to-Peer Publishing Systems. In: Int. Workshop on Peer To Peer Systems (IPTPS), Cambridge, MA, USA (March 2002)
32. Huebsch, R.: PIER: Internet Scale P2P Query Processing with Distributed Hash Tables. PhD thesis, EECS Department, University of California, Berkeley, California, USA (May 2008)
33. Huebsch, R., Chun, B., Hellerstein, J., Loo, B., Maniatis, P., Roscoe, T., Shenker, S., Stoica, I., Yumerefendi, A.: The Architecture of PIER: An Internet-Scale Query Processor. In: Int. Conf. on Innovative Data Systems Research (CIDR), California, USA (January 2005)
34. Huebsch, R., Hellerstein, J., Lanham, N., Loo, B., Shenker, S., Stoica, I.: Querying the Internet with PIER. In: Int. Conf. on Very Large Databases (VLDB), Berlin, Germany (September 2003)
35. Hunter, D.: Initiation XML. Editions Eyrolles (2001)
36. Iyer, S., Rowstron, A., Druschel, P.: Squirrel - A Decentralized Peer-to-Peer Web Cache. In: Int. Symposium on Principles of Distributed Computing (PODC), California, USA (July 2002)
37. Jagadish, H., Ooi, B., Vu, Q.: Baton: A Balanced Tree Structure for Peer-to-Peer Networks. In: Int. Conf. on Very Large Databases (VLDB), Trondheim, Norway (September 2005)
38. Jamard, C., Gardarin, G., Yeh, L.: Indexing Textual XML in P2P Networks Using Distributed Bloom Filters. In: Kotagiri, R., Radha Krishna, P., Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, pp. 1007–1012. Springer, Heidelberg (2007)
39. Jawad, M., Serrano-Alvarado, P., Valduriez, P.: Design of PriServ, A Privacy Service for DHTs. In: Int. Workshop on Privacy and Anonymity in the Information Society (PAIS), Nantes, France (March 2008)
40. Jawad, M., Serrano-Alvarado, P., Valduriez, P., Drapeau, S.: Data Privacy in Structured P2P Systems with PriServ (May 2009) (submitted paper)
41. Jouanot, F., D’Orazio, L., Roncancio, C.: Context-Aware Cache Management in Grid Middleware. In: Hameurlain, A. (ed.) Globe 2008. LNCS, vol. 5187, pp. 34–45. Springer, Heidelberg (2008)
42. Judd, D.D.: Geocollaboration using Peer-Peer GIS (May 2005), http://www.directionsmag.com/article.php?article_id=850
43. Kossmann, D.: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4) (2000)
44. Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., Zhao, B.: OceanStore: An Architecture for Global-Scale Persistent Storage. In: Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cambridge, MA (November 2000)
45. Lesueur, F., Mé, L., Tong, V.V.T.: A Distributed Certification System for Structured P2P Networks. In: Hausheer, D., Schönwälder, J. (eds.) AIMS 2008. LNCS, vol. 5127, pp. 40–52. Springer, Heidelberg (2008)
46. Li, Y., Jagadish, H.V., Tan, K.-L.: SPRITE: A Learning-Based Text Retrieval System in DHT Networks. In: Int. Conf. on Data Engineering, ICDE (2007)
47. Loo, B., Hellerstein, J., Huebsch, R., Shenker, S., Stoica, I.: Enhancing P2P File-Sharing with an Internet-Scale Query Processor. In: Int. Conf. on Very Large Databases (VLDB), Toronto, Canada, August-September (2004)
48. Lua, E.K., Crowcroft, J., Pias, M., Sharma, R., Lim, S.: A Survey and Comparison of Peer-to-Peer Overlay Network Schemes. IEEE Communications Surveys and Tutorials 7 (2005)
49. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: A Scalable and Dynamic Emulation of the Butterfly. In: Int. Symposium on Principles of Distributed Computing (PODC), Monterey, CA, USA (July 2002)
50. Marti, S., Garcia-Molina, H.: Taxonomy of Trust: Categorizing P2P Reputation Systems. Computer Networks 50(4) (2006)
51. Michel, S.: Top-k Aggregation Queries in Large-Scale Distributed Systems. PhD thesis, Saarland University, Saarbrücken, Germany (May 2007)
52. Molina, H., Ullman, J., Widom, J.: Database System Implementation. Prentice-Hall, Englewood Cliffs (2000)
53. Mondal, A., Madria, S.K., Kitsuregawa, M.: CLEAR: An Efficient Context and Location-Based Dynamic Replication Scheme for Mobile-P2P Networks. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006. LNCS, vol. 4080, pp. 399–408. Springer, Heidelberg (2006)
54. Ntarmos, N., Triantafillou, P., Weikum, G.: Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. In: Int. Conf. on Data Engineering (ICDE), Atlanta, USA (April 2006)
55. Open-Source Search Engine. YACY (2009), http://yacy.net/
56. P2P Streaming. Joost (2009), http://www.joost.com/
57. Petkovic, M., Jonker, W.W. (eds.): Security, Privacy, and Trust in Modern Data Management. Data-Centric Systems and Applications. Springer, Heidelberg (2007)
58. Prada, C.: Servicio para Manejar Estadísticas en Sistemas P2P Basados en DHT. Master thesis, Universidad de los Andes, Bogota, Colombia (January 2009)
59. Prada, C., Roncancio, C., Labbé, C., Villamil, M.P.: Semantic Caching Proposal in a P2P Querying System. In: Congreso Latinoamericano de Computación de Alto Rendimiento, Santa Marta, Colombia (June 2007)
60. Prada, C., Villamil, M., Roncancio, C.: Join Queries in P2P DHT Systems. In: Int. Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P), Auckland, New Zealand (August 2008)
61. Ramabhadran, S., Ratnasamy, S., Hellerstein, J., Shenker, S.: Prefix Hash Trees An Indexing Data Structure Over Distributed Hash Tables (2004), http://berkeley.intel-research.net/sylvia/pht.pdf
62. Ramachandran, A., Feamster, N.: Authenticated Out-of-Band Communication Over Social Links. In: Int. Workshop on Online social networks (WOSN), Seattle, WA, USA (August 2008)
63. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A Scalable Content Addressable Network. In: Int. Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), San Diego, CA, USA (August 2001)
64. Reynolds, P., Vahdat, A.: Efficient Peer-to-Peer Keyword Searching. In: Endler, M., Schmidt, D.C. (eds.) Middleware 2003. LNCS, vol. 2672. Springer, Heidelberg (2003)
65. Rice University Houston, USA. FreePastry (2002), http://freepastry.rice.edu/FreePastry/
66. Rowstron, A., Druschel, P.: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, p. 329. Springer, Heidelberg (2001)
67. Rowstron, A., Druschel, P.: Storage Management and Caching in PAST, A Large-scale, Persistent Peer-to-Peer Storage Utility. In: Int. Symposium on Operating Systems Principles (SOSP), Banff, Canada (October 2001)
68. Sahin, O., Gupta, A., Agrawal, D., El-Abbadi, A.: A Peer-to-Peer Framework for Caching Range Queries. In: Int. Conf. on Data Engineering (ICDE), Boston, USA, March-April (2004)
69. Serjantov, A.: Anonymizing Censorship Resistant Systems. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, p. 111. Springer, Heidelberg (2002)
70. Shing, S., Yang, G., Wang, D., Yu, J., Qu, S., Chen, M.: Making Peer-to-Peer Keyword Searching Feasible Using Multi-level Partitioning. In: Voelker, G.M., Shenker, S. (eds.) IPTPS 2004. LNCS, vol. 3279, pp. 151–161. Springer, Heidelberg (2005)
71. Sit, E., Morris, R.: Security Considerations for Peer-to-Peer Distributed Hash Tables. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, p. 261. Springer, Heidelberg (2002)
72. Skobeltsyn, G., Aberer, K.: Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks. In: Int. Workshop on Information Retrieval in Peer-to-Peer Networks (P2PIR), Arlington, USA (November 2006)
73. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In: Int. Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), San Diego, CA, USA (August 2001)
74. Triantafillou, P., Pitoura, T.: Toward a Unifying Framework for Complex Query Processing over Structured Peer-to-Peer Data Networks. In: Int. Workshop on Databases, Information Systems, and Peer-to-Peer Computing (DBISP2P), Berlin, Germany (September 2003)
75. Villamil, M.: Service de Localisation de Données pour les Systèmes P2P. PhD thesis, Institut National Polytechnique de Grenoble, Grenoble, France (June 2006)
76. Villamil, M., Roncancio, C., Labbé, C.: PinS: Peer to Peer Interrogation and Indexing System. In: Int. Database Engineering and Applications Symposium (IDEAS), Coimbra, Portugal (June 2004)
77. Villamil, M., Roncancio, C., Labbé, C.: Querying in Massively Distributed Storage Systems. In: Journées Bases de Données Avancées, Saint-Malo, France (October 2005)
78. WSDL. Web Services Description Language (WSDL) 1.1 (2001), http://www.w3.org/TR/wsdl
79. Wu, S., Li, J., Ooi, B., Tan, K.-L.: Just-in-Time Query Retrieval over Partially Indexed Data on Structured P2P Overlays. In: Int. Conf. on Management of Data (SIGMOD), Vancouver, Canada (June 2008)
80. Zhao, B., Huang, L., Stribling, J., Rhea, S., Joseph, A., Kubiatowicz, J.: Tapestry: A Resilient Global-scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications 22(1) (2004)
81. Zhu, Y., Hu, Y.: Efficient Semantic Search on DHT Overlays. Parallel and Distributed Computing 67(5) (2007)
Reverse k Nearest Neighbor and Reverse Farthest Neighbor Search on Spatial Networks

Quoc Thai Tran (1), David Taniar (1), and Maytham Safar (2)

(1) Clayton School of Information Technology, Monash University, Australia
[email protected], [email protected]
(2) Computer Engineering Department, Kuwait University, Kuwait
[email protected]
Abstract. One of the problems that arise in geographical information systems is finding objects that are influenced by other objects. While most research focuses on kNN (k Nearest Neighbor) and RNN (Reverse Nearest Neighbor) queries, an important type of proximity query called Reverse Farthest Neighbor (RFN) has not received much attention. Since our previous work shows that kNN and RNN queries in spatial network databases can be efficiently solved using the Network Voronoi Diagram (NVD), in this paper we introduce a new approach to processing reverse proximity queries, including RFN and RkNN/RkFN queries. Our approach is based on the NVD and the pre-computation of network distances, and is applicable to spatial road network maps. Our solutions are the most fundamental Voronoi-based approach to RFN and RkNN/RkFN queries, and they can be used efficiently for networks with a low or medium level of density.
1 Introduction

Due to the rapid growth of information technologies in the twenty-first century, the use of geographic maps has evolved. Although printed maps are still useful, many people today use electronic maps. People use maps not only for directions from one place to another, but also for decision making. For example, users may ask questions like “What is the shortest path from A to B?” or “What is the nearest train station to a shopping centre?”. With this in mind, numerous algorithms for k Nearest Neighbor (kNN) queries have been studied in the literature (Roussopoulos, 1995; Kolahdouzan and Shahabi, 2005; Safar, 2005). As a result, many navigation systems now support kNN searches. In addition to kNN queries, there are other types of proximity queries called Reverse Nearest Neighbor (RNN) and Reverse Farthest Neighbor (RFN) queries (Korn and Muthukrishnan, 2002; Kumar et al, 2008). An RNN query finds, for example, the residences that consider a given restaurant as their nearest restaurant. RNN search is therefore used to find the places that are most affected by a given location. An RFN search, in contrast, finds the places that are least affected by a given location. For example, a real estate company may want to know which properties are least affected by a road construction.

A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 353–372, 2009. © Springer-Verlag Berlin Heidelberg 2009
The basic RNN query can be written as R1NN; it retrieves the interest objects that consider the query point as their only nearest neighbor. The generalization of RNN search is termed RkNN, where k > 1. In RkNN, the query point is not the only nearest neighbor but one of the k nearest neighbors of the retrieved interest objects. Likewise, the basic RFN query can be generalized to RkFN, where k > 1, which finds the objects that consider the query point as one of their k farthest neighbors. Therefore, in RkNN and RkFN, the distance from an object to the query point determines the degree of influence of the query point on that object. Although these queries are commonly required, finding an efficient way to process them has been a problem in geographical information systems and spatial databases. Much research focuses on kNN and its variants, while reverse proximity queries are often neglected. These approaches commonly assume that objects can move freely. Note that when there is an underlying road network, the movement of objects is restricted to the pre-defined roads of that network (Papadias, 2003; Jensen et al, 2003). The spatial network distance from A to B can be greater than the Euclidean distance from A to B (Samet et al, 2008). Therefore, methods developed for Euclidean geometry can give significantly wrong results on spatial road networks, where the network distances between objects must be used instead of their Euclidean distances. Hence, in this paper we use spatial networks, and we aim to introduce an efficient approach to processing RNN and RFN queries in spatial network databases. Figure 1 shows the context of this paper. R1NN has been addressed in our previous work (Safar et al, 2009), in which we successfully used the Network Voronoi Diagram (Okabe et al, 2000), the PINE expansion algorithm (Safar, 2005), and the precomputed network distances between objects (Kolahdouzan and Shahabi, 2004). This paper extends our previous work by focusing on RkNN, as well as R1FN and RkFN.
         RNN                  RFN
k = 1    Our previous work    Section 4.2
k > 1    Section 4.1          Section 4.3

Fig. 1. Reverse Nearest Neighbor (RNN) and Reverse Farthest Neighbor (RFN)
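To make the gap between Euclidean and network distance concrete, the following minimal Python sketch compares the two on a toy road network. The coordinates, the road list and the helper names (`dijkstra`, `euclid`) are illustrative assumptions and are not taken from the original work.

```python
import heapq
import math

def dijkstra(graph, source):
    """Shortest network distance from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical toy network: node -> (x, y) coordinates.
coords = {"A": (0, 0), "B": (3, 0), "C": (3, 4), "D": (0, 4)}
# Undirected roads; there is no direct road between A and D.
roads = [("A", "B"), ("B", "C"), ("C", "D")]

def euclid(u, v):
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x1 - x2, y1 - y2)

# Adjacency list with the segment length as edge weight.
graph = {n: [] for n in coords}
for u, v in roads:
    w = euclid(u, v)
    graph[u].append((v, w))
    graph[v].append((u, w))

network = dijkstra(graph, "A")
for node in ("B", "C", "D"):
    print(node, "Euclidean:", euclid("A", node), "network:", network[node])
```

In this toy network D is closer to A than C in Euclidean terms (4 versus 5), but farther along the roads (10 versus 7), so a ranking based on Euclidean distance would order the neighbors of A incorrectly.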
The remainder of this paper is organized as follows: Section 2 summarizes important existing work on kNN and RNN queries. Section 3 describes some preliminaries, including our previous work on RNN. Section 4 presents our new approaches and algorithms for RkNN, RFN and RkFN. Section 5 presents and discusses the experimental results of the proposed algorithms. Finally, Section 6 concludes the paper and outlines possible future work.
2 Related Work

Existing work on proximity query processing can be categorized into two main groups. The first group focuses on k Nearest Neighbor (kNN) queries and their variants, and the second group concentrates on Reverse Nearest Neighbor (RNN) queries.
A kNN query is used to find the k nearest neighbors of a given object. Since it is useful in many applications, many approaches have been introduced to support kNN. One of the well-known approaches is the ‘branch-and-bound R-tree traversal’ algorithm developed by Roussopoulos et al (1995). It was soon followed by another algorithm, developed by Korn et al (1996), to accelerate kNN search. To achieve even faster performance, Seidl and Kriegel (1998) proposed the ‘multi-step’ algorithm, which reduces the number of candidates. While most early algorithms focus on kNN in Euclidean spaces, Papadias et al (2003) introduce new methods to process various types of spatial queries, including kNN, range search, closest pairs and e-distance joins, in spatial network databases using both network and Euclidean information about objects. Since the Voronoi diagram has been successfully used to solve problems in many applications, it is used as an alternative approach to problems in spatial analysis. Using the Voronoi diagram and pre-calculated network distances between objects, Kolahdouzan and Shahabi (2004) propose an approach termed ‘Voronoi Network-Based Nearest Neighbor’ (VN3) to find the exact kNN of the query object. In this approach, the pre-calculated distances are network distances both within and across the Voronoi polygons. As the pre-computation of network distances can be expensive when the density is high, Safar (2005) proposes a novel approach called the ‘Progressive Incremental Network Expansion’ (PINE) algorithm, which uses the network Voronoi diagram and Dijkstra’s algorithm (Dijkstra, 1959). Unlike VN3, PINE stores only the network distances between border points of the Voronoi polygons. Since the Network Voronoi Diagram has been successfully used to solve kNN problems, we continue to use it for RNN/RFN.
Note that in the above approaches the query point is a single point, and thus the query is sometimes referred to as a ‘single-source skyline query’. A ‘multi-source skyline query’, on the other hand, is one whose answer is found in response to multiple query points at the same time. An example is to find the news agencies that are closest to the train station, the bus stop and the car park. While ‘multi-source skyline queries’ are necessary in many applications, the first approach to processing them using network distances is found in (Deng et al, 2007). In addition to kNN, there is a growing body of work on RNN. These approaches commonly produce approximate results for RNN and are not applicable to spatial network databases. An early discussion of RNN queries and their variants is found in Korn and Muthukrishnan (2000), who propose a general approach for RNN queries and a method for large data sets using an R-tree. Yang et al (2001) introduce a single index structure called the ‘Rdnn-tree’ to replace the multiple indexes used in the previous work. This index structure can be used for both RNN and NN queries. A new concept, called ‘Reverse Skyline Queries’, is introduced by Dellis and Seeger (2007) to find the RNN of any given query point. It uses a ‘Branch and Bound’ algorithm to retrieve a candidate point set and refines this set to find the exact answers. Kumar et al (2008) introduce another method to process reverse proximity queries in two and three dimensions using a lifting transformation technique. However, the results are only approximate. Although these methods are helpful in geographical and spatial studies, they cannot be used for spatial network problems. Our previous work (Safar et al, 2009) discusses various types of RNN queries and proposes several algorithms to process these queries in spatial network databases. These algorithms are
based on the Network Voronoi Diagram, the PINE algorithm and the pre-computation of network distances. Similar to nearest neighbor, the basic RNN query can be generalized to find objects that consider the query point as one of their k nearest neighbors. We call this an RkNN query, where k can be any number given at query time. Although a few methods for RkNN exist in the literature (Achtert et al, 2006; Xia et al, 2005; Tao et al, 2004), methods that use the Network Voronoi Diagram to find exact results have not been studied. Thus, in this paper, we focus on using the Network Voronoi Diagram to process RkNN queries. In addition, we introduce other types of reverse proximity queries, termed RFN (Reverse Farthest Neighbor) and RkFN.
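The network-based kNN methods reviewed in this section share the idea of expanding outward from the query point along the road network. The sketch below illustrates that idea with a plain Dijkstra expansion that stops once k interest objects have been reached. It is a simplified illustration under assumed data, not an implementation of VN3 or PINE, and the function name `network_knn` and the toy graph are hypothetical.

```python
import heapq

def network_knn(graph, query, interest, k):
    """Return up to k interest nodes closest to `query` by network distance."""
    best = {query: 0.0}
    heap = [(0.0, query)]
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # already reached via a shorter path
        settled.add(u)
        if u in interest and u != query:
            result.append((u, d))  # nodes are settled in increasing distance
        for v, w in graph[u]:
            nd = d + w
            if v not in settled and nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# Hypothetical road network: node -> list of (neighbor, road length).
graph = {
    "q": [("a", 2), ("b", 5)],
    "a": [("q", 2), ("r1", 3), ("b", 1)],
    "b": [("q", 5), ("a", 1), ("r2", 1)],
    "r1": [("a", 3)],
    "r2": [("b", 1), ("r3", 6)],
    "r3": [("r2", 6)],
}
interest = {"r1", "r2", "r3"}  # assumed interest objects, e.g. restaurants

print(network_knn(graph, "q", interest, 2))
```

Because the expansion stops as soon as k interest objects are settled, the farthest object (`r3` here) is never reported for k = 2; approaches based on pre-computed distances aim to avoid even this partial traversal at query time.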
3 Preliminaries

This section provides preliminary remarks on the query types considered in this paper, namely Reverse Nearest/Farthest Neighbor and Reverse k Nearest/Farthest Neighbor queries. Since our approaches to these query types are based on the Network Voronoi Diagram, the principles of the Voronoi diagram are also reviewed. The discussion starts with the Voronoi diagram for two-dimensional Euclidean space. We then turn to the Network Voronoi Diagram, where real network distances between objects are used in place of Euclidean distances, and highlight the properties of the Voronoi diagram that our approaches rely on. A brief description of our previous work on RNN queries is provided at the end of this section.

3.1 Reverse Nearest/Farthest Neighbor Queries

We start with the terminology used in this paper: interest point (or object) and query point (or object).

Definition 1. An interest object is any object on a network that is of interest to users. An interest point is where the interest object is located. We use the terms interest object and interest point interchangeably.

Definition 2. A query object is an object on the network whose influence on interest objects is determined when the query is issued. A query point is where the query object is located. We use the terms query point and query object interchangeably in this paper.

Next, we define the basic Reverse Nearest/Farthest Neighbor and Reverse k Nearest/Farthest Neighbor queries using the above terminology.

o Type 1. A Reverse Nearest Neighbor (RNN) query retrieves the interest points (or objects) that consider the query point (or object) as their nearest neighbor. This query type is used to find the places where the query point has the most impact. Example: With a restaurant as the query point, an RNN query can be used to retrieve the other restaurants that have the query restaurant as their nearest neighbor.

o Type 2. A Reverse Farthest Neighbor (RFN) query retrieves the interest points (or objects) that consider the query point (or object) as their farthest neighbor. This query type is used to find the places where the query point has the least impact. Example: Given a train station A, an RFN query can help the train manager to determine which other train stations consider A as their farthest train station.

o Type 3. A Reverse k Nearest Neighbor (RkNN) query is a generalization of the basic RNN query in which the retrieved interest points (or objects) consider the query point as one of their k nearest neighbors (k > 1) rather than as their only nearest neighbor. Example: Given k = 2, the RkNN query retrieves all restaurants that have the query restaurant as one of their two nearest neighbors.

o Type 4. A Reverse k Farthest Neighbor (RkFN) query is a generalization of the basic RFN query in which the query point (or object) is considered as one of the k farthest neighbors (k > 1) by the retrieved interest points (or objects). Example: Given k = 2 and a train station A as the query point, the RkFN query tells the train manager which other train stations consider A as one of their two farthest train stations.
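The four query types can be stated compactly as a brute-force reference computation over a table of pairwise network distances. The sketch below is only meant to pin down the semantics; it is not the Voronoi-based method of this paper. The distance values and the helper names (`build_matrix`, `rknn`, `rkfn`) are hypothetical, and ties are broken by object name as a simplifying assumption.

```python
def build_matrix(pairs):
    """Turn {(a, b): distance} into a symmetric nested dictionary."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist

def k_nearest(dist, p, k):
    """The k objects closest to p (ties broken by name)."""
    return set(sorted(dist[p], key=lambda o: (dist[p][o], o))[:k])

def k_farthest(dist, p, k):
    """The k objects farthest from p (ties broken by name)."""
    return set(sorted(dist[p], key=lambda o: (-dist[p][o], o))[:k])

def rknn(dist, q, k=1):
    """Objects that have q among their k nearest neighbors (k=1 gives RNN)."""
    return {p for p in dist if p != q and q in k_nearest(dist, p, k)}

def rkfn(dist, q, k=1):
    """Objects that have q among their k farthest neighbors (k=1 gives RFN)."""
    return {p for p in dist if p != q and q in k_farthest(dist, p, k)}

# Hypothetical network distances between four interest objects.
dist = build_matrix({
    ("P1", "P2"): 2, ("P1", "P3"): 5, ("P1", "P4"): 9,
    ("P2", "P3"): 4, ("P2", "P4"): 7, ("P3", "P4"): 3,
})

print("RNN(P3): ", rknn(dist, "P3"))
print("R2NN(P3):", rknn(dist, "P3", k=2))
print("RFN(P1): ", rkfn(dist, "P1"))
print("R2FN(P3):", rkfn(dist, "P3", k=2))
```

This brute-force version inspects every object for every query, which is the cost that index-based and Voronoi-based approaches try to avoid.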
3.2 Voronoi Diagram and Network Voronoi Diagram

A Voronoi diagram is an exhaustive collection of exclusive Voronoi polygons (Okabe et al, 2000). Each Voronoi polygon (VP) is associated with a generator point (GP). In the Euclidean plane, every location in a Voronoi polygon is closer to the generator point of that polygon than to any other generator point in the diagram. The boundaries of the polygons are called Voronoi edges. Any location on a Voronoi edge can be assigned to more than one generator. Adjacent polygons are Voronoi polygons that share an edge, and their generators are called adjacent generators. An example of a Voronoi diagram is shown in Figure 2. Voronoi diagrams have been used to solve spatial analysis problems (Aggarwal et al, 1990; Patroumpas et al, 2007; Dickerson and Goodrich, 2008). Our proposed algorithms are based on the following property of the Voronoi diagram: “The nearest generator point of pi (e.g. pj) is among the generator points whose Voronoi polygons share similar Voronoi edges with VP(pi).” (Safar et al, 2009). Since the movement of objects is restricted to the spatial road network, the real distance between two objects is not the Euclidean distance but the actual network distance. In order to find exact answers in this setting, Network Voronoi Diagrams have been used (Okabe et al, 2000). A Network Voronoi Diagram is generated using the actual network distances, not the Euclidean distances, between objects. For this reason, the network rather than the space is divided into Voronoi polygons. All nodes and edges in a Voronoi polygon are closer to its generator point than to any other generator point (Safar et al, 2009).
A formal definition of Network Voronoi Diagram is found in Papadias et al (2003) and Roussopoulos et al (1995): “A Network Voronoi Diagram, termed NVD, is defined as graphs and is a specialization of Voronoi diagrams, where the location of objects is restricted to the links that connect the nodes of the graph and the distance between objects is defined as their shortest path in the network rather than their Euclidean distance”.
[Figure: Voronoi diagram with generator points P1 to P8; the labeled elements are a Voronoi polygon, a Voronoi edge, a Voronoi vertex, a Voronoi generator, and the border point B1.]

Fig. 2. Voronoi Diagram
An example of a Network Voronoi Diagram with three polygons is shown in Figure 3. In this example, the ‘X’ marks are the generator points and the small squares represent the road network. We use differently colored lines (gray or black) to show the different Voronoi polygons. Thus, all nodes and edges in the ‘black’ network Voronoi polygon are closer to its generator point than to the generator points of the ‘gray’ network Voronoi polygons. Note that while the Voronoi polygons in the Voronoi diagram are convex, the Voronoi polygons in the Network Voronoi Diagram may have an irregular shape because network distance is used rather than Euclidean distance.
Fig. 3. Network Voronoi Diagram (Graf and Winter, 2003)
Reverse k Nearest Neighbor and Reverse Farthest Neighbor Search on Spatial Networks
3.3 Voronoi-Based Reverse Nearest Neighbor (RNN): Our Previous Work
Our previous work (Safar et al, 2009) identified four different types of RNN queries that incorporate interest objects, a query point, and other static (or quasi-static) non-query objects. In RNN queries, the query point may be a generator point (GP) or a non-generator point (~GP). Likewise, the interest objects may be generator points or non-generator points. Based on this, four types of RNN queries have been classified: RNNGP(GP), RNNGP(~GP), RNN~GP(~GP), and RNN~GP(GP). For simplicity, in this work we only discuss RNNGP(GP), where both the query point and the interest objects are generator points. For example, given an NVD whose generator points are restaurants, RNNRestaurant(Restaurant) finds the other restaurants that consider the query restaurant as their nearest neighbor (i.e. their nearest restaurant).
Our algorithms for answering RNN queries depend on the existence of a Network Voronoi Diagram (NVD) and a set of pre-computed data (such as border-to-border and border-to-generator distances). The system described in our previous work (Safar et al, 2009) creates a set of NVDs, one for each type of interest point (e.g. an NVD for restaurants, one for schools, etc), and answers RNN queries using the previously created NVDs, the pre-computed distances, and the PINE algorithm. The RNNGP(GP) query type does not need any inner network distance calculations, since the NVD for the generator points has already been pre-computed. All the required information was computed and stored while generating the Network Voronoi Diagram. Both query objects and interest objects are generator points of the Voronoi diagram, and thus all distances from the generator points to the borders are known.
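As a rough illustration of how pre-computed generator-to-border distances can be combined, the sketch below joins two adjacent generators through their shared border points. The table layout, the border point B2 and all distance values are hypothetical. The result is exact only when the shortest path crosses the shared border directly; otherwise it is an upper bound, and a border-to-border expansion would be needed.

```python
# Hypothetical pre-computed table: (generator, border point) -> network distance.
# The values and the border point "B2" are illustrative, not taken from the paper.
GEN_TO_BORDER = {
    ("P3", "B1"): 4.0,
    ("P1", "B1"): 3.0,
    ("P3", "B2"): 2.5,
    ("P1", "B2"): 5.0,
}


def distance_via_borders(g1, g2, table):
    """Shortest g1 -> border -> g2 distance over the border points they share.

    Returns infinity when the two polygons share no border point, i.e. the
    generators are not adjacent in this table.
    """
    borders_g1 = {b for (g, b) in table if g == g1}
    borders_g2 = {b for (g, b) in table if g == g2}
    shared = borders_g1 & borders_g2
    if not shared:
        return float("inf")
    return min(table[(g1, b)] + table[(g2, b)] for b in shared)
```

With the sample table, the path through B1 costs 7.0 and the path through B2 costs 7.5, so the function reports 7.0 for the pair (P3, P1).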
The candidate interest objects for RNN belong to the set of polygons adjacent to the query, RNN ∈ {QueryAdjacentPolygons} (see Kolahdouzan and Shahabi (2004); Safar (2005) for details). The query starts by using the NVD to find the distances from the generator of the query polygon to its border points, and then the distances from those border points to the adjacent generators. For example, in Figure 2, if the query point “q” is the generator point P3, we need to find distance(q, B1) + distance(B1, P1). Once the neighboring generator points are reached, the algorithm starts a heap list with distance(q, AdjacentGeneratorPoints) as the initial distance. The distances between all candidates are measured first (i.e. distance(P1,P2), distance(P1,P4)) using the NVD generator-to-border distances and vice versa, which avoids repeating the calculation among them. To cut down the calculations even further, all distances between the polygons are compared to the shortest distance between the query q and its adjacent 1NN. If a path is found that is shorter than the q-to-1NN distance, then both interest objects are eliminated, because they are closer to each other than to the query object. The remaining candidate interest objects are then set as query points and start searching for their own 1NN. Every newly found distance from the intersection points, border points and generator points is compared to the first entry in the heap. If the distance is larger, it is added to the heap as the second entry. However, if the distance is shorter, the search stops in that direction and we mark the polygon as ‘not’ an RNN of the query point. RNN is a specialization of RkNN where k = 1; in this paper, we propose an RkNN algorithm that builds on the RNN algorithms. For simplicity, we assume that both the query point and the interest objects are generator points.
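The filtering idea in this paragraph can be sketched directly from the RNN definition, assuming the generator-to-generator network distances are already available as a symmetric table. This is a brute-force check restricted to the adjacent polygons of q, not the heap-based procedure of Safar et al (2009); the names and the distance values below are illustrative.

```python
def symmetric(pairs):
    """Expand {(a, b): d} into a nested dict with dist[a][b] == dist[b][a]."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def nearest_neighbor(p, dist):
    """Generator closest to p (ties broken by name)."""
    return min((d, o) for o, d in dist[p].items() if o != p)[1]


def rnn(q, adjacency, dist):
    """Adjacent generators of q whose own nearest neighbor is q."""
    return sorted(p for p in adjacency[q] if nearest_neighbor(p, dist) == q)


# Illustrative network distances between four generators.
DIST = symmetric({
    ("P1", "P2"): 2.0, ("P1", "P3"): 5.0, ("P1", "P4"): 6.0,
    ("P2", "P3"): 4.0, ("P2", "P4"): 7.0, ("P3", "P4"): 3.0,
})
ADJ = {"P1": {"P2", "P3"}}  # polygons adjacent to the query polygon P1
```

Here P3 is adjacent to P1 but is closer to P4 (3.0) than to P1 (5.0), so it is discarded, while P2 has P1 as its nearest neighbor and is reported.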
4 Proposed Algorithms
4.1 Reverse k Nearest Neighbor (RkNN) Search
Consider a Network Voronoi Diagram where both the query point q and the interest points {P1, P2, …, Pn} are generator points. The set of interest points that have q as one of their k nearest neighbors is denoted as RkNN(q), k > 1. Here, we highlight some key properties that are used in our approach.
Property 1. If Px belongs to RkNN(q), then the number of points that are closer to Px than q is less than or equal to k-1 (Kumar et al, 2008).
Property 2. In Kolahdouzan and Shahabi (2004) and Safar (2005), it has been shown that when k = 1, the candidate reverse nearest neighbors of q, Cand_RNN(q), are among its adjacent polygons.
Example. In Figure 4, when k = 1, if q is a query point located at P1, then Cand_RNN(q)={P2, P3,…, P7}. Based on this property, we observe that when k = 2, the set of candidate reverse nearest neighbors of q is the union of Cand_RNN(q) and the adjacent polygons of each polygon in Cand_RNN(q). Thus, in Figure 4, we have Cand_2RNN(q)={P1, P2, …, P19}.
Fig. 4. Example of an RkNN query
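The candidate sets in the example grow by one ring of adjacent polygons per expansion. A minimal sketch of that growth, assuming the polygon adjacency is given as a dictionary of sets, is shown below. The function name and the chain-shaped adjacency are illustrative, and unlike the listing of Cand_2RNN(q) above, this sketch leaves the query polygon itself out of the returned set.

```python
def candidate_set(q, adjacency, k):
    """Polygons reachable from q in at most k adjacency expansions (q excluded)."""
    visited = {q}
    frontier = {q}
    for _ in range(k):
        # Next ring: neighbours of the current ring that were not seen before.
        frontier = {n for p in frontier for n in adjacency[p]} - visited
        visited |= frontier
    return visited - {q}


# Illustrative adjacency: four polygons in a chain A - B - C - D.
CHAIN = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
```

For the chain, one expansion from A yields only B, two expansions add C, and three expansions reach D.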
Property 3. For any number k > 1 and a query point q, the candidate set of Reverse k Nearest Neighbors of q, Cand_RkNN(q), includes all adjacent polygons resulting from the k expansions from q. Hence, this property is the generalization of Property 2.
Proof. This property can be proven by contradiction. Consider Figure 4 as an example. When k = 2, if P20 ∈ Cand_2RNN(q), then q is considered one of the two nearest neighbors of P20, q ∈ 2NN(P20), and the maximum number of polygons that
are closer to P20 than q is one. However, the shortest path from q to P20 must pass P9, P2, P7, P18, P8, P19. This means there are at least two polygons in {P9, P2, P7, P18, P8, P19} which are closer to P20 than q and therefore, q ∉ 2NN(P20) contradicts our initial assumption. As the result, in Figure 4, the candidate R2NNs of q include all adjacent polygons that are found from the two expansions from q.
Taking into consideration the above properties, we develop an efficient approach to process RkNN queries in spatial network databases. Our approach includes three main steps which are described as follows:
Step 1: Find polygon Px that contains the query point q.
Step 2: Assign adjacent polygons of Px to be the first candidate of RkNN(q).
Step 3: For each candidate polygon Py, find the k nearest neighbors of Py, kNN(Py), using PINE algorithm. The calculation of network distances between generators is based on: (i) the pre-computation of inner network distances and (ii) the border-to-border network distance calculation using Dijkstra’s algorithm. If kNN(Py) contains Px, then add Py to the result. Add adjacent polygons of Py to the candidate set. Repeat step 3 until k expansions have been made and return the result.
Figure 5 shows the complete algorithm for RkNN queries.

Algorithm RkNN(k, q)
  Px = contains(q)                //Voronoi polygon that contains q
  result = ∅                      //Result polygons
  Cand_RkNN = ∅                   //Candidate polygons
  New_Cand = AP(Px)               //All adjacent polygons of Px
  i = 0                           //Counter for the do-while loop
  do
    //Update candidate polygon set in which existing candidates are replaced
    //by new polygons in the temporary candidate set.
    Cand_RkNN = New_Cand \ (Cand_RkNN ∪ Px)
    New_Cand = ∅
    for each Py in Cand_RkNN
      //Call PINE algorithm to find the k nearest neighbors of Py
      kNN(Py) = PINE(k, Py)
      if Px ∈ kNN(Py) then        //If Py considers Px as one of its kNNs
        result = result ∪ Py      //Update the result with Py
      end if
      //Expand the boundary to find new candidate polygons
      New_Cand = New_Cand ∪ AP(Py)
    end for
    increase i by 1
  while i < k                     //Restrict the expansion
  return result
Fig. 5. RkNN Algorithm
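The listing in Figure 5 can be approximated by the following sketch. Two simplifications are assumed: the k nearest neighbors of a polygon are read from a complete, pre-computed distance table instead of being produced by PINE, and every polygon already examined is excluded from later rounds (Figure 5 excludes only the previous candidate set and Px). The data describe four generators placed along a single road and are purely illustrative.

```python
def knn(p, dist, k):
    """k nearest generators of p from a full distance table (stand-in for PINE)."""
    ranked = sorted((d, o) for o, d in dist[p].items() if o != p)
    return {o for _, o in ranked[:k]}


def rknn(q, k, adjacency, dist):
    """Generators that have q among their k nearest neighbors.

    Candidates are gathered by k adjacency expansions from q, as in Figure 5.
    """
    result = set()
    visited = {q}
    new_cand = set(adjacency[q])
    for _ in range(k):                 # one pass per expansion
        cand = new_cand - visited
        visited |= cand
        new_cand = set()
        for p in cand:
            if q in knn(p, dist, k):
                result.add(p)
            new_cand |= adjacency[p]   # widen the boundary for the next pass
    return result


# Illustrative data: generators at positions 0, 1, 3 and 7 on one road.
POS = {"A": 0, "B": 1, "C": 3, "D": 7}
DIST = {p: {o: abs(POS[p] - POS[o]) for o in POS} for p in POS}
ADJ = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
```

With q = A, only B has A as its single nearest neighbor, while for k = 2 both B and C list A among their two nearest neighbors. D is never examined for k = 2 because it lies three expansions away from A.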
To demonstrate our algorithm, we take Figure 4 as an example. Suppose that we have an NVD where the generator points {P1, P2, …, P21} are hospitals and the query point q located at P1. An RkNN query is used to find other hospitals that consider q as one of their k nearest neighbors. First, calling the function contains(q) returns P1 as the Voronoi polygon that contains q. Second, we create three empty sets for result polygons, candidate polygons and newly found polygons, respectively. We denote these sets by result, cand_RkNN, and new_cand.
result=∅
cand_RkNN=∅
new_cand=∅
Next, we find adjacent polygons of P1 and add them to new_cand:
new_cand={P2, P3, …, P7}
Create a variable called i and set i=0.
cand_RkNN= new_cand \ (cand_RkNN ∪ P1)={P2, P3, …, P7}
new_cand=∅
For each polygon Py in cand_RkNN, we find the kNN of Py using PINE algorithm. For example, let us assume:
kNN(P2) = PINE(2,P2) = {P3, P9}
…
kNN(P7) = PINE(2,P7) = {P1, P6}
Since P1 ∈ kNN(P7), P7 is added to the result:
result={P7, …, Pn}.
Next, we find adjacent polygons of P7 and add those polygons to new_cand before we go for the next candidate polygon in the Cand_RkNN.
new_cand= new_cand ∪ AP(P7) = {P2, P1, P19, P18, P6, P8, …, Pn}
When every polygon in cand_RkNN is explored, we have:
new_cand= new_cand ∪ AP(P7) = {P1, P2, …, P19}
Here, we increase i by 1 and thus, i is now equal to 1. Since i
search, every interest object on the NVD is a candidate RFN(q) and therefore, must be examined. However, an important property of RFN queries highlighted in Kumar et al (2008) is given in the following.

Algorithm RFN(q)
  Px = contains(q)                         //Voronoi polygon that contains q
  result = {Pi, …, Pj} \ Px where Pi, …, Pj ∈ NVD
  Old_Cand(Px) = Px
  New_Cand(Px) = AP(Px)                    //Adjacent polygons of Px
  Cand(Px) = New_Cand(Px) \ Old_Cand(Px)
  do                                       //Start the expansion from Px
    for each Py in Cand(Px)
      Compute d(Py, Px)
      dmax = 0
      Old_Cand(Py) = {Px, Py}
      New_Cand(Py) = AP(Py)                //Adjacent polygons of Py
      Cand(Py) = New_Cand(Py) \ Old_Cand(Py)
      do                                   //Start the expansion from Py
        for each Pz in Cand(Py)
          if d(Pz, Py) > dmax then
            dmax = d(Pz, Py)
          end if
          //Add Pz to the old candidate polygons of Py
          Old_Cand(Py) = Old_Cand(Py) ∪ Pz
          //Find new candidate polygons of Py in the adjacent polygons of Pz
          New_Cand(Py) = AP(Pz) ∪ New_Cand(Py)
        end for
        //Exclude the old candidates from the new candidate polygons in the expansion of Py
        Cand(Py) = New_Cand(Py) \ Old_Cand(Py)
      while dmax ≤ d(Py, Px) and Cand(Py) ≠ ∅
      //End the expansion from Py
      if dmax > d(Py, Px) then
        //When there exists a point Pz that is farther from Py than Px
        Remove Py from the result
      end if
      //Add Py to the old candidate polygons of Px
      Old_Cand(Px) = Old_Cand(Px) ∪ Py
      //Find new candidate polygons of Px in the adjacent polygons of Py
      New_Cand(Px) = AP(Py) ∪ New_Cand(Px)
    end for
    //Exclude the old candidate polygons from the new candidate polygons in the expansion of Px
    Cand(Px) = New_Cand(Px) \ Old_Cand(Px)
  while Cand(Px) ≠ ∅                       //End the expansion from Px
  return result
Fig. 6. RFN Algorithm
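A simplified reading of Figure 6 is sketched below, assuming that all generator-to-generator network distances are available in a table rather than computed on demand with Dijkstra's algorithm. A polygon p is kept as a reverse farthest neighbor of q when no other polygon is farther from p than q is. The inner loop expands ring by ring from p and stops as soon as a farther polygon appears. Because every polygon must be examined anyway, the outer expansion from q is replaced by a plain loop over all polygons. Names and data are illustrative.

```python
def rfn(q, adjacency, dist):
    """Polygons p for which no other polygon is farther from p than q."""
    result = set()
    for p in dist:
        if p == q:
            continue
        d_q = dist[p][q]
        seen = {p}
        frontier = set(adjacency[p])
        is_rfn = True
        while frontier and is_rfn:
            # A single farther polygon is enough to reject p.
            if any(dist[p][z] > d_q for z in frontier):
                is_rfn = False
            seen |= frontier
            frontier = {n for z in frontier for n in adjacency[z]} - seen
        if is_rfn:
            result.add(p)
    return result


# Illustrative data: generators at positions 0, 1, 3 and 7 on one road.
POS = {"A": 0, "B": 1, "C": 3, "D": 7}
DIST = {p: {o: abs(POS[p] - POS[o]) for o in POS} for p in POS}
ADJ = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
```

For q = A, only D regards A as its farthest neighbor; for q = D, all three remaining generators do, since D sits at the far end of the road.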
Property 4. An interest object Pi is called a Reverse Farthest Neighbor of q if the distance from Pi to q is greater than or equal to the distance from Pi to any other object Pj ≠ Pi.
Based on this property, our approach for RFN queries works as follows:
Step 1: Find the polygon Px that contains the query point q.
Step 2: Assign the adjacent polygons of Px to be the first candidates for RFN(q).
Step 3: For each candidate polygon Py, compute the distance from Py to Px and to Py’s adjacent polygons using Dijkstra’s algorithm. Property 4 implies that if Py is an RFN(q), then distance(Py, Px) ≥ distance(Py, Pz) for any Pz ≠ Py. Therefore, as soon as we find a polygon Pz that is farther from Py than q, i.e. distance(Py, Pz) > distance(Py, q), we remove Py from the result and move to the next candidate polygon. Otherwise, we continue the expansion from Py until we find a polygon farther than q or every candidate for the farthest neighbor of Py has been reached. Then, add the adjacent polygons of Py to the candidates for RFN(q). Repeat step 3 until every polygon on the NVD is explored.
The complete algorithm for RFN is shown in Figure 6. Here, we take the RFN query shown in Figure 7 as an example. In this example, the NVD contains a set of interest points {P2, P3, …, P13} and a query point q located at P1. Note that both the interest points and the query point are generator points. The query RFN(q) finds the other interest points that consider q as their farthest neighbor.
Fig. 7. An example of RFN query
First of all, we retrieve the Voronoi polygon where q is located. In this case, it returns P1.
Contains(q) = P1
Secondly, several empty sets are created and we name these sets as old_cand(P1), cand(P1) and new_cand(P1).
old_cand(P1)=∅
cand(P1)=∅
new_cand(P1)=∅
Next, we add P1 to old_cand(P1) and add the adjacent polygons of P1 to new_cand(P1).
old_cand(P1)={P1}
new_cand(P1) ={P2, P3, P4}
Now, we have
cand(P1)= new_cand(P1) \ old_cand(P1) = {P2, P3, P4}
For each polygon Py in cand(P1), the distance from Py to P1, d(Py, P1), is calculated using pre-computed network distances and Dijkstra’s algorithm. Also, dmax is set to 0. In the next step, we create several empty sets, namely old_cand(Py), cand(Py) and new_cand(Py). We assign Px and Py to the old_cand(Py) set and assign the adjacent polygons of Py to the new_cand(Py) set. For example, when Py=P2:
old_cand(P2)={P1, P2}
new_cand(P2) ={P1, P3, P8}
cand(P2) = new_cand(P2) \ old_cand(P2) = {P3, P8}
Then, for each polygon Pz in cand(P2), the distance from Pz to P2, d(Pz, P2), is calculated and compared with the largest network distance found so far, dmax. If d(Pz, P2)>dmax, then dmax= d(Pz, P2). For instance, when Pz=P3:
old_cand(P2)={P1, P2, P3}
new_cand(P2) ={P1, P2, P8, P7, P6, P4}
cand(P2) = new_cand(P2) \ old_cand(P2) = { P8, P7, P6, P4 }
Suppose dmax=d(P3, P2) > distance(P1, P2), which means P3 is farther from P2 than P1 and thus P2 is not regarded as an RFN(q). Next, we add P3 to old_cand(P2), add the adjacent polygons of P3 to new_cand(P2), and go to the next polygon in cand(P2). When every polygon in cand(P2) has been explored, we add P2 to old_cand(P1) and add its adjacent polygons to new_cand(P1). We call this a filter/refinement process. The process is repeated until no more polygons are found in cand(P1).
4.3 Reverse k Farthest Neighbor (RkFN) Search
Let {P1, P2, …, Pn} be interest points and q be a query point on a Network Voronoi Diagram. RkFN(q) is the set of interest points that have q as one of their k farthest neighbors, k > 1. Since our approach for RkFN is based on the RkNN and RFN approaches, most of the properties used for RkNN and RFN are also applicable to RkFN.
However, we introduce the following property, which is specific to RkFN.
Property 5: If Px belongs to RkFN(q), then there are at most k-1 points that are farther from Px than q (Kumar et al, 2008).
The summary of our approach for RkFN is as follows:
Step 1: Find the polygon Px that contains the query point q.
Step 2: Assign the adjacent polygons of Px to be the first candidates for RkFN(q).
Algorithm RkFN(k, q)
  Px = contains(q)                         //Voronoi polygon that contains q
  result = {Pi, …, Pj} \ Px where Pi, …, Pj ∈ NVD
  Old_Cand(Px) = Px
  New_Cand(Px) = AP(Px)                    //Adjacent polygons of Px
  Cand(Px) = New_Cand(Px) \ Old_Cand(Px)
  do                                       //Start the expansion from Px
    for each Py in Cand(Px)
      //Compute the distance from Py to Px
      Compute d(Py, Px)
      dmax = 0
      Old_Cand(Py) = {Px, Py}
      New_Cand(Py) = AP(Py)                //Adjacent polygons of Py
      Cand(Py) = New_Cand(Py) \ Old_Cand(Py)
      do                                   //Start the expansion from Py
        kFN(Py) = k farthest polygons in Cand(Py) ∪ Px
        for each Pz in Cand(Py)
          //Add Pz to the old candidate polygons of Py
          Old_Cand(Py) = Old_Cand(Py) ∪ Pz
          //Find new candidate polygons of Py in the
          //adjacent polygons of Pz
          New_Cand(Py) = AP(Pz) ∪ New_Cand(Py)
        end for
        //Exclude the old candidate found in the new
        //candidate polygons found in the expansion of Py
        Cand(Py) = New_Cand(Py) \ Old_Cand(Py)
      while Px ∈ kFN(Py) and Cand(Py) ≠ ∅
      //End the expansion from Py
      if Px ∉ kFN(Py) then
        //When there exist more than k-1 points that are farther
        //from Py than Px
        Remove Py from the result
      end if
      //Add Py to the old candidate polygons of Px
      Old_Cand(Px) = Old_Cand(Px) ∪ Py
      //Find new candidate polygons of Px in the adjacent polygons of Py
      New_Cand(Px) = AP(Py) ∪ New_Cand(Px)
    end for
    //Exclude the old candidate polygons from the new candidate polygons
    //found in the expansion of Px
    Cand(Px) = New_Cand(Px) \ Old_Cand(Px)
  while Cand(Px) ≠ ∅                       //End the expansion from Px
  return result
Fig. 8. RkFN Algorithm
Step 3: For each candidate polygon Py, find the k farthest neighbors of Py, kFN(Py), starting from its adjacent polygons. A kFN query finds the set of k interest objects that are farthest from the query object. If kFN(Py) does not contain q, then Py is not an RkFN(q) and we jump to the next candidate polygon. Otherwise, we continue the search for kFN(Py) until no more polygons are found. Update the candidates for RkFN(q) with the adjacent polygons of Py. Repeat step 3 until every polygon on the NVD is explored.
Figure 8 shows the complete algorithm for RkFN queries.
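Property 5 suggests a direct membership test: p belongs to RkFN(q) when at most k-1 generators are farther from p than q is. The sketch below applies that test with an early exit, assuming a complete table of generator-to-generator network distances. It does not reproduce the incremental expansion of Figure 8, and the names and data are illustrative.

```python
def rkfn(q, k, dist):
    """Generators that have q among their k farthest neighbors (Property 5)."""
    result = set()
    for p in dist:
        if p == q:
            continue
        d_q = dist[p][q]
        farther = 0
        for z, d in dist[p].items():
            if z != p and d > d_q:
                farther += 1
                if farther > k - 1:
                    break          # too many generators are farther than q
        else:
            result.add(p)          # loop finished without exceeding k-1
    return result


# Illustrative data: generators at positions 0, 1, 3 and 7 on one road.
POS = {"A": 0, "B": 1, "C": 3, "D": 7}
DIST = {p: {o: abs(POS[p] - POS[o]) for o in POS} for p in POS}
```

For q = A, the result grows with k: only D for k = 1, C and D for k = 2, and all other generators for k = 3. This matches the observation that larger values of k admit more reverse farthest neighbors.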
5 Performance Evaluation
To evaluate the performance of our proposed algorithms, we carried out several experiments on a Pentium IV 2.33GHz processor with 2GB RAM, running Windows XP. For our testing, we use real-world data sets for navigation systems and GPS-enabled devices from NavTech Inc. These consist of 110,000 links and 79,800 nodes representing a real road network in downtown Los Angeles. We test our algorithms using different sets of interest points, including hospitals, shopping centres, and parks. The experiment for each algorithm is performed as follows. First, we run each proposed query type (RkNN, RFN and RkFN) 20 times, each time with a randomly chosen query point. Then, we calculate the average number of RNN/RFN results, the CPU time and the memory accesses, and use graphs to show our experimental results. The purpose of our experiments is to show how factors such as the value of k and the object density affect the performance of our algorithms in terms of execution time and disk accesses.
5.1 Performance of Reverse k Nearest Neighbor Queries
The performance of the RkNN query algorithm is tested on different aspects, including object density (from low to high) and the value of k. In our experiments, we use the term ‘Low Density’ if the number of objects is less than 20 (e.g. Hospitals) and ‘High Density’ if the number of objects is greater than 40 (e.g. Parks). ‘Medium Density’ refers to a number of objects between 20 and 40 (e.g. Shopping Centers). Since we are interested in the average number of RNN results, their execution time and their memory accesses, we present them in Table 1. Figure 9 depicts our experimental results for the RkNN algorithm.
Table 1. Performance of RkNN queries
Fig. 9. Comparison of k and execution time and memory use for RkNN queries: (a) CPU time (sec) vs. k; (b) Memory vs. k
As shown in Figure 9, as the value of k increases, the CPU time and the number of memory accesses also increase. This is because a larger k increases the number of candidate objects for the reverse nearest neighbor search, and with it the time and memory needed to process the RNN query. However, these changes vary with object density. For objects with low or medium density, there is only a slight increase in execution time and memory, whereas for high-density objects the increase is more considerable. Therefore, the higher the object density, the more CPU time and memory our algorithm needs. In summary, although performance degrades significantly as the density increases, our algorithm still achieves good response time and reasonable memory use at low and medium object densities.
5.2 Performance of Reverse k Farthest Neighbor Queries
Similar to RkNN, the performance of RFN and RkFN queries is evaluated in terms of number of results, execution time and memory use. Our testing again uses different values of k and different object densities. We run the query 20 times and assign a random query point to each run. Our experimental results are averaged and summarized in Table 2. We also use Figure 10 to depict our results. In Figure 10, the amount of execution time and memory is shown on x axis while the values of k are shown on y axis. It shows that on average, the response time and
memory used for RFN queries increase as the value of k increases. This is because when the value of k is small, the amount of resources needed to run the farthest neighbour query for each candidate is also small. For example, when k = 1, the network expansion from each candidate stops as soon as it finds any object that is farther than the query object. Thus, the RFN query executes faster.
Table 2. Performance of RFN and RkFN queries
Fig. 10. Comparison of k and execution time and memory use for RkFN queries: (a) CPU time (sec) vs. k; (b) Memory vs. k
Additionally, from our experiments with RkFN, we note that the amount of CPU time and memory needed rises rapidly as the value of k increases when the number of objects is high. As explained in our algorithm, the search space for RkFN is not restricted as it is for RkNN queries. Every object on the network can be an RFN of the query object and thus must be explored. This explains why the performance of our algorithm for RkFN decreases as the number of objects increases. Nonetheless, our fundamental algorithms for RkNN and RkFN queries perform well on low and medium density data sets.
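The measurement procedure described in this section (20 runs per query type, a random query point per run, averaged CPU time) and the density classes of Section 5.1 can be sketched as follows. This is a hypothetical harness rather than the one used for the reported experiments; it measures CPU time only and does not count memory or disk accesses.

```python
import random
import time


def average_cpu_time(query, query_points, runs=20, seed=0):
    """Average CPU seconds of query(q) over `runs` randomly chosen query points."""
    rng = random.Random(seed)      # fixed seed so the choice of points is repeatable
    total = 0.0
    for _ in range(runs):
        q = rng.choice(query_points)
        start = time.process_time()
        query(q)
        total += time.process_time() - start
    return total / runs


def density_class(n_objects):
    """Density classes as defined in Section 5.1."""
    if n_objects < 20:
        return "low"
    if n_objects <= 40:
        return "medium"
    return "high"
```

A call such as `average_cpu_time(lambda q: my_query(q), points)` would be repeated for each value of k and each data set, where `my_query` and `points` stand for whichever query implementation and set of generator points are being measured.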
6 Conclusion and Future Work
In this paper, we focused on three types of reverse proximity query, namely (i) Reverse k Nearest Neighbor, (ii) Reverse Farthest Neighbor, and (iii) Reverse k Farthest Neighbor, and on their relations to each other. We also outlined the importance of these query types in geographical planning, in contrast to the little attention they have received from researchers. In practice, movement between objects is constrained to the pre-defined roads of an underlying network. Therefore, existing approaches for reverse proximity queries that use Euclidean distances give only estimated results. Taking this into account, we developed new approaches for RkNN, RFN and RkFN using the Network Voronoi Diagram, the PINE network expansion algorithm and pre-computed network distances, so that they can be applied to spatial road networks. We also extended the properties of network Voronoi polygons to find new constraints for the network expansion in RkNN/RkFN searches. Since location-aware systems are predicted to be widely used in the future, understanding how different spatial analysis problems can be solved using NVD would be an advance. The outcome of this paper may lead to a new and interesting field of research in spatial network query processing and to new applications that support RNN/RFN queries. While the objects in the basic RNN/RFN searches discussed in this paper are of the same type, we plan to extend these queries to a bichromatic version, termed bichromatic RNN/RFN, in a future paper.
Acknowledgments
This research has been partially funded by the Australian Research Council (ARC) Discovery Project (Project No: DP0987687).
References
Achtert, E., Bohm, C., Kroger, P., Kunath, P., Pryakhin, A., Renz, M.: Approximate reverse k-nearest neighbor queries in general metric spaces. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 788–789 (2006)
Achtert, E., Bohm, C., Kroger, P., Kunath, P., Pryakhin, A., Renz, M.: Efficient reverse k-nearest neighbour search in arbitrary metric spaces. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 515–526 (2006)
Aggarwal, A., Hansen, M., Leighton, T.: Solving query-retrieval problems by compacting Voronoi diagrams. In: Proceedings of the 22nd annual ACM Symposium on Theory of Computing, pp. 331–340 (1990)
Dellis, E., Seeger, B.: Efficient computation of reverse skyline queries. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 291–302 (2007)
Deng, K., Zhou, X., Shen, H.: Multi-source Skyline Query Processing in Road Networks. In: Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, pp. 796–805 (2007)
Dickerson, M., Goodrich, M.: Two-site Voronoi diagrams in geographic networks. In: Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, vol. 59 (2008)
Dijkstra, E.W.: A Note on Two Problems in Connexion with Graphs. Numerische Mathematik 1(1), 269–271 (1959)
Goh, J.Y., Taniar, D.: Mobile Data mining by Location Dependencies. In: Yang, Z.R., Yin, H., Everson, R.M. (eds.) IDEAL 2004. LNCS, vol. 3177, pp. 225–231. Springer, Heidelberg (2004)
Goh, J., Taniar, D.: Mining Frequency Pattern from Mobile Users. In: Negoita, M.G., Howlett, R.J., Jain, L.C. (eds.) KES 2004, Part III. LNCS, vol. 3215, pp. 795–801. Springer, Heidelberg (2004)
Graf, M., Winter, S.: Network Voronoi Diagram. Österreichische Zeitschrift für Vermessung und Geoinformation 91(3), 166–174 (2003)
Jensen, C., Kolarvr, J., Pedersen, T., Timko, I.: Nearest neighbor queries in road networks. In: Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems, pp. 1–3 (2003)
Korn, F., Muthukrishnan, S.: Influence sets based on reverse nearest neighbor queries. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, vol. 29(2), pp. 201–212 (2000)
Korn, F., Muthukrishnan, S., Srivastava, D.: Reverse nearest neighbor aggregates over data streams. In: Proceedings of the International Conference on Very Large Data Bases, pp. 814–825 (2002)
Kolahdouzan, M., Shahabi, C.: Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. In: Proceedings of the 30th International Conference on Very Large Data Bases, vol. 30, pp. 840–851 (2004)
Kumar, Y., Janardan, R., Gupta, P.: Efficient algorithms for reverse proximity query problems. In: Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, vol. 39 (2008)
Okabe, A., Boots, B., Sugihara, K., Chiu, N.: Spatial Tessellations, Concepts and Applications of Voronoi Diagrams, 2nd edn. John Wiley and Sons Ltd., Chichester (2000)
Papadias, D., Mamoulis, N., Zhang, J., Tao, Y.: Query Processing in Spatial Network Databases. In: Proceedings of the 29th International Conference on Very Large Data Bases, vol. 29, pp. 802–813 (2003)
Patroumpas, K., Minogiannis, T., Sellis, T.: Approximate order-k Voronoi cells over positional streams. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, vol. 36 (2007)
Roussopoulos, N., Kelly, S., Vincent, F.: Nearest Neighbor Queries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 71–79 (1995)
Safar, M.: K Nearest Neighbor Search in Navigation Systems. Mobile Information Systems 1(3), 207–224 (2005)
Safar, M., Ebrahmi, D.: eDar Algorithm for Continuous KNN queries based on PINE. International Journal of Information Technology and Web Engineering 1(4), 1–21 (2006)
Safar, M., Ibrahimi, D., Taniar, D., Gavrilova, M.: Voronoi-based Reverse Nearest Neighbour Query Processing on Spatial Networks. Multimedia Systems Journal (in press, 2009)
Samet, H., Sankaranarayanan, J., Alborzi, H.: Scalable network distance browsing in spatial databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 43–54 (2008)
Seidl, T., Kriegel, H.: Optimal Multi-Step k-Nearest Neighbor Search. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 154–165 (1998)
Stephen, R., Ghinea, G., Patel, M., Serif, T.: A context-aware Tour Guide: User implications. Mobile Information Systems 3(2), 71–88 (2007)
Tao, Y., Papadias, D., Lian, X.: Reverse kNN search in arbitrary dimensionality. In: Proceedings of the 30th International Conference on Very Large Data Bases, pp. 744–755 (2004)
Waluyo, A., Srinivasan, B., Taniar, D.: Optimal Broadcast Channel for Data Dissemination in Mobile Database Environment. In: Zhou, X., Xu, M., Jähnichen, S., Cao, J. (eds.) APPT 2003. LNCS, vol. 2834, pp. 655–664. Springer, Heidelberg (2003)
Waluyo, A., Srinivasan, B., Taniar, D.: A Taxonomy of Broadcast Indexing Schemes for Multi Channel Data Dissemination in Mobile Database. In: Proceedings of the 18th International Conference on Advanced Information Networking and Applications (AINA 2004), pp. 213–218. IEEE Computer Society, Los Alamitos (2004)
Waluyo, A., Srinivasan, B., Taniar, D.: Research in mobile database query optimization and processing. Mobile Information Systems 1(4), 225–252 (2005)
Xia, C., Hsu, W., Lee, M.: ERkNN: efficient reverse k-nearest neighbors retrieval with local kNN distance estimation. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 533–540 (2005)
Xuan, K., Zhao, G., Taniar, D., Srinivasan, B., Safar, M., Gavrilova, M.: Continuous Range Search based on Network Voronoi Diagram. International Journal of Grid and Utility Computing (in press, 2009)
Xuan, K., Zhao, G., Taniar, D., Srinivasan, B., Safar, M., Gavrilova, M.: Network Voronoi Diagram based Range Search. In: Proceedings of the 23rd IEEE International Conference on Advanced Information Networking and Applications, AINA (in press, 2009)
Xuan, K., Zhao, G., Taniar, D., Srinivasan, B., Safar, M., Gavrilova, M.: Continuous Range Search Query Processing in Mobile Navigation. In: Proceedings of the 14th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2008, pp. 361–368. IEEE Computer Society, Los Alamitos (2008)
Yang, C., Lin, K.: An Index Structure for Efficient Reverse Nearest Neighbor Queries. In: Proceedings of the 17th International Conference on Data Engineering, pp. 485–492 (2001)
Zhao, G., Xuan, K., Taniar, D., Srinivasan, B.: Incremental K-Nearest Neighbor Search On Road Networks. Journal of Interconnection Networks 9(4), 455–470 (2008)
Author Index
Adam, Emmanuel
Afsarmanesh, Hamideh
Atzeni, Paolo
Benslimane, Djamal
Bhide, Manish
Böhm, Christian
Boukadi, Khouloud
Cámara, Javier
Camarinha-Matos, Luis M.
Cappellari, Paolo
Chakaravarthy, Venkatesan T.
Dabringer, Claus
del Pilar Villamil, María
Draheim, Dirk
Eder, Johann
Ermilova, Ekaterina
Fischer-Huebner, Simone
Furnell, Steven
Ghedira, Chirine
Gianforme, Giorgio
Gupta, Himanshu
Hameurlain, Abdelkader
Kobsa, Alfred
Labbé, Cyril
Lambrinoudakis, Costas
Leitao, Paulo
Maamar, Zakaria
Mohania, Mukesh
Morvan, Franck
Msanjila, Simon S.
Noll, Robert
Plant, Claudia
Roncancio, Claudia
Roy, Prasan
Safar, Maytham
Schicho, Michaela
Serrano-Alvarado, Patricia
Stark, Konrad
Taniar, David
Tran, Quoc Thai
Valckenaers, Paul
Vincent, Lucien
Wackersreuther, Bianca
Zherdin, Andrew