Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5739
Janis Grundspenkis Tadeusz Morzy Gottfried Vossen (Eds.)
Advances in Databases and Information Systems 13th East European Conference, ADBIS 2009 Riga, Latvia, September 7-10, 2009 Proceedings
13
Volume Editors Janis Grundspenkis Riga Technical University Institute of Applied Computer Systems Kalku iela 1, LV 1658 Riga, Latvia E-mail:
[email protected] Tadeusz Morzy Poznań University of Technology Institute of Computing Science, Piotrowo 2, 60-965 Poznań, Poland E-mail:
[email protected] Gottfried Vossen University of Münster Department of Information Systems Leonardo Campus 3, 48149 Münster, Germany E-mail:
[email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): H.2, H.3, K.8.1, C.2.4, J.1, H.5, I.7
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-03972-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03972-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12747030 06/3180 543210
Preface
These proceedings contain 25 contributed papers presented at the 13th East European Conference on Advances in Databases and Information Systems (ADBIS 2009), held September 7-10, 2009, in Riga, Latvia. The Call for Papers attracted 93 submissions from 28 countries. In a rigorous reviewing process the international Program Committee of 64 members from 29 countries selected these 25 contributions for publication in this volume; in addition, there is the abstract of an invited talk by Matthias Brantner. Furthermore, 18 additional contributions were selected for short presentations and have been published in a separate volume of local proceedings by the organizing institution. Topically, the accepted papers cover a wide spectrum of database and information system topics ranging from query processing and optimization via query languages, design methods, data integration, indexing and caching to business processes, data mining, and application-oriented topics like XML and data on the Web. The ADBIS 2009 conference continued the series of ADBIS conferences organized every year in different countries of Eastern and Central Europe, beginning in St. Petersburg (Russia, 1997), Poznan (Poland, 1998), Maribor (Slovenia, 1999), Prague (Czech Republic, as a joint ADBIS-DASFAA conference, 2000), Vilnius (Lithuania, 2001), Bratislava (Slovakia, 2002), Dresden (Germany, 2003), Budapest (Hungary, 2004), Tallinn (Estonia, 2005), Thessaloniki (Greece, 2006), Varna (Bulgaria, 2007), and Pori (Finland, 2008). The conferences are initiated and supervised by an international Steering Committee, which consists of representatives from Armenia, Austria, Bulgaria, Czech Republic, Greece, Estonia, Germany, Hungary, Israel, Italy, Latvia, Lithuania, Poland, Russia, Serbia, Slovakia, Slovenia, and Ukraine, and is chaired by Professor Leonid Kalinichenko. The ADBIS conferences have established an outstanding reputation as a scientific event of high quality, serving as an international forum for the presentation, discussion, and dissemination of research achievements in the field of databases and information systems. ADBIS 2009 aimed to promote interaction and collaboration between European research communities from all parts of Europe and the rest of the world. Additionally, ADBIS 2009 aimed to create conditions for experienced researchers to impart their knowledge and experience to the young researchers participating in the Doctoral Consortium organized in association with the ADBIS 2009 conference. We would like to express our thanks to everyone who has contributed to the success of ADBIS 2009. We thank the authors who submitted papers to the conference, the Program Committee members and external reviewers for ensuring the quality of the scientific program, all members of the local organizing team in Riga (Latvia) for giving their time and expertise to ensure the success of the conference, and, finally, Alfred Hofmann of Springer for accepting these proceedings for the LNCS series. The Program Committee work relied on
EasyChair, which once again proved to be an exceptionally handy and convenient tool for this kind of work; we are therefore also grateful to the people who created it and who maintain it. The Doctoral Consortium held during ADBIS 2009 was sponsored by the VLDB Endowment, which is gratefully acknowledged. Last but not least, we thank the Steering Committee and, in particular, its Chair, Leonid Kalinichenko, for their help and guidance. September 2009
Janis Grundspenkis Tadeusz Morzy Gottfried Vossen
Conference Organization
General Chair
Janis Grundspenkis, Riga Technical University, Latvia

Program Chairs
Tadeusz Morzy, Poznań University of Technology, Poland
Gottfried Vossen, University of Münster, Germany
Program Committee
Paolo Atzeni, Università Roma Tre, Italy
Guntis Barzdins, Institute of Mathematics and Computer Science, Latvia
Andreas Behrend, University of Bonn, Germany
Andras Benczur, Eötvös Loránd University, Hungary
Maria Bielikova, Slovak University of Technology, Slovakia
Bostjan Brumen, University of Maribor, Slovenia
Alina Campan, Northern Kentucky University, USA
Albertas Caplinskas, Institute of Mathematics and Informatics, Lithuania
Sharma Chakravarthy, University of Texas at Arlington, USA
Alfredo Cuzzocrea, University of Calabria, Italy
Alin Deutsch, University of California San Diego, USA
Johann Eder, University of Klagenfurt, Austria
Janis Eiduks, Riga Technical University, Latvia
Johann Gamper, Free University of Bozen-Bolzano, Italy
Jarek Gryz, York University, Canada
Hele-Mai Haav, Tallinn Technical University, Estonia
Theo Härder, University of Kaiserslautern, Germany
Mirjana Ivanovic, University of Novi Sad, Serbia
Hannu Jaakkola, Tampere University of Technology, Finland
Manfred Jeusfeld, Tilburg University, The Netherlands
Leonid Kalinichenko, Russian Academy of Science, Russia
Ahto Kalja, Tallinn University of Technology, Estonia
Audris Kalnins, University of Latvia, Latvia
Mehmed Kantardzic, University of Louisville, USA
Marite Kirikova, Riga Technical University, Latvia
Margita Kon-Popovska, Cyril and Methodius University, FYROM
Sergei Kuznetsov, Institute of System Programming of Russian Academy of Science, Russia
Jens Lechtenbörger, University of Münster, Germany
Nikos Mamoulis, University of Hong Kong, China
Yannis Manolopoulos, Aristotle University of Thessaloniki, Greece
Rainer Manthey, University of Bonn, Germany
Joris Mihaeli, IBM Israel, Israel
Pavol Navrat, Slovak University of Technology, Slovakia
Igor Nekrestyanov, St. Petersburg State University, Russia
Mykola Nikitchenko, Kyiv National Taras Shevchenko University, Ukraine
Kjetil Norvag, Norwegian University of Science and Technology, Norway
Boris Novikov, St. Petersburg State University, Russia
Gultekin Özsoyoglu, Case Western Reserve University, USA
Tamer Özsu, University of Waterloo, Canada
Evi Pitoura, University of Ioannina, Greece
Jaroslav Pokorny, Charles University, Czech Republic
Boris Rachev, Technical University of Varna, Bulgaria
Peter Revesz, University of Nebraska, USA
Tore Risch, Uppsala University, Sweden
Stefano Rizzi, University of Bologna, Italy
Peter Scheuermann, Northwestern University, USA
Timos Sellis, National Technical University of Athens, Greece
Vaclav Snasel, Technical University of Ostrava, Czech Republic
Eva Soderstrom, University of Skövde, Sweden
Nicolas Spyratos, University of Paris South, France
Janis Stirna, Royal Institute of Technology, Sweden
Val Tannen, University of Pennsylvania, USA
Bernhard Thalheim, Christian Albrechts University Kiel, Germany
Juan-Carlos Trujillo Mondejar, University of Alicante, Spain
Maurice van Keulen, University of Twente, The Netherlands
Olegas Vasilecas, Vilnius Gediminas Technical University, Lithuania
Michael Vassilakopoulos, University of Central Greece, Greece
K. Vidyasankar, Memorial University, Canada
Gerhard Weikum, Max-Planck-Institut für Informatik, Germany
Marek Wojciechowski, Poznań University of Technology, Poland
Limsoon Wong, National University of Singapore, Singapore
Shuigeng Zhou, Fudan University, China
Local Organization
Chairman: Agris Nikitenko, Riga Technical University, Latvia
Dace Apshvalka, Riga Technical University, Latvia
Juris Borzovs, Latvian IT Cluster, Latvia
Janis Eiduks, Riga Technical University, Latvia
Marite Kirikova, Riga Technical University, Latvia
Lilita Sparane, Latvian IT Cluster, Latvia
Uldis Sukovskis, Riga Technical University, Latvia
Larisa Survilo, Riga Technical University, Latvia
External Reviewers
Dmytro Buy
Avram Eskenazi
Algirdas Laukaitis
Leonardo Ribeiro
Andreea Sabau
Sergejus Sosunovas
Traian Marius Truta
Hongmei Wang
ADBIS Steering Committee
Chairman: Leonid Kalinichenko, Russian Academy of Science, Russia
Andras Benczur, Hungary
Albertas Caplinskas, Lithuania
Johann Eder, Austria
Hele-Mai Haav, Estonia
Mirjana Ivanovic, Serbia
Marite Kirikova, Latvia
Mikhail Kogalovsky, Russia
Yannis Manolopoulos, Greece
Rainer Manthey, Germany
Manuk Manukyan, Armenia
Joris Mihaeli, Israel
Tadeusz Morzy, Poland
Pavol Navrat, Slovakia
Mykola Nikitchenko, Ukraine
Boris Novikov, Russia
Jaroslav Pokorny, Czech Republic
Boris Rachev, Bulgaria
Bernhard Thalheim, Germany
Tatjana Welzer, Slovenia
Viacheslav Wolfengagen, Russia
Ester Zumpano, Italy
Table of Contents
Invited Talk
Sausalito: An Application Server for RESTful Services in the Cloud (Matthias Brantner) 1

Business Processes
Versions to Address Business Process Flexibility Issue (Mohamed Amine Chaâbane, Eric Andonoff, Rafik Bouaziz, and Lotfi Bouzguenda) 2
A Rule-Based Modeling for the Description of Flexible and Self-healing Business Processes (Mohamed Boukhebouze, Youssef Amghar, Aïcha-Nabila Benharkat, and Zakaria Maamar) 15
Business Process Aware IS Change Management in SMEs (Janis Makna) 28

Design Issues
Performance Driven Database Design for Scalable Web Applications (Jozsef Patvarczki, Murali Mani, and Neil Heffernan) 43
Generic Entity Resolution in Relational Databases (Csaba István Sidló) 59
Tool Support for the Design and Management of Spatial Context Models (Nazario Cipriani, Matthias Wieland, Matthias Grossmann, and Daniela Nicklas) 74

Advanced Query Processing
Efficient Set Similarity Joins Using Min-prefixes (Leonardo A. Ribeiro and Theo Härder) 88
Probabilistic Granule-Based Inside and Nearest Neighbor Queries (Sergio Ilarri, Antonio Corral, Carlos Bobed, and Eduardo Mena) 103
Window Update Patterns in Stream Operators (Kostas Patroumpas and Timos Sellis) 118

Query Processing and Optimization
Systematic Exploration of Efficient Query Plans for Automated Database Restructuring (Maxim Kormilitsin, Rada Chirkova, Yahya Fathi, and Matthias Stallmann) 133
Using Structural Joins and Holistic Twig Joins for Native XML Query Optimization (Andreas M. Weiner and Theo Härder) 149
Approximate Rewriting of Queries Using Views (Foto Afrati, Manik Chandrachud, Rada Chirkova, and Prasenjit Mitra) 164

Query Languages
SQL Triggers Reacting on Time Events: An Extension Proposal (Andreas Behrend, Christian Dorau, and Rainer Manthey) 179
Pushing Predicates into Recursive SQL Common Table Expressions (Marta Burzańska, Krzysztof Stencel, and Piotr Wiśniewski) 194
On Containment of Conjunctive Queries with Negation (Victor Felea) 206

Indexing and Caching
Optimizing Maintenance of Constraint-Based Database Caches (Joachim Klein and Susanne Braun) 219
The Onion-Tree: Quick Indexing of Complex Data in the Main Memory (Caio César Mori Carélo, Ives Renê Venturini Pola, Ricardo Rodrigues Ciferri, Agma Juci Machado Traina, Caetano Traina-Jr., and Cristina Dutra de Aguiar Ciferri) 235

Data Integration
Cost-Based Vectorization of Instance-Based Integration Processes (Matthias Boehm, Dirk Habich, Steffen Preissler, Wolfgang Lehner, and Uwe Wloka) 253
Empowering Provenance in Data Integration (Haridimos Kondylakis, Martin Doerr, and Dimitris Plexousakis) 270

Applications
Detecting Moving Objects in Noisy Radar Data Using a Relational Database (Andreas Behrend, Rainer Manthey, Gereon Schüller, and Monika Wieneke) 286
Study of Dependencies in Executions of E-Contract Activities (K. Vidyasankar, P. Radha Krishna, and Kamalakar Karlapalem) 301
Object Tag Architecture for Innovative Intelligent Transportation Systems (Krishan Sabaragamu Koralalage and Noriaki Yoshiura) 314

Potpourri
Conceptual Universal Database Language: Moving Up the Database Design Levels (Nikitas N. Karanikolas and Michael Gr. Vassilakopoulos) 330
Temporal Data Classification Using Linear Classifiers (Peter Revesz and Thomas Triplet) 347
SPAX – PAX with Super-Pages (Daniel Bößwetter) 362

Author Index 379
Sausalito: An Application Server for RESTful Services in the Cloud Matthias Brantner 28msec GmbH Zurich, Switzerland
[email protected]
This talk argues that Web Server, Application Server, and Database System should be bundled into a single system for the development and deployment of Web-based applications in the cloud. Furthermore, this talk argues that the whole system should serve REST services and should behave like a REST service itself. The design and implementation of Sausalito, a combined Web, Application, and Database server that operates on top of Amazon's cloud offerings, is presented. Furthermore, a demo of several example applications is given that shows the advantages of the approach taken by Sausalito (see http://sausalito.28msec.com/).
Versions to Address Business Process Flexibility Issue Mohamed Amine Chaâbane1, Eric Andonoff2, Rafik Bouaziz1, and Lotfi Bouzguenda1 1
MIRACL/ISIMS, Route de l’aéroport, BP 1088, 3018 Sfax, Tunisia {MA.Chaabane,Raf.Bouaziz}@fsegs.rnu.tn,
[email protected] 2 IRIT/UT1, 2 rue du Doyen Gabriel Marty, 31042 Toulouse Cedex, France
[email protected]
Abstract. This paper contributes to addressing an important issue in business process management: the Business Process (BP) flexibility issue. First, it defends that versions are an interesting solution to deal with both a priori (when designing BPs) and a posteriori (when executing BPs) flexibility. It also explains why previous contributions about versions of BPs are incomplete and need to be revisited. Then, the paper presents a meta-model for BP versions, which combines five perspectives (the functional, process, informational, organizational and operation perspectives) for BP modelling, and which allows a comprehensive description of versioned BPs. Keywords: Business Processes, Flexibility, Versions.
1 Introduction The importance of Business Processes (BPs) in enterprises and organizations is widely recognized, and BPs are nowadays considered as first-class entities both when designing and implementing Information Systems [1,2]. In recent years, important advances have been made in the Business Process area, and several systems, ranging from groupware systems to (service-oriented) workflow management systems, are now available for the design and execution of BPs. However, the effectiveness of BPs in Information Systems is not yet achieved, and several challenging issues are still to be addressed. One of the most important is BP flexibility [3]. Indeed, the economic competition in which enterprises and organizations are involved nowadays leads them to often change and adapt their BPs to meet, as quickly and effectively as possible, new organizational, operational or customer requirements. So, researchers in the BP area are widely interested in BP flexibility, and tutorials and tracks of several conferences and workshops are dedicated to this topic [4,5]. Literature provides several definitions of BP flexibility. For instance, in [6], flexibility is defined as the ability to deal with both foreseen and unforeseen changes in the environment in which business processes operate. In [7], flexibility is defined as the capacity of making a compromise between, first, satisfying rapidly and easily the business requirements in terms of ability when organizational, functional and/or operational changes occur, and second, keeping effectiveness. So far, despite the efforts of the BP community, there is not yet an agreement on BP flexibility. However, two main classifications were
proposed recently in order to highlight this notion, to assess proposed solutions and to show the way to effectively address this issue [6,7]. [6] provides a comprehensive overview of implemented solutions for BP flexibility: several systems, mainly workflow management systems, are compared according to criteria which define a taxonomy of flexibility. In [7], a state of the art for modelling BPs is given and several flexibility criteria are defined for comparing the used modelling approaches. However, even if these classifications are different from one another, they both study BP flexibility with respect to the BP lifecycle, and identify two main times of flexibility: an a priori flexibility (when designing BPs) and an a posteriori flexibility (when executing BPs). In addition to these classifications, [6] and [7] also indicate that an interesting direction for dealing with BP flexibility is to consider a declarative approach. Some contributions constitute steps in this direction [8,9,10]: some adopt a rule-based approach while others advocate a context-aware-based one. However, the models used for designing and specifying BPs in the main (service-oriented) BP management systems are activity-oriented models [6]. Consequently, the BP community has to provide solutions to deal with activity-oriented BP flexibility. In this paper, we defend versioning as an interesting solution to deal with (activity-oriented) BP flexibility. More precisely, versions of BPs are useful to deal with some cases of both a priori and a posteriori flexibility. Versioning is used in several fields of computer science in which the need to describe the evolution of entities over time is highlighted. Thus, versions are used in databases [11], in software engineering to handle software configurations [12], and also in conceptual models such as the Entity Relationship model for instance [13]. Some efforts have also been put on version management in the BP context, and partial solutions to BP version modeling and instance adaptation and migration are proposed in the literature. These solutions have in common the adoption of an activity-oriented approach to design BPs. Proposed solutions define a set of operations supporting both BP schema change, and adaptation and migration of their corresponding instances [14,15]. ADEPT2 [16] is probably the most successful Workflow Management System (WfMS) supporting instance adaptation and migration. Regarding versions of BPs, we distinguish two main contributions. [17] has proposed to deal with dynamic business process evolution, i.e. modification of business process schemas in the presence of active business process instances, by introducing versions of BP schemas. This work has defined a set of operations for BP schema modification and, if possible, a strategy for migration of BP instances. Recently, [18] has also defended the advantages of a version-based approach to face business process evolution. More precisely, this work proposes to model versions of BP process schemas using graphs. It also presents a set of operations enabling updates of graphs and defines two strategies to extract versions of BP process schemas from these graphs. We believe that these two propositions need to be revisited. Indeed, both [17] and [18] addressed the issue of BP versioning only considering the functional and process perspectives of business processes. These two perspectives describe the activities involved in the process and their coordination.
But, using only these perspectives is not enough to obtain a comprehensive description of BPs [19]. At least three other perspectives have to be considered: the organizational, the informational and the application perspectives [20]. The organizational perspective structures the business process
actors and authorizes them, through the notion of role, to perform tasks making up the process. The informational perspective defines the structure of the documents and data required and produced by the process. The application perspective describes elementary operations performed by actors involved in the process. The contribution of this paper is twofold. First, it discusses the relevance of versioning to deal with BP flexibility. Second, it introduces an activity-oriented meta-model to describe versions of BPs. This meta-model uses five perspectives to model business processes (functional, process, informational, organizational and operation perspectives) and provides a versioning kit in order to handle versions of elements belonging to these five perspectives. The remainder of this paper is organized as follows. Section 2 discusses the relevance of versions to deal with BP flexibility. Section 3 introduces the Business Process (BP) meta-model we use for designing BPs, while section 4 presents the Versioned Business Process (VBP) meta-model we propose for business process versioning. More precisely, this section presents the versioning kit we provide for handling versions of business processes, and explains how the kit is merged with the BP meta-model to define the VBP meta-model. Finally, section 5 concludes the paper.
2 Are Versions a Help to Flexibility? This question deserves to be discussed. Consequently, this section introduces the notion of version of business process, and also indicates in which cases of business process flexibility versions are useful. 2.1 Version of Business Processes A real-world entity has characteristics that may evolve during its lifecycle: it has different successive states. A version corresponds to one of the significant entity states. So, it is possible to manage several entity states (neither only the last one nor all the states). The entity versions are linked by a derivation link; they form a version derivation hierarchy. When created, an entity is described by only one version. The definition of every new entity version is done by derivation from a previous one. Such versions are called derived versions. Several versions may be derived from the same previous one. They are called alternative versions. A version is either frozen or working. A frozen version describes a significant and final state of an entity. A frozen version may be deleted but not updated. To describe a new state of this entity, we have to derive a new version (from the frozen one). A working version is a version that temporarily describes one of the entity states. It may be deleted and updated to describe a next entity state. The previous state is lost to the benefit of the next one. As illustrated in figure 1, it is possible to manage versions both at the schema and the instance levels. However, in the BP context, it is only interesting to consider versions at the schema level. Moreover, the notion of version must be applied to all the concepts defined at the schema level. In this paper, we consider the five perspectives of BPs. In the propositions of the state of the art [16,17,18], only two perspectives are addressed.
Fig. 1. Versions to describe Entity Evolution
Finally, it is useless to handle versions of BP instances (cases). However, instance adaptation and migration have to be considered since, as discussed in [14,15], it is important to have instances of BPs consistent with their latest schema. This issue is not addressed in this paper and will be approached later. However, we can note that managing versions permits to get around it. Indeed, versions permit different instances of the same BP to have different schemas. Consequently, instance adaptation and migration are not required. Moreover, as indicated in [14], this adaptation and migration is not always easy and is sometimes impossible. It means that versions are necessary to face instance adaptation and migration. 2.2 Versions and Business Process Flexibility This section gives an answer to the following question: in which cases of business process flexibility are versions useful? The classifications provided in [6] and [7], which seem to be the main classifications in the literature, are used to answer this question. In [6], a simple taxonomy of business process flexibility is introduced and used to analyze the flexibility degree of BPs in some implemented solutions. This taxonomy identifies four types of flexibility for business processes:
• Flexibility by design, for handling foreseen changes in BPs where strategies can be defined at design-time to face these changes.
• Flexibility by deviation, for handling occasional unforeseen changes where the differences with the initial business process are minimal.
• Flexibility by under-specification, for handling foreseen changes in BPs where strategies cannot be defined at design-time but rather are defined at run-time.
• Flexibility by change, for handling unforeseen changes in BPs, which require occasional or permanent modifications of BP schemas.
In our opinion, versions are a help to flexibility by design, by under-specification and by change. Regarding flexibility by design, it is quite possible to define, using alternative versions, each possible execution path of the considered business process. Regarding flexibility by under-specification, [6] identifies two ways of realizing it: late binding and late modelling. Versions are a possible solution for implementing late binding: the different possible realizations of a business process can be modelled as alternative versions of this business process and, as suggested in [21], a rule-based system could be used to select one of the modelled alternative versions at run-time. However, a rule-based system is a technical solution for the dynamic selection of an
alternative version, and we believe that a conceptual solution, introducing an intentional perspective in modelled BPs (as in [10]), could be richer to deal with this problem. We have planned to investigate this soon. Finally, regarding flexibility by change, it is obvious that versions are a possible solution to realize evolutionary change at both the instance and the schema level. [7] provides a more complex taxonomy for BP flexibility than [6]. Figure 2 below recaps its main properties and techniques.
Fig. 2. Taxonomy of Business Process Flexibility in [7]
First, this taxonomy is discussed according to: (i) the kind of models used to design business processes (activity-oriented, product-oriented, decision-oriented and conversation models) and the perspectives that these models consider (functional, process, organizational, informational, operational and intentional perspectives), and (ii) the kind of business processes, which can have a more or less well-defined structure (production, administrative, collaborative or ad-hoc workflows/business processes). The provided taxonomy puts forward several properties: nature of the flexibility (a priori, by selection, or a posteriori, by adaptation), nature of impact (local or global), and nature of change (ad hoc, corrective, evolutionary). The taxonomy also puts forward some techniques to handle flexibility: evolution (ad hoc, derivation, inheritance, induction, reflexion, rule-based), migration (cancellation, with propagation, without propagation) and flexibility (late binding, late modelling, case handling) techniques. According to this taxonomy, we defend the idea that versions are useful to deal with cases of both a priori and a posteriori flexibility. Regarding a priori flexibility, versions are a way to define and implement the late binding flexibility technique, using alternative versions. Regarding a posteriori flexibility, it is also possible to model, using a set of alternative versions, a set of possible executions that could be modelled using
generic models (as illustrated in [22], where genericity is implemented using an inheritance relationship). Finally, versions are obviously useful to support evolutionary changes and, of course, permit handling instances easily since migration with propagation is not mandatory.
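As a rough sketch of the late-binding idea mentioned above (selecting one alternative version at run-time through rules), the fragment below is our own illustration under the assumption of a very simple guard-based selector; it is neither the mechanism of [21] nor part of the authors' proposal, and all identifiers are invented:

def select_version(alternatives, context, selection_rules):
    # selection_rules is an ordered list of (guard predicate, version identifier) pairs;
    # the first guard that matches the execution context picks the version to enact.
    for guard, version in selection_rules:
        if version in alternatives and guard(context):
            return version
    return alternatives[0]   # fall back to a default alternative

selection_rules = [
    (lambda ctx: ctx.get("peak_load", False), "V2"),     # hypothetical alternative for peak load
    (lambda ctx: ctx.get("manual_mode", False), "V3"),   # hypothetical manual fallback
]
chosen = select_version(["V1", "V2", "V3"], {"peak_load": True}, selection_rules)   # -> "V2"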
3 Modelling Business Processes After highlighting the relevance of versions for business process flexibility, we introduce the BP meta-model we propose for business process modelling. This meta-model supports the design of BPs combining the five perspectives listed before: the functional, process, informational, organizational and operational perspectives. As defended in [25,26], these perspectives are relevant for BP modelling and execution. Another important requirement for such a meta-model is its simplicity and efficiency: it must be comprehensive and must define the core (basic) concepts of the five complementary perspectives of BPs. But does such a meta-model for business process modelling (i.e. meeting the previous requirements) already exist, or do we have to define a new one by ourselves? Despite the standardization efforts of the Workflow Management Coalition (WfMC), different workflow or business process meta-models exist in the literature. The used vocabulary differs from one model to another, and yet, so far, the workflow and business process community seems not to have reached an agreement on which model to adopt, even if XPDL, BPMN and BPEL are standards recommended by the WfMC. Consequently, we have defined our own meta-model, which fulfils the previous requirements: (i) a comprehensive meta-model considering five complementary aspects of business processes and (ii) a BP meta-model defining the core concepts of these complementary BP perspectives. This meta-model is shown in the UML diagram of figure 3. A Process performs activities, which can be atomic or composite. Only the first of these activities is explicitly indicated in the meta-model. If an activity is composite, the Composed_of relationship gives its component activities, which are coordinated by a control pattern. In our meta-model, and as for instance in [23], the main control patterns described in the literature are provided. Some of them are conditional (e.g. if, while…), while others are not (e.g. sequence, fork…). Their semantics are the following:
• Sequence pattern: it allows the execution of processes in a sequential order.
• If pattern: it allows process execution according to a condition.
• Fork pattern: it spawns the parallel execution of processes and waits for the first to finish.
• Join pattern: it spawns the parallel execution of processes but waits for all of them before completing.
• While and Repeat patterns: they cyclically execute a process while or until a condition is achieved.
Our meta-model only includes low-level (basic) control patterns; the high-level workflow patterns of [24] are not considered here (they are much more complex than what we need). In this way, the meta-model we propose could be seen as a minimal BP meta-model gathering the core concepts of BPs.
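As an illustration only (our sketch, not the authors' implementation; class and attribute names are ours), the composition of activities through these basic control patterns could be encoded along the following lines:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Activity:
    name: str

@dataclass
class AtomicActivity(Activity):
    pass

@dataclass
class CompositeActivity(Activity):
    # pattern is one of: sequence, if, fork, join, while, repeat
    pattern: str = "sequence"
    components: List[Activity] = field(default_factory=list)
    condition: Optional[str] = None   # only meaningful for conditional patterns (if, while, repeat)

# A process whose first activity is a sequence of two atomic activities
handle_order = CompositeActivity(
    name="handle order",
    pattern="sequence",
    components=[AtomicActivity("receive order"), AtomicActivity("send bill")],
)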
Fig. 3. The Business Process Meta-model
An Atomic activity can have a pre-condition (or start condition) and post-conditions. It executes one or several Operations, and is performed by a Role, which is played by several Actors in some Organizational units (of the organizational perspective). An actor can be (i) human or not human (i.e. software or machine) and (ii) internal or external. Moreover, an atomic activity consumes and/or produces Informational resources (of the informational perspective). An informational resource is a system data, an application data (i.e. data repository or database), or a process data (i.e. form, document, data). The different perspectives of BPs are visualized in figure 3. The functional perspective describes the activities to perform during process execution. Besides, it specifies how a composite activity is decomposed into atomic or composite activities. In the process (or control flow) perspective, execution conditions (pre-conditions and post-conditions) and the coordination between activities (control patterns) are specified. Generally, the functional perspective and the process perspective are given by the process definition. The operational (or application) perspective defines the elementary operations performed in atomic activities. Typically, these operations are used to create, read or modify control and production data, and are often executed using external applications. The organizational (or resource) perspective describes relationships between roles, groups and actors, giving the latter authorizations to perform atomic activities. Finally, the informational (or data) perspective deals with the production and use of information. We can note that these perspectives have classes in common; for instance, the Atomic activity class belongs to both the process and the functional perspectives.
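To make the way an atomic activity ties the perspectives together more tangible, here is a compact, standalone sketch of ours (class names follow figure 3, attribute names are our own choice, not the paper's):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Role:                         # organizational perspective
    name: str

@dataclass
class Operation:                    # operational perspective
    name: str

@dataclass
class InformationalResource:        # informational perspective (form, document, data, ...)
    name: str

@dataclass
class AtomicActivity:               # functional and process perspectives
    name: str
    performed_by: Role
    executes: List[Operation] = field(default_factory=list)
    consumes: List[InformationalResource] = field(default_factory=list)
    produces: List[InformationalResource] = field(default_factory=list)
    pre_condition: Optional[str] = None
    post_condition: Optional[str] = None

billing = AtomicActivity(
    name="send bill",
    performed_by=Role("accountant"),
    consumes=[InformationalResource("purchase order")],
    produces=[InformationalResource("bill")],
)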
4 Modeling Versions of Business Processes This section presents the versioning kit we use to handle BP versions. It explains how the BP meta-model is merged with the versioning kit in order to obtain the Versioned Business Process (VBP) meta-model.
4.1 Versioning Kit The underlying idea of our proposition is to model, for each versionable class of the BP meta-model, both entities and their corresponding versions. According to [11], a versionable class is a class for which we would like to handle versions. Thus, we have defined a versioning kit to make classes versionable. This kit, visualized in figure 4, is composed of two classes, five properties and two relationships. Each versionable class is described as a class, called Versionable. Moreover, we associate to each versionable class a new class, called Version_of Versionable, whose instances are versions of Versionable, and two new relationships: (i) the Is_version_of relationship, which links a versionable class with its corresponding version of… class; and (ii) the Derived_from relationship, which describes version derivation hierarchies. This latter relationship is reflexive and the semantics of both relationship sides is: (i) a version (DV) succeeds another one in the derivation hierarchy and, (ii) a version (SV) precedes another one in the derivation hierarchy. Regarding properties, we introduce classical properties for versions [11] such as version number, creator name, creation date and status in the Version_of class.
Fig. 4. The Versioning Kit
Thus, using this kit, it is possible to describe both entities and their corresponding versions. The creation of versions is managed as follows: (i) a couple (version, entity) is obviously created when the first version of an entity is created; and (ii) new versions can be created by derivation of an existing version, giving rise to derived or alternative versions. 4.2 Merging the Versioning Kit with the Business Process Meta-model We use this versioning kit to make some classes of the BP meta-model versionable. Figure 5 below presents the newly obtained meta-model in terms of classes and relationships.
Fig. 5. The Versioned BP Meta-model
Regarding the process and functional perspectives, we think that it is necessary to keep versions for only two classes: the Process and the Atomic activity classes. It is indeed interesting to keep a history of changes for both processes and atomic activities, since these changes correspond to changes in the way that business is carried out. More precisely, at the process level, versions are useful to describe the possible strategies for organizing activities while, at the activity level, versions of atomic
activities describe evolution in activity execution. We defend the idea that versioning of processes and atomic activities is enough to help organizations to face the fast-changing environment in which they are involved nowadays. Regarding the other perspectives, it is necessary to handle versions for the Operation class of the operational perspective, for the Informational resource class of the informational perspective, and for the Role and Organizational Unit classes of the organizational perspective. When merging the versioning kit with the BP meta-model, we need to decompose the Start_with relationship into two relationships: Start_with_CA and Start_with_VAA. We distinguish these two relationships because it is impossible, with only one, to describe both versions of BPs starting with a composite activity and versions of BPs starting with a version of an atomic activity. In the same way, the Composed_of relationship is decomposed into two new relationships: Composed_of_CA to model composite activities composed of composite activities, and Composed_of_VAA to model composite activities composed of versions of atomic activities. 4.3 Illustrative Example In order to illustrate the VBP meta-model instantiation, we propose to use the example introduced by [18]. This example describes a Production BP and involves a factory, which owns one production pipeline following the BP shown in figure 6(a). It includes several activities: production scheduling, production using a work centre, quality checking and packaging. In order to increase its productivity, the factory decides to add a new work centre. The business process is then updated as shown in figure 6(b). If one of the two work centres, for instance work centre#1 (Pc#1), has a technical problem and consequently is removed from the process, two solutions are proposed to attempt to keep the production output: fixing unqualified products or using employees for manual production. The BP is then updated as shown in figures 6(c) and 6(d).
Fig. 6. Change in the Production BP (legend: Em: Enterprise manager; Pc: Production work centre; Pac: Packaging work centre; Ma: Machine; Ss: Scheduling service; Ms: Maintenance service; Qs: Quality service; Cof: Customer order form; Po: Production order form; E-Co: Electronic customer order form)
This example, illustrated in figure 6, shows four versions of the same Production BP. These four versions correspond to the VP1, VP2, VP3 and VP4 versions of figure 7. These four versions are modelled as instances of the VBP meta-model. They differ from one another in their component activities and the way these activities are coordinated. In this way, we have defined two versions of the atomic activity Schedule production. The first one (VAA11) only participates in VP1; it is performed by the role Enterprise manager (Em) and consumes a Customer order form (Cof). The second one (VAA12) is referenced by the other versions of the BP; it is performed by a new role, Scheduling service (Ss), only when either the Customer order form (Cof) or the Electronic customer order form (E-Co) is consumed. These two versions
Fig. 7. Instantiation of the VBP Meta-Model
produce the same document, Production order (Po). Furthermore, this example shows two versions of the atomic activity Produce. These versions consume the same document, Production order (Po). The first one (VAA21) is performed by the roles Machine and Production work centre, while the second one (VAA22), which corresponds to a manual production, is only performed by the role Production work centre (Pc). Besides, figure 7 includes two versions of the atomic activity Quality checking. The first one (VAA31) is performed by the role Enterprise manager, while the second one (VAA32) is executed by the role Quality service. Finally, there is only one version for each of the atomic activities Packaging (VAA4) and Fix unqualified products (VAA5). Because of space limitations and for clarity reasons, we only visualize in figure 7 the instantiation of classes belonging to the process and functional perspectives (i.e. Process, Version of Process, Atomic activity, Version of Atomic activity, Composite activity and Non conditional control pattern). Finally, this example illustrates how versions permit dealing with flexibility by change.
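Reusing the versioning-kit sketch given after figure 4 (ours; the identifiers VP11 to VP14 and VAA11/VAA12 come from figure 7, but the API and the way it is populated do not come from the paper), the derivation hierarchies of this example could be instantiated as follows:

production = Versionable("Production")                               # process P
vp11 = VersionOf(production, creator="designer")                     # schema of Fig. 6(a)
vp12 = VersionOf(production, creator="designer", derived_from=vp11)  # second work centre, Fig. 6(b)
vp13 = VersionOf(production, creator="designer", derived_from=vp12)  # fix unqualified products, Fig. 6(c)
vp14 = VersionOf(production, creator="designer", derived_from=vp13)  # manual production, Fig. 6(d)

schedule = Versionable("Schedule production")                        # atomic activity AA1
vaa11 = VersionOf(schedule, creator="designer")                      # performed by Em, consumes Cof (VP11 only)
vaa12 = VersionOf(schedule, creator="designer", derived_from=vaa11)  # performed by Ss, consumes Cof or E-Co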
5 Conclusion This paper has defended that versioning is an interesting solution to deal with (activity-oriented) business process flexibility. More precisely, it has first identified in which cases of both a priori and a posteriori flexibility versions are useful, according to the two main typologies provided by the literature. For instance, according to the classification of [6], versions are a means to deal with flexibility by design, flexibility by under-specification and flexibility by change. The paper has then explained why the proposed solutions of the literature need to be revisited and, according to the specified requirements (i.e. considering more than the process and functional perspectives for versioning business processes), it has introduced the VBP meta-model. The advantages of our proposition are the following:
• It provides a comprehensive modelling of business processes considering five perspectives of business processes: the functional, process, informational, organizational and operation perspectives.
• The VBP meta-model is simple: it only integrates core concepts for both business process modelling and business process versioning (our versioning kit is very simple).
Because of space limitations, we have not reported in this paper several contributions related to the handling of flexible business processes using versions. More precisely, we have defined a taxonomy of operations for business process versions [25], along with a language implementing these operations. We have also given rules and algorithms to visualize and formalize instances of the VBP meta-model using a Petri net-based formalism, namely Petri Nets with Objects (PNO). We are currently implementing the VBP meta-model, its related language and a PNO representation of its instances. Finally, to achieve this work, we have planned to investigate another perspective of business process modelling: the intentional perspective. Our objective is to give information about why a BP version is defined, in order to use it appropriately. This objective is somewhat related to the notion of context introduced in [10] and [21].
By introducing the intentional dimension of business processes, we believe that we will have fully dealt with business process versioning.
References 1. Smith, H., Fingar, P.: Business Process Management: the Third Wave. Megan-Kiffer Press (2003) 2. van der Aalst, W.M.P., ter Hofstede, A.H.M., Weske, M.: Business Process Management: A Survey. In: van der Aalst, W.M.P., ter Hofstede, A.H.M., Weske, M. (eds.) BPM 2003. LNCS, vol. 2678, pp. 1–12. Springer, Heidelberg (2003) 3. Reijers, H.: Workflow Flexibility: the Forlon Promise. In: Int. Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, Manchester, United Kingdom, June 2006, pp. 271–272 (2006) 4. Sadiq, S., Weber, B., Reichert, M.: Beyond Rigidity: Lifecycle Management for Dynamic Processes. Tutorial at Int. Conference on Business Process Management, Brisbane, Australia (September 2007) 5. Nurcan, S., Schmidt, R., Soffer, P.: Int. Workshop on Business Process Management, Design and Support, at Int. Conference on Advanced Information Systems, Montpellier, France (June 2008) 6. Schoneneberg, H., Mans, R., Russell, N., Mulyar, N., van der Aalst, W.: Process Flexibility: A Survey of Contemporary Approaches. In: Int. Workshop on CIAO/EOMAS, at Int. Conference on Advanced Information Systems, Montpellier, France, June 2008, pp. 16–30 (2008) 7. Nurcan, S.: A Survey on the Flexibility Requirements related to Business Process and Modeling Artifacts. In: Hawaii International Conference on System Sciences, Waikoloa, Big Island, Hawaii, USA, January 2008, p. 378 (2008) 8. Lezoche, M., Missikof, M., Tininii, L.: Business Process Evolution: a Rule-based Approach. In: Int. Workshop on Business Process Management, Design and Support, at Int. Conference on Advanced Information Systems, Montpellier, France (June 2008) 9. Pesic, M., van der Aalst, W.: A Declarative Approach for Flexible Business Processes. In: Int. Workshop on Dynamic Process Management, at Int Conference on Business Process Management, Vienna, Austria, September 2006, pp. 169–180 (2006) 10. Bessai, K., Claudepierre, B., Saidani, O., Nurcan, S.: Context Aware Business Process Evaluation and Redesign. In: Int. Workshop on Business Process Management, Design and Support, at Int. Conference on Advanced Information Systems, Montpellier, France (June 2008) 11. Sciore, E.: Versioning and Configuration Management in Object-Oriented Databases. Int. Journal on Very Large Databases 3(1), 77–106 (1994) 12. Kimball, J., Larson, A.: Epochs: Configuration Schema, and Version Cursors in the KBSA Framework CCM Model. In: Int. Workshop on Software Configuration Management, Trondheim, Norway, June 1991, pp. 33–42 (1991) 13. Roddick, J., Craske, N., Richards, T.: A Taxonomy for Schema Versioning based on the Relational and Entity Relationship Models. In: Int. Conference on the Entity Relationship Approach, Arlington, Texas, USA, December 1993, pp. 137–148 (1993) 14. Casati, F., Ceri, S., Pernici, B., Pozzi, G.: Workflow Evolution. In: Int. Conference on the Entity Relationship Approach, Cottbus, Germany, October 1996, pp. 438–455 (1996)
15. Kammer, P., Bolcer, G., Taylor, R., Bergman, M.: Techniques for supporting Dynamic and Adaptive Workflow. Int. Journal on Computer Supported Cooperative Work 9(3-4), 269–292 (1999) 16. Reichert, M., Rinderle, S., Kreher, U., Dadam, P.: Adaptive Process Management with ADEPT2. In: Int. Conference on Data Engineering, Tokyo, Japan, April 2005, pp. 1113–1114 (2005) 17. Kradofler, M., Geppert, A.: Dynamic Workflow Schema Evolution based on Workflow Type Versioning and Workflow Migration. In: Int. Conference on Cooperative Information Systems, Edinburgh, Scotland, September 1999, pp. 104–114 (1999) 18. Zhao, X., Liu, C.: Version Management in the Business Change Context. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 198–213. Springer, Heidelberg (2007) 19. Jablonsky, S., Bussler, C.: Workflow management. Modeling Concepts, Architecture and Implementation. Thomson Computer Press (1996) 20. van der Aalst, W.M.P.: Business Process Management Demystified: A Tutorial on Models, Systems and Standards for Workflow Management. In: Desel, J., Reisig, W., Rozenberg, G. (eds.) Lectures on Concurrency and Petri Nets. LNCS, vol. 3098, pp. 1–65. Springer, Heidelberg (2004) 21. Adams, M., ter Hofstede, A., Edmond, D., van der Aalst, W.: Worklets: A ServiceOriented Implementation of Dynamic Flexibility in Workflows. In: Int. Conference on Cooperative Information Systems, Montpellier, France, November 2006, pp. 291–306 (2006) 22. van der Aalst, W.: How to handle Dynamic Change and Capture Management Information: an Approach based on Generic Workflow Models. Int. Journal on Computer Science, Science and Engineering 16(5), 295–318 (2001) 23. Manolescu, D.A.: Micro-Workflow: A Workflow Architecture Supporting Compositional Object-Oriented Development. PhD Thesis, University of Illinois (2001) 24. van der Aalst, W., ter Hofstede, A., Kiepuszewski, B., Barros, A.: Workflow Patterns. Int. Journal on Distributed and Parallel Databases 14(1), 5–51 (2003) 25. Chaâbane, M.A., Bouzguenda, L., Bouaziz, R., Andonoff, E.: Dealing with Business Process Evolution using Versions. In: Int. Conference on E-Business, Porto, Portugal, July 2008, pp. 267–278 (2008)
A Rule-Based Modeling for the Description of Flexible and Self-healing Business Processes Mohamed Boukhebouze1, Youssef Amghar1, Aïcha-Nabila Benharkat1, and Zakaria Maamar2 1
Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621, France {mohamed.boukhebouze,youssef.amghar, nabila.benharkat}@insa-lyon.fr 2 CIT, Zayed University, Dubai, UAE
[email protected]
Abstract. In this paper we discuss the importance of ensuring that business processes are robust and agile at the same time. To this end, we consider reviewing the way business processes are managed. For instance, we consider offering a flexible way to model processes so that changes in regulations are handled through some self-healing mechanisms. These changes may raise exceptions at run-time if not properly reflected on these processes. To this end we propose a new rule-based model that adopts the ECA rules and is built upon formal tools. The business logic of a process can be summarized with a set of rules that implement an organization's policies. Each business rule is formalized using our ECAPE formalism (Event-Condition-Action-Post condition-post Event). This formalism allows translating a process into a graph of rules that is analyzed in terms of reliability and flexibility. Keywords: Business process modeling, business rules, flexible modeling, change impact and self-healing of business processes.
1 Introduction The dynamic environment of organizations makes process elements subject to frequent change. The origin of change comes mainly from frequent changes in, first, regulations that organizations have to comply with and, second, internal policies that organizations themselves develop [1]. These regulations and policies are often expressed in terms of business rules, which are sometimes defined as high-level structured statements that constrain, control, and influence the business logic [2]. Business rules should be formalized to facilitate their use. Unfortunately, using imperative languages such as BPEL [3], designers implement business rules based on decisions (what process branch must be chosen) that are defined using connectors (e.g., sequence, parallel split, exclusive choice). In this way, designers use the results of the decisions to determine the process behavior rather than model these decisions. This makes business processes rigid. To formalize the business rules in a rigorous, concise and precise way, a rule-based approach proposes to model the logic of the process with a set of business rules using
16
M. Boukhebouze et al.
declarative languages. This will allow deploying partially-specified process definitions (using rules) [4]. In addition, the changes (in process logic, in business regulations or in business policies) are realized by changing subset of rules (e.g., modify, insert and delete existing rules) which express the changed process logic, the changed business regulations or the changed business policies. As a result, the modification of a rule impacts only a subset of rules that are related to the changed rule, which would lead to a reduction of the efforts to put into this change management. However, in a complex processes, it is important to manage the impact of a rule change on the rest of the process by determining which rules are impacted by this change. In addition, these changes may raise exceptions at run-time if not properly reflected on these processes. For this reason, we present, in this paper, a new rule based model that aims at improving the management of business processes in terms of flexibility and verification. By flexibility we mean how to implement changes in some parts of a business process without affecting the rest of parts neither the continuity and stability of these parts [5]. And by self-healing we mean the ability to detect and isolate the failed component, fix or replace the component, and finally, reintroduce the repaired or replaced component without any apparent application disruption [6]. The new proposed model extends the ECA rules and is built upon formal tools. Each business rule is formalized using our ECAPE formalism (Event-Condition-Action-Post condition- Event triggered). The great advantage of this formalism is that processes can be easily translated into a graph of rules. Analyzing this graph guarantees the modeling flexibility by studying the relationships between the rules and the self-healing execution of a business process by identifying, in the modeling phase, any risk of exceptions (verification step) and managing these exceptions, in the execution phase, in order to ensure a proper functioning of a process (exceptions handling step). The rest of this paper is organized as follows. We introduce in section 2 the new model. In section 3, we outline the flexibility process modeling management. In section 4, we explain how process can be self-healed. We wrap up the paper with a related work, conclusion and some directions for future works.
2 Rule Based Modeling of Business Process 2.1 Definition The objective of a rule based model is to describe business processes as a set of rules. Consequently, the sequences of these rules define the behavior of a process. According to Giurca et al. in [7], it is advantageous to use reaction rules (ECA formalism) for specifying business processes. Giurca et al. justify this by the fact that this kind of rules gives a flexible way to specify process control flow using events. ECA rules cover several kinds of business rules particularly integrity rules and derivation rules. However, we need to a new type of ECA formalism that help the management of the change impact of rules and the automatic building of an execution scenario of a process to ensure the proper functioning. For this reason, we propose the formalism ECAPE as follows:
Description of Flexible and Self-healing Business Processes ON IF DO Check Trigger
17
<Event>
<post Event>
The semantics attached to an ECAPE rule is: The event determines when a rule must be evaluated (or activated) the condition is a predicate on which depends the execution of action (it can be seen as a refinement of the event); the action specifies the code to execute if condition is true; the post condition is a predicate on which depends the validation of the rule (the rule is validated only if the post condition is true) and the events triggered (post events) design the set of events raised after the execution of the action. Note that, if the post condition does not hold, a compensation mechanism is launched in order to try, if it is possible, to compensate the executed action part effects. But the compensation mechanism is not scope of this paper. The sequence of the ECAPE rules defines the behavior of a process. Indeed, each rule may activate one or more rules. The originality of this formalism is the fact that the set of events triggered after the execution of the rule’s action, is explicitly described. As a result, a rule sequence can be automatically deducted. 2.2 Illustrative Example In this section we introduce the example of purchase order process to illustrate the RbBPDL language. Upon receipt of customer order, the calculation of the initial price of the order and shipper selection are done simultaneously. When both tasks are complete, a purchase order is sent to the costumer. In case of acceptance, a bill is sent back to the customer. Finally, the bill is registered. Two constraints exist in this scenario: customer record must exist in the company database, and bill payment must be done 15 days before delivery date.
Fig. 1. ECAPE rules set of the purchase order process
18
M. Boukhebouze et al.
Figure 1 represents the ECAPE rules set of the purchase order process. Indeed, in our new business process model, a process is seen as a set of decisions and policies. These decisions and policies are defined by a set of business rules. For example, rule R1 expresses the policy of requesting an order. This rule is activated by “begin process” event that represents customer order (it may be, for example, clicking on the button "Place an order"). The execution of the activity “RequestOrder” triggers the “Send message” event. This latter will activate rule R2. In turn, rule R2 expresses the policy of receiving an order. Indeed, during “Receive Order” event occurrence the rule is triggered and the action’s instruction <Execute> is executed. This instruction specifies that a given business activity must be performed (“CostumerCheck” in our example). The execution of this instruction triggers the event “CostumerCheck Executed”. This latter activates three rules R3 (policy of initial price calculation), R4 (policy of shipper selection) and R5 (policy of reject order when costumer is not registered). In turn, the execution of these rules actions actives another rules. And so on, until the end of process rules set.
3 Flexibility Management The first aim of our work is to automate the management of the flexibility of business process rule based modeling by estimating the impact of business process changes. This should help in planning, organizing, and managing the necessary resources that would ensure change achievement. To achieve this objective, we need to study the relationship between the rules. We identify three relationships between business rules: 1. Inclusion relationship: Shows the case of a rule (base rule) that includes the functionality of another rule (inclusion rule). Two rules have an included relationship between them if the completion of the base rule’s action requires the completion of the inclusion rule’s action. In the previous example, to calculate the final price, the shipping price must be calculated before. R6
« Inclusion »
R7 Inclusion rule
Based rule
2. Extension relationship: Shows the case of a rule (extension rule) that extends the functionality of another rule (base rule). Two rules have an extension relationship between them if the completion of the extend rule’s action achieves the completion of the base rule’s action. In the previous example, if we suppose that a loyal customer receives a discount and a new discount rule R12 is added. As a result, there is an extension relationship between R2 (rule to identify a costumer) and R12 (rule to calculate discount) because the functioning of R2’s action will complete the functioning of R12’s action. R2
« Extension »
Extension rule
R1 Base rule
Description of Flexible and Self-healing Business Processes
19
3. Cause/Effect relationship: Shows the case of a rule (cause rule) that activates another rule (effect rule). Two rules have a cause and effect relationship between them if the execution of a rule will activate the effect rule. As a result, the execution of a cause rule’s action triggers a post event, which necessary activates the effect rule. Thanks to this relationship, the order of process activities can be defined by describing the post events based on ECAPE. In our previous example, the performance of R2’s action (costumer verification) will trigger end-customer–verification post-event. This latter is the event activator of rule R3. There is a cause and effect relationship between R2 and R3. « Cause/Effect »
R2
R3
Cause rule
Effect rule
Note that, the included and extension relationships are manually defined by a designer, while cause/effect relationship can be detected automatically by analyzing the events and post events rules parts. The fact of defining relationships between business rules allows determining which rules must be revised in case of change. Firstly, all base rules which have an inclusion relationship with a changed inclusion rule must be revised by a business process designer. In the previous example, if the enterprise decides not to deliver its products rule R4 will be deleted from the process model. The suppression of an inclusion rule (R4) will affect a base rule, which requires the completion of the inclusion rule’s action. Due to this, human intervention is required to decide how we can change a base rule in order to keep the process coherence. Secondly, all base rules which have an extend relationship must be revised when an extend rule is changed. In the previous example, if we change rule R2 (rule responsible for costumer identification), which represents an extension rule, then base rule R12 (rule responsible for discount calculation) must be revised. Finally, all effect rules which have a cause/effect relationship must be revised if the cause rule is changed in order to ensure the activation of these rules. For example the consequence of removing rule R2 in our previous process is the inactivation of R3, because R2 is the cause of activating R3. For this purpose, a designer must revise the effect rules if the cause rule is changed. To formalize the flexibility management of a process model, we propose to translate the business process into a graph of rules. Indeed, vertices of this graph represent the business rules, which constitute the business process, and arcs represent the relationships between the various rules. A graph of rules is formally defined as follows: Definition1. A graph of rules is a directed graph Gr (R, Y) with - R is a set of vertices that represent business rules. - Y is a set of arcs that represent three kinds of relationships. (1) Yi is a sub set of Y such that if yi (ri, rj) then ri is included in rj. (2) Ye is a sub set of Y such that if ye (ri, rj) then ri extend rj. (3) Yc is a sub set of Y such that if yc (ri, rj) then ri cause the activation of rj. The rule graph of our previous example is illustrated by figure 2. The graph of rules helps determine which rules are impacted by the change of a rule. Indeed, if any vertex changes, all successor vertices must be revised. Formally this will be defined as follows:
20
M. Boukhebouze et al.
Fig. 2. Rules graph of the purchase order process
Definition 2. let Gr (R, Y) be a rule graph and ri a vertex rule such that ri The set of ri successor neighbors is noted as N+(ri) such that ∀ rj include, extend or, cause rule for the base or effect rule rj.
∈ R.
∈ N+(ri), ri is either
- We note Ni+(ri) the set of ri successors such that ∀ rj for the base rule rj.
∈ N+(ri), ri is an include,
- We note Ne+(ri) the set of ri successors such that ∀ rj rule for the base rule rj.
∈ N+(ri), ri is an extend
- We note Nc+(ri) the set of ri successors such that ∀ rj for the effect rule rj.
∈ N+(ri), ri is a cause rule
- We note Nc-(ri) the set of ri predecessors such that ∀ rj ∈ N-(ri), rj is a cause rule for the effect rule ri. - We note N*(ri) the set of ri neighbors such that N*(ri) = Ni+(ri) ∪ Ne+(ri) ∪ Nc+(ri) ∪ Nc-(ri). If ri ∈ R change, then the designer will have to revise all rules N*(ri). Indeed, to keep the process coherence, the flexibility management of the process modeling will request from a designer to revise the N*(ri) set when a rule ri is changed. In the example of figure 3, rule R6 must be revised if rule R4 is deleted because N*(R4) = {R2, R6}. The flexibility management notifies a business process designer to revise rule R2 and R6 in order to decide how this rule can be changed. Note that we must check out the predecessor neighbors Nc-(ri) for the cause/effect relationship since it is not acceptable that a rule activates a non-existing rule. For instance, if we delete R4 we will also have to revise R2 to ensure that this letter does not activate a deleted rule.
Description of Flexible and Self-healing Business Processes
21
However, when changing the set of successor neighbor’s include and extend rules (Ne+(ri) ∪ Nc+(ri)) the designer should revise entirely the concerned rules. This revision may generate a cascade of rule change. Indeed, if one rule changes, the set of include and extend rules will be revised and properly changed. This will raise the need to revise another set of successor neighbor’s rules of the rule that was revised. In the process example, if we change R4, then rule R6 (extend rule) will be revised. This revision consists of analyzing the entire code of rule R6 to decide how we can change this latter in order to keep the coherence of the process. If we change rule R6 after its revision, this results in revising R7. In turn, R7 can be changed after revision, this results into revising R8 and R12. And so on, until we don’t have any rule to revise. In contrast, to change the set of successor neighbor’s cause rules (Nc+(ri) ∪ Nc-(ri)) which do not generate a cascade of the change because the designer, in this case, should only revise the event and post event part of the rules concerned. In the process example, if we change R4, then rules R2 will be revised. This revision consists of updating the post event to ensure that this letter does not activate a deleted rule (as we explained above). After this update, we do not need to revise another set of successor neighbor’s rules. The following algorithm shows the change impact of a rule ChangeImpact_Procedure (Rx , stack S) { if NotExist(S, RX) then // test if the rule’s stack S contains the rule RX { push (S, RX); // push rule RX onto stack S } if NotExist(S, Nc-( RX)) then RX { push (S, Nc-( RX)); } if NotExist(S, Nc+( RX)) then { push (S, Nc+( RX)); } if Ni+( RX) ≠ Φ then { ChangeImpact_Procedure (Ni+( RX),S); }Else { if Ne+( RX) ≠ Φ then {ChangeImpact_Procedure (Ne+(RX),S); } Else { exit ();}} }
In previous process, rule R4 change cascade (R2, R6, R7, R8, R9, R10, R11 and R12) needs to be revised in order to ensure the activation of all the rules and the business coherence of the process as well.
4 Business Process Self-healing The second aim of our work is to ensure the reliability of a business process through self-healing. Indeed, the change of rules may raise exceptions at run-time if not properly reflected on these processes. For this reason, we propose a self-healing strategy for the process on the basis of the ECAPE formalism. This requires going over two steps:
22
M. Boukhebouze et al.
4.1 Exceptions Recognition Exception recognition attempts to identify any risk of exceptions before the implementation of a process occurs. In this paper we are interested in detecting exceptions that are related to functional coherence of a business processes. Such exceptions could come from a poor design for example infinite loops and process non-termination. To help designers in detecting early these errors it is useful to perform a high-level modeling verification in order to provide a reliable operational process However, to identify these functional errors we should have a process data state. Moreover, this verification cannot be done if an execution scenario is not available. In the case of a declarative modeling it is often difficult to have such a scenario at the modeling time. To address these problems, we propose to use a cause/effect sub-graph of rules graph (Fig. 3) in order to verify the functioning of the business process. In such a sub-graph we consider only the cause/effect relationships between rules (figure 3.A). The use of this sub-graph for verification of an ECAPE process is backed by the fact that this latter represents how the process rules set is activated. As a result, a cause/effect subgraph formalizes the process functioning. For illustration purposes we adopt the live-lock case. This case occurs if a sub set of rules behave like an infinite loop, which puts a process in an endless state. This could be due to a poor analysis of the rules that are executed. In the previous example, if rule R9 is changed to allow customers add articles to the same bill (figure 3.B), then the new rule R9 will rerun the process by activating rule R2. As a result, the cause/effect sub-graph contains two circuits (R2, R3, R7, R12, R8 and R9) and (R2, R4, R6, R7, R12, R8 and R9). Both circuits represent loops in the process and both may be infinite. To determinate whether a circuit in a cause/effect sub-graph can be terminated, we need to have a data state. However, in process modeling, such a data state does not exist.
(A)
(B)
Fig. 3. The cause/effect sub-graph of the purchase order process
Description of Flexible and Self-healing Business Processes
23
For this reason, each circuit could be now considered as a risk of infinite loop. As a result, rules in each circuit will be identified for testing in the execution phase. 4.2 Exceptions Handling As mentioned in the previous section, exceptions recognition attempts to detect risks of exceptions by identifying the process part that can eventually cause such exceptions. However, an exception handling step is necessary to monitor these parties at run-time, and to react in case these exceptions become effective. The aim of this verification is to avoid the business process to be in an unstable situation. For this reason, the exception handling is launched in parallel with the execution of the process. In this way, this exception handling tries to respond to a situation that would destabilize the process performance by executing compensation codes. For do this, the exception recognition marks the process parts, which will likely lead to exceptions by markers called check-point. This is useful for keeping track of these parties in the executable process.
Fig. 4. The addition check-point in the ECAPE rule codes
Fig. 5. The addition check-point in the BPEL process codes
Indeed, after the exception recognition step is over, check-points are added to the ECAPE process code. When translating ECAPE process into an execution process code (as BPEL), these markers are also translated and added to the execution code. The result contains the operational business process code and also the check-point associated with a process’s parts that may produce exceptions and deserve special monitoring. Exception handling is launched in parallel with the execution of the process. A runtime engine interprets the process code by executing the process activities described in BPEL for example. If the runtime engine meets a check-point, the execution of the process code is stopped and a routine associated with the number of checkpoint is called. The aim of this routine is to verify whether an exception occurred in the executable process. In case the exception occurs, the routine launches an alternative remedy of the exceptional effect. Indeed, this remedy can concern a compensation code, which reaches the process into a more stable situation or substitution of the unavailability Web service (or application) needed to execute one business activity.
24
M. Boukhebouze et al.
In following we detail how the exceptions handling can manage live-lock exception. Indeed, as we saw previously, due to lack of data state in the modeling phase, this exceptions recognition cannot determine the finite nature of a circuit of a cause/effect sub-graph. To this end, all the rules of each circuit will be marked by adding check-points to its codes in order to enable the monitoring of this circuit in the execution phase. However, to optimize the addition of these markers, two checkpoints are added by circuit: the first is added to the action of the starting rule circuit. The second is added to the action of the ending rule circuit. The justification for this choice is explained thereafter. For instance, to manage the two circuits of the cause/effect sub-graph in the purchase order process (Figure 3.B), a check-point is added per circuit in the action code of rule R2 (the starting rule of the two circuits) and in the action code of rule R11 (the ending rule of the two circuits) (figure4). This definition will be translated into a script execution expressed in BPEL for example. The check points will also be translated by placing them in the script code associated to the rule’s action execution code (Figure 5). This will help keep track of the circuits in the executable process in order to monitor the infinite loops. Indeed, when the runtime engine meets a check-point management loop, the process execution is stopped and a routine associated with this check-point is called. This routine will check if the process is in a state where it constantly rotates (live-lock) by bases on the data state of the process. A state data is defined as follows: Definition 3. A data state of one process at a time t, noted β (t ) , is the vector of process values at a time t. The check-point handling loop routine will test the data process by considering the following property. Property. In a cause/effect sub-graph, a circuit is finite if the two following properties are verified: 1) All the process variables belong to a boundary interval 2) The change of data is respected as ∀t , ¬∃t ' / β (t ) = β (t ' ) According to this property, completing a loop requires that the data state changes over time, i.e., at least one of the process variables must change in each loop iteration. In the previous example, exception handling will ensure that the data state changes in each iteration (adding an article, deletion of an article, etc.). If the process receives the same information in one command instance this means that the process has entered an infinite loop. Based on this property, the check-point routine of a starting rule in a circuit (in the preceding example R2) compares the data state of the current iteration with the data states of all previous loop iterations. If the routine detects a recurring data state, the loop is infinite. In this case, the routine will launch a compensation code in order to lead the process execution to a valid situation. On contrary case, the routine backup the execution hand to runtime engine to continue normally the execution of process until the meeting of another check point. If this time it is a check-point of an ending rule circuit (in the preceding example R11) this means that the runtime engine has completed to execute the loop and it is executing other process parts. In this case, the routine will
Description of Flexible and Self-healing Business Processes
25
remove all the previously data states saved during the various loop iterations. This is why the check-point is added only on to the starting and ending circuit rule.
5 Related Work The rule-based approach proposes to model the logic of the process with a set of rules using declarative languages. According to Wagner in [8], the rules models can be classified, in accordance with to MDA architecture (figure 6). Indeed, business rules models, supported by languages, are proposed to formalize rule expressions. Indeed, the rules formalism used in these models depend to what categories of rule they represent. An example of this is OCL [9] which is used to express integrity rules and derivation rules in conjunction with UML models. PENELOPE [10] is another language that uses the deontic logic to formalize the rules in terms of obligations and authorizations that feature business interactions. Note that, some general rule markup languages are proposed. These languages can be used for interchanging rules between different rule languages like RuleML [11] and R2ML [8]. However, according to Knolmayer et al. in [12] the reaction rules (ECA) are the most adapted to model business rules. Giurca et al. in [7] justified this by the fact that this kind of rules is easier to maintain and it cover all other rules kinds (Integrity, deviation, production, and transformation).This is done in various works, like the AgentWork framework of Müller et al. in [13], where ECA rules are used for temporal workflow management, or in the work of Zeng et al. in [14] that considers a process as a set of tasks coordinated between them by ECA rules and use agents to encapsulate services that perform the process tasks. Our work is positioned in ECA rule category. However, in the aforementioned declarative process modeling languages using this formalism, the modeling flexibility with focus on the impact of a rule change on the rest of a process is not well looked into. Therefore, there is a need for a more powerful formalism that would allow a complete definition of this relationship. This is why we chose the ECAPE formalism.
Fig. 6. Rule models and languages at different levels of abstraction [8]
26
M. Boukhebouze et al.
Finally, the execution rule models is proposed in order to formalize the execution of the rules set as ILOG JRules. However, these execution rule models do not allow having an explicit execution scenario. As a result a more powerful paradigm is deemed appropriate in order to translate, in an easy way, a business process into a formal model and ensure the process verification allowing to building an execution scenario in an automatic way. This is why we opted for the use of ECAPE formalism.
6 Summary In this paper we proposed a new rule based model that aims at tacking the following two issues: the implementation of business rules in a business process code makes this process rigid and difficult to maintain, and the lack of mechanisms to support the verification process. For this reason, ECAPE formalism is used in order to describe a business process using a set of business rules that are translated into a rule graph. The analysis of this graph guarantees the solving of thee two aforementioned issues: the flexibility of business processes modeling and the self-healing of the business process. In the future, we aim to extend the model in order to propose a vocabulary metamodel. Another future aim is to optimize the operational process by analyzing, in diagnostic phase, the events historic.
References 1. Goedertier, S., Vanthienen, J.: Compliant and flexible business process with business rules. In: 7th Workshop on Business Process Modeling, Development and Support (BPMDS 2006) at CAiSE 2006, pp. 94–104 (2006) 2. The Business Rules Group, Defining Business Rules, What are they really (July 2000), http://www.businessrulesgroup.org 3. OASIS: Business Process Execution Language for Web Services (BPEL4WS): Version 2.0. In BPEL4WS specification report (2007) 4. Lu, R., Sadiq, S.: A Survey of Comparative Business Process Modeling Approaches. In: Abramowicz, W. (ed.) BIS 2007. LNCS, vol. 4439, pp. 82–94. Springer, Heidelberg (2007) 5. Regev, G., Soffer, P., Schmidt, R.: Taxonomy of Flexibility in Business Processes. In: Seventh Workshop on Business Process Modeling, Development, and Support In conjunction with CAiSE 2006 (2006) 6. Ganek, A.G., Corbi, T.A.: The dawning of the autonomic computing era. Technical report, IBM 7. Giurca, A., Lukichev, S., Wagner, G.: Modeling Web Services with URML. In: Proceedings of Workshop Semantics for Business Process Management 2006 (SBPM 2006), Budva, Montenegro, June 11 (2006) 8. Wagner, G.: Rule Modeling and Markup. In: Eisinger, N., Maluszynski, J. (eds.) Reasoning Web. LNCS, vol. 3564, pp. 251–274. Springer, Heidelberg (2005) 9. Object Management Group, Object Constraint Language (OCL) (2003), http://www.omg.org/docs/ptc/03-10-14.pdf 10. Goedertier, S., Vanthienen, J.: Designing compliant business processes with obligations and permissions. In: Eder, Dustdar, pp. 5–14 (2006)
Description of Flexible and Self-healing Business Processes
27
11. Schroeder, M., Wagner, G.: Languages for Business Rules on the Semantic Web. In: Proc. of the Int. Workshop on Rule Markup, Italy, June 2002, vol. 60. CEUR-WS Publication (2002) 12. Knolmayer, G., Endl, R., Pfahrer, M.: Modeling Processes and Workflows by Business Rules. In: van der Aalst, W.M.P., Desel, J., Oberweis, A. (eds.) Business Process Management. LNCS, vol. 1806, pp. 16–29. Springer, Heidelberg (2000) 13. Müller, R., Greiner, U., Rahm, E.: AgentWork: a Workflow System Supporting RuleBased‘ Workflow Adaptation. Data & Knowledge Engineering 51(2), 223–256 (2004) 14. Zeng, L., Ngu, A., Benatallah, B., O’Dell, M.: An Agent-Based Approach for Supporting Cross-Enterprise Workflows. In: Proceedings of the 12th Australasian Database Conference, ADC 2001 (2001)
Business Process Aware IS Change Management in SMEs Janis Makna Department of Systems Theory and Design, Riga Technical University, Latvia, 1 Kalku street, Riga, LV-1658, Latvia [email protected]
Abstract. Changes in the business process usually require changes in the computer supported information system and, vice versa, changes in the information system almost always cause at least some changes in the business process. In many situations it is not even possible to detect which of those changes are causes and which of them are effects. Nevertheless, it is possible to identify a set of changes that usually happen when one of the elements of the set changes its state. These sets of changes may be used as patterns for situation analysis to anticipate full range of activities to be performed to get the business process and/or information system back to the stable state after it is lost because of the changes in one of the elements. Knowledge about the change pattern gives an opportunity to manage changes of information systems even if business process models and information systems architecture are not neatly documented as is the case in many SMEs. Using change patterns it is possible to know whether changes in information systems are to be expected and how changes in information systems activities, data and users will impact different aspects of the business process supported by the information system. Keywords: business process, information system, change management.
1 Introduction Business process (BP) changes may be introduced because of different reasons inside and outside the process [1], [2]. Changes may range from small incremental changes to time consuming business process reengineering projects [3], [4]. Taking into consideration the fact that almost all business processes are computer system supported, changes in the business process, in many cases, cause changes in the information system (IS), and changes in the information system may cause further changes in the business processes. So, at a particular time point, regardless of the initial reason of changes, the IS becomes a change object in the process of business process improvement or reengineering. One of the ways how changes could be managed is development of well elaborated business process models which are related to the well elaborated information technology (IT) architectures [5]. However, small and medium enterprises (SMEs) can rarely afford the time and financial resources for the business process model and IT architecture documentation and maintenance, - not only because of the initial effort needed for this type of activities but also because of very frequent changes in J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 28–42, 2009. © Springer-Verlag Berlin Heidelberg 2009
Business Process Aware IS Change Management in SMEs
29
processes and architectures that may overrun any documentation efforts. In this paper we discuss a different approach to IS change management in SMEs, that is based on the use of change patterns to anticipate a full range of changes if a particular change element changes its state in the SME. The change patterns are detected by theoretical analysis of IS change literature, enterprise architecture frameworks and information systems related theories. They are checked on 48 information systems change cases in SMEs, and represented in the IS change management tool prototype. The paper is structured as follows. Research approach is briefly discussed in Section 2. In Section 3 we introduce basic elements of the change patterns and their options of change. Section 4 reports on three basic change patterns and applies them in the context of BP changes management. Section 5 consists of brief conclusions and directions of future work.
2 Research Approach The IS change patterns discussed in this paper were obtained by identifying and analyzing basic elements of change in theoretical and practical IS change situations. The research is based on the following two assumptions: 1.
2.
If one of the basic elements changes, the BP and/or IS loses its relatively stable state and the change process starts, which aims at achieving a new stable state. All basic elements are related, the strength of change propagation may differ depending on the type of change situation, therefore not all basic elements are to be changed to achieve a new relatively stable state.
Thus, the change patterns were identified by analyzing relatively stable states before and after IS change projects. The research approach consists of the following activities: • • • • • •
Analysis of IS and BP definitions to find the objects of change - elements which alter during the changes. Testing the relevance and completeness of the set of identified change elements. Analysis of IS theories and change management literature to identify change options of elements and to identify sets of mutually dependent elements. Stating a hypothesis concerning basic change patterns. Testing the hypothesis on real-life IS change projects. Building a prototype of the IS change management tool for SMEs.
The basic change elements which characterize IS were identified by analyzing several IS definitions [1], [2], [4], [6], [7]. As a result, data, IS activities and IS users found as the most referred to IS change objects (Fig. 1). Available BP definitions were divided according to the aspects they describe: (1) definitions, which are based on process theory; (2) definitions, which are based on collaboration between several BPs; (3) definitions, which are based on BP activities or transformations. Basic IS change elements, which are common to all BP definitions
30
J. Makna
Control 6
Teritory 7
Knowledge 2
IS IS Activities 4
Data 1
Users 3
Resources 8
BP BP Activities 5
Product 9
Fig. 1. IS change elements
and characterize BP with respect to IS changes, were identified by comparing the above mentioned types of BP definitions. These identified change elements are BP activities, users, knowledge, control, territory where activities take place, resources, and the product (Fig. 1). The obtained set of elements (all elements in Fig. 1) was analyzed with respect to different enterprise architecture frameworks for relevance of the elements and completeness of the set [8], [9], [10], [11], [12], [13]. To check the relevance of elements the test was performed how each element is represented in perspectives of enterprise architectures. The following organization architectures were considered: TOGAF[14], RM-ODP[15], Zachman[16], DOD[17], Geram[18], Cimosa[19]. The results of analysis show that the elements data, knowledge, users, IS activities and BP activities have corresponding views in each architecture, while the elements territory, control, resource and product are represented in several views. Fact that all elements are represented in organization architecture indicated that the elements are relevant. In order to verify the completeness of the identified set of IS change elements, we tested the possibility of representing enterprise architectures views by the elements of the set. It was found that on high level of representational detail change elements could cover all views of enterprise architectures [14], [15], [16], [17], [18], [19] Thus the set of elements was considered to be complete with respect to organization architecture frameworks. The identified set of elements was used to reveal patterns of element changes in several IS theories and change management literature. In its turn, the identified patterns of element changes were analyzed to identify the most frequent change patterns. The change patterns were tested against 48 IS change projects that were accomplished in different SMEs during the last decade. The duration of the projects varied from 6 months to 5 years. The SMEs were public and private institutions involved in different types of business: trade companies, financial institutions, transportation companies. The results of analysis approved identified change patterns and the most frequent patterns from the theoretical point of view were the most frequent ones in the above mentioned IS change projects, too.
Business Process Aware IS Change Management in SMEs
31
3 Basic Change Elements In this section we introduce basic elements of change patterns (Fig. 1). They are data, knowledge, users (IS users), IS activities, BP activities, control, territory (actually it means the place where BP supported by IS is carried out), resources (other than already mentioned elements), and products. In Fig. 1 a number is attached to each element to simplify further description of elements and their complementary change patterns. Each change element has several options of change. The change pattern consists of a particular set of change options of certain basic change elements. The fact that change options are amalgamated in one and the same set (pattern) means that these changes are likely to complement one other in certain change situations. To identify change patterns it is necessary to know the change options of elements and relationships between change options of different elements. In order to identify change options and relationships between them the following change relevant theoretical sources were analyzed: (1) more then 60 theories related to information systems [20], (2) methods of IS and BP change management and reengineering. Change options of elements and the relation between them were identified by answering two questions: (1) what changes take place in each element and (2) what elements must be changed according to a particular theory or method. By answering the first question, all change options of each element were identified (Table 2). Complementary changes in several elements specify connections between these elements. Thus, by answering the second question, interconnections between elements during changes were identified. All theories were divided into 14 groups according to different IS change aspects considered in these theories. The groups of the theories are presented in Table 1. The sets of elements which change their state according to a certain group of theories are presented in Table 3. Table 1. Groups of theories used for identification of change patterns
N Description of group of theories o 1
2
Theory of administrative behaviour [21] specifies that employees have restrictions of cognition. Organizational knowledge creation theory [22] specifies that it is necessary to improve or create new knowledge during IS change situations. According to theory of administrative behaviour [21] the knowledge is related with performance of employees. According to language action perspective theory [23] activities of employees are taking place via communication between employees. During communication exchange of knowledge and data takes place. Transactive memory theory [24] also specifies exchange of knowledge and data by employees. More detailed exchange process describes knowledge-based theory of the firm [25] specifying such options as receiving, transferring and creation. The receiving of knowledge confirms the knowledge-based theory of the firm [25] and specifies that the organization needs new knowledge, which is outside of organization. Theory of administrative behaviour [21] specifies that restriction of cognition requires bringing in knowledge from outside of organization. From this theory follows that exchange of knowledge, data and activities between employees takes place. Such exchange is confirmed by agency theory [26] and principal agent problem [27]. According to agency theory [26] and principal agent problem [27] handing over activities requires to hand over knowledge and to receive data. The data in this case characterize performance of activities.
32
J. Makna Table 1. (Continued)
3 4
5
6
7
8
9
10
The theories referred in Row 2 of this table point to another situation when activities and knowledge are received and data are handed over by the BP. Several theories examine relationships between activities of employees and data. Media richness theory [28] specifies that organization is processing the information to reduce uncertainty and doubtfulness in organization. According to argumentation theory [29] and description of Toulmin’s layout of argumentation [30] employees make decision based on data, facts or information. In this connection the data quality must be improved to improve decision. It confirms cognitive fit theory [31]. According to cognitive fit theory [31] data presentation about activities improves performance of activities. It is consequently possible to assert that quality of data is related with activities which are performed by employees. According to transaction cost theory [32] organization grows until cost of transaction does not exceed cost of the similar transaction in the market. To meet the conditions of theory, organization should perform the following analyses: (1) define enterprise BP, (2) identify costs of BP, and (3) compare BP costs with similar BP in market. Based on these analyses organization provides the following changes in elements: (1) improves data quality and create new data if it is necessary, (2)improves IS activities to obtain new data, (3) improves BP activities to decrease the cost of activities, (4) improves data and information exchange between employees, improve control of BP and product. In accordance with resource theories, the organization: (1) uses renewed or new resources as requires dynamic capabilities theory [33], (2) creates special buffers of resources or implement structural mechanisms and information processing to reduce uncertainty as required by organizational information processing theory [34], and (3) uses resources, which is hard to imitate or substitute as required by resource-based view of the firm [35]. According to these theories, organizations need to identify information and data about characteristics and accessibility of resources. In order to obtain new data about resources, organization changes the IS activities and improves or creates new knowledge about resources. As a result, BP activities, usage of resources and BP product improve. S-curve and technology adoption theory [36] proposes three stages of organizational growth. To provide the transition from one stage to another the following changes must happen in organization: (1) new data is identified or data quality is improved, (2) new knowledge is identified or knowledge quality is improved, (3) the quality of IS activities is improved to support new data and knowledge, and (4) BP activities and control are improved. Reengineering methods suggest two ways how to improve BP: (1) to reduce the cost of production and to create the different product [37]. To reduce the cost of production it is necessary to know data about BP activities, resources and control; (2) 0rganization rebuilds or redesigns BP to decrease cost of control, because up to 95 % of time that is used for controlling does not add value to BP product [38]. To create a different BP product, organization clarifies product users’ requirements. New product is created by changing product functionality. In organization this requires the following changes: (1) to improve or create new data, (2) to improve or create new knowledge, (3) improve IS and BP activities. 
BP reengineering methods propose three BP improvement dimensions: organizational structure, management, and human resources [3]. When the organizational structure alters, BP activities change. Some BP activities are handed over or received from other BPs. During the transfer of activities the territory
Business Process Aware IS Change Management in SMEs
33
Table 1. (Continued)
11
12
13
14
where the activities take place changes. To support activity transfer the data and knowledge are to be transferred, too. A similar transfer takes place during changes in management. Management based on organizational structure is replaced by management based on information. It means that lower hierarchical level employee receives new information and new knowledge to perform new activities. Higher hierarchical level hands over some activities and receives the data about fulfillment of activities. These changes fall into two types. The first type of changes point to handing over BP activities, knowledge and territory where activities are performed and to receiving data about the performance of activities. The second type of changes are presented in next subdivision of this table. The theories referred at in Row 10 of this table point to the second type of changes mentioned in Row 10, namely, to receiving of activities, knowledge and territory where activities are performed and to handing over the data about the activities. During the changes in human resources, the individual task executers are replaced by teams. Team consists of employees from different departments which execute different tasks. Thus the team task execution reduces the time of coordination and control between different departments. The team members receive knowledge and activities and send data about the fulfillment of activities. Functional specialists are replaced with process executers during changes in the human resources. As a result, new users of IS require the data. Data quality improves and new data and knowledge are created to support changes in BP activities. Consideration of the human resource dimension proposes to view knowledge of organization as organizational resource instead of using experts as functional specialists. The knowledge of experts are integrated into BP activities, IS or BP products.
All basic change elements and their change options are shown in Table 2. The first column shows the number of the change element; the second column shows the name of the element. Change options are reflected in the third column. For all elements one of the change options is “No change”. This option is not listed in the Table 2. The fourth column of the table is used for brief explanation of change option. The last column indicates references to the sources where change options were indicated. The options are explained taking into consideration that the BP under the discussion is related to other BPs and may take over from or delegate different change elements to other processes. The elements may overlap, however, their mutual dependencies are not considered, because large number and variety of dependencies do not allow elaboration of theoretically obtained and practically approved patterns of complementary changes. Change options for each element are mutually exclusive. Change options of all elements are presented in Table 3, where the following abbreviations are used “Impr” means Improvement, “Rec” means Received, “New” means New data or New knowledge. Each row in Table 3 represents a specific change pattern derived from a particular group of theories (Table 1). Some patterns from Table 3 overlap (for example 1 and 4 or 7 and 8 and 9) thus it is necessary to reduce the number of patterns. Therefore, it is necessary to define a new set of patterns that includes all specific change patterns from Table 3. To define this set of patterns an element that has change options in all specific change patterns is used.
34
J. Makna Table 2. Change elements and change options
N o
Element
Change option
Explanation
Theory references
1
Data
Received
BP receives data it did not possess before the change Gives data over to another BP
[25], [24], [26], [27], [3] [23], [24], [3]
Generates new data inside the BP
[33],[34],[35],[36],[37] [28],[29],[30],[31],[3 2],[33],[34],[35],[36] [21],[22],[25],[33],[34] , [35], [36], [37], [3] [23], [24], [25], [26], [27], [3] [23], [24], [25], [21], [26], [27], [3] [3] [3]
More expensive Different
The quality of existing data is improved Knowledge is obtained internally during the change During the change knowledge is given to another BP During the change knowledge is received from another BP IS after change is used by new users After changes users start to use another IS More activities are performed by IS after the change Less activities are performed by IS after the change Some activities are handed over to another BP (IS) Some activities are received from another BP (IS) During the change activities were taken over by another BP During the change activities were taken over from another BP The activity becomes more intensive, larger, or smaller. BP benefits from this change. Control requires less time, becomes simpler, becomes less expensive, etc. The activities “geographically” are performed in the territory of another BP after the changes Before the changes certain activities “geographically” were performed in the territory of another BP After changes resources become cheaper After changes resources become more expensive The change of resources
[33], [34], [35], [37]
Improved
Improved in all possible ways
[32],[33],[34],[35],[3]
Handed over New Improved 2
Knowledge
New Handed over Received
3
IS users
New Moved
4
IS activities
Extended Suspended Handed over Received
5
BP activities
Handed over Received Improved
6
Control
Improved
7
Territory
Handed over Received
8
9
Resources
Products
Cheaper
[32], [33], [34], [35], [36], [37] [3] [3] [3] [26], [27], [3] [26], [27], [3] [21],[22],[28],[29],[30], [31], [32], [33], [34], [35], [36], [37], [3] [32], [36], [37], [38], [3] [3]
[3]
[33], [34], [35], [37] [33], [34], [35], [37]
Business Process Aware IS Change Management in SMEs
35
This element is BP activities. All specific change patterns from Table 3 are grouped according to change options of element BP activities. Thus the following three pattern of changes are obtained: 1) The first pattern called “Internal” depicts changes of elements when BP improves using BP internal possibilities. The change option of element BP activities is “Impr”. 2) The second pattern called “Extrec” depicts changes of elements when BP receives activities from related BP or external environment. The change option of element BP activities in this pattern is “Rec”. 3)
3 4 5 6 7 8 9 10 11 12 13 14
Handed over Rec Impr New Impr New Impr New Impr New Impr New Impr New Rec Handed over Handed over Impr New Impr New
Rec
Rec
Handed over
Handed over Impr Impr
Impr New Impr New Impr New Impr New Handed over Rec
Impr
Territory
Impr
Impr
New
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Impr
Handed over Rec
Impr
Impr Impr
Impr
Impr
Impr
Rec
New
Impr
Impr New
New
Impr
Handed over Impr
Impr
Impr
Impr New
Control
BP activities
IS activities
Impr
Product
2
New Impr
Resources
1
IS users
Data
No
Knowledge
Table 3. Element change options according to theory groups
Impr Impr
Impr
Hande d over Rec
The third pattern depicts changes of elements when BP sends some of activities to related BP. The change option of element BP activities in this pattern is “Handed over”. These basic patterns are described in Section 4 and are shown in Table 6.
36
J. Makna
The research hypothesis that these three basic patterns are the dominant ones in IS change situations in SMEs was tested by analyzing 48 real-life IS change management projects. All 48 projects took places in small and medium enterprises and were related with IS and BP changes. The duration of the projects varied from 6 months to 5 years. The SMEs were public, private and government institutions involved in different types of business: trade companies, financial institutions and transportation companies. During the projects, the state of each element was registered before and after the changes. Each state of elements was characterized by change options as shown in Table 2. All results were presented in a table where each row and column represents a particular change option. There are 23 options from Table 2 and 9 options “no change” of each element. Thus, a table with 32 rows and 32 columns shows all statistics of change options. For each change option related changes in all IS projects were counted and the sum of them represented in cells of corresponding columns. An example of part of the table is shown in Table 4. The first column of the table shows the name of change option. The second column of the table shows the number of occurrences of the change option. The rows in Table 4 showsthat BP activities did not change in 5 cases, in 10 cases they were handed over to another process and received from another process, and in 23 cases BP activities were improved.
2
…
…
…
3
…
1
1
3
Other change options
16
4
1
IS activity extended
23
4
1
3
…
2
…
1
9
…
15
…
…
…
IS activity Suspended
3
2
1
IS activity received
10
2
IS activity Hand over
1
IS activity no change
10
Data improved
4
Data received
Data new
5
Data hand over
Exp. number BP activity no change BP activity Hand over BP activity Received BP activity Improved Other change options
Data no change
Table 4. The results of analyses of 48 IS change projects
4
3
2
5
1
2
…
…
…
…
…
…
In cases when element BP activities had changed, the related changes occurred as follows: 16 cases with option new in element data, 2 cases with option no change in element data, no cases in element data hand over change option, 3 cases with element data option received, 2 cases with element data option improved, and etc. In the same way data that correspond to all other change options were represented. The next step was to identify the strength of relationships between elements. The strengths of relationships were identified using category data analyses method [39].
Business Process Aware IS Change Management in SMEs
37
The category data analyses method allows to identify strength of relationships between several elements based on the amount of experiments. The number of experiments is important in this research, because there are different numbers of occurrences for particular change options. Category data analyses method allows to identify relations between matrix elements and numerically characterize relationships between them. Numerical characteristics of relationships enable to distinguish between strong and weak relationships. According to category data analyses method [39] the two variables of a matrix are independent if the value of matrix row i and column j are equivalent with nia * naj / n. Where: n – number of experiment, nia – total of row i, nja – total of column j. Thus deviation from independence in this cell can be expressed with equation (1). Dij = nij – nia * naj / n
(1)
Equation (1) was applied to all variables of change options representation table exemplified by Table 4. As a result a new table, which shows deviation from independence for all change options represented by rows, was obtained. Part of these results is illustrated by Table 5.
Data new
Data no change
Data hand over
Data received
Values of the rest elements
BP activity No change BP activity Hand over BP activity Received BP activity improved Values of the rest element
Exp. number
Table 5. Example of result derived with category data analyses method
5
1.5
-0.416
-0.625
-1.04
…
10
-4
1.16
0.75
1.91
…
10
-2
-.083
2.75
0.916
…
23
4.5
0.084
-2.87
-1.79
…
…
…
…
…
…
…
Results of category analysis (exemplified in Table 5) made it possible to evaluate the strength of relationships between change options. The highest values in the cells of the table (except of column 1) correspond to the strongest relationships between elements of corresponding rows and columns. Thus it is possible to identify which relations between change options in patterns are dominant. Category analysis of IS change project data approved that relations between change options in three above mentioned change patterns are with considerably stronger relationships than relations between other change options. The patterns and change options of elements are described in more detail in the next section.
4 Basic Change Patterns for IS Change Management

The three change patterns and the change options of the elements are reflected in Table 6. Pattern Internal refers to internal changes in one particular BP and the IS that supports it. Patterns Extrec and Extsend refer to situations where changes affect the cooperation of several business processes. Each pattern involves a different set of changes of the basic change elements. In each basic IS change pattern it is enough to know the new state of one change element to anticipate the other changes in IS and BP elements that are to happen when moving into a relatively stable new state of the BP and/or IS.

Table 6. Most common complementary change patterns
No  Element        Internal               Extrec                      Extsend
1   Data           New or Improved        Received                    Handed over
2   Knowledge      New                    Handed over                 Received
3   IS users       No change              New                         Moved
4   IS activities  Extended               Suspended or Handed over    Received
5   BP activities  Improved               Handed over                 Received
6   Control        Improved               Improved                    Improved
7   Territory      No change              Handed over                 Received
8   Resources      Cheaper or Different   Cheaper                     More expensive
9   Products       Improved               Improved                    Improved
Pattern Internal usually occurs in situations where the aim is to obtain new data about the business process or to improve its activities [1], [4]. During the change from one relatively stable state to another, the quality of the data of the existing BP is improved and/or new data is obtained that gives an opportunity for a more detailed BP analysis. This requires extending IS activities with new data storage functions. Due to the new/improved data, new knowledge becomes available. This new knowledge causes changes in BP activities, control is improved, and the product is improved. No change is needed in such change elements as BP territory and IS users.

Pattern Extrec occurs in situations where part of the BP activities is handed over to another BP. From the IS point of view this is indicated by receiving new data, new users and changes in IS activities. Handing over particular activities enables improvement of the control of the process and the use of cheaper resources, as part of the former activities is performed in another territory. To enable another BP to take over the activities, it is necessary to support it with knowledge about the activities. The process still needs data about the former activities; therefore data is received from the other BP or new users are added to the IS.
Pattern Extsend is similar to Pattern Extrec. The difference is that the BP receives new activities instead of handing them over. Knowledge has to be received together with the activities, and data about the activities is sent to the process from which the activities were received, or new users are added to the IS. While a particular sequence of events was used in the pattern descriptions above, it characterizes only one possible sequence of events inside the pattern. The main emphasis is on the possibility to ascertain that, if a particular change pattern is identified, all the needed complementary changes are taken care of (e.g., it is not forgotten to transfer knowledge together with the activities that are handed over to another BP).
Fig. 2. The main window of the IS change management tool
A prototype IS change management tool was developed for practical use of the patterns. The purpose of the prototype is: (1) to check the completeness of IS and BP change projects, (2) to identify new directions of IS and BP changes, and (3) to predict the IS and BP changes if one of the elements changes. The tool supports the following functions:

• Choice of the appropriate basic change pattern (based on the description of the IS and BP change project).
• Identification of element changes.
• Representation of pattern analysis results.
The main window of the tool is presented in Figure 2. Part 1 of the window presents the list of organizational BPs. The details of a process are presented in browse and sub-windows. Part 2 of the window presents the patterns. Here it is possible to choose one of the three basic change patterns and to see which changes are essential in each pattern. Part 3 of the window shows the list of other BPs (for patterns Extrec and Extsend) with the corresponding details of these processes. This part is necessary to show the relations between the elements of the several BPs involved in the change process.
5 Conclusions

Changes in a business process and changes in the information system that supports the process usually complement one another. In many situations it is not even possible to detect which changes are causes and which are effects. However, this research has shown that it is possible to identify sets of changes that usually happen when one of the business process or information system elements changes its state. These sets of changes may be used as patterns for situation analysis in change cases, to anticipate the full range of activities that need to be performed in IS change management in SMEs. The paper presents theoretically derived main change elements, change values, and change patterns in which the changes of states of several elements are amalgamated. These patterns were analyzed according to different information system and change management theories and tested in 48 real information systems change cases in SMEs. Both the theoretical and the empirical research results have pointed to three basic change patterns. Based on these three patterns, a prototype tool for supporting information systems change management in SMEs is under development. The approach presented in the paper gives an opportunity to improve information systems change management by checking whether all potentially needed changes are planned and introduced into the information system and business processes. The approach discussed in this paper is designed and tested for IS change management in SMEs; its applicability to large companies has not yet been investigated. Future research is concerned with developing an IS change knowledge base for monitoring the usability and relevance of existing patterns and with the discovery of new change patterns that may occur because of the use of currently unknown new business and IS solutions.
References

1. Maddison, R., Darnton, G.: Information Systems in Organizations: Improving Business Processes. Chapman & Hall, Boca Raton (1996)
2. Mumford, E.: Redesigning Human Systems. Information Science Publishing, United Kingdom (2003)
3. Teng, J.T., Grover, V., Fiedler, K.D.: Initiating and Implementing Business Process Change: Lessons Learned from Ten Years of Inquiry. In: Grover, V., Kettinger, W. (eds.) Process Think: Winning Perspectives For Business Change In The Information Age, pp. 73–114. Idea Group Publishing, United Kingdom (2000)
4. Harrington, H.J., Esselding, E.C., Nimwegen, H.: Business Process Improvement Workbook: Documentation, Analysis, Design and Management of Business Process Improvement. McGraw-Hill, New York (1997)
5. Skalle, H., Ramachandran, S., Schuster, M., Szaloky, V., Antoun, S.: Aligning Business Process Management, Service-Oriented Architecture, and Lean Six Sigma for Real Business Results. IBM Redbooks (2009)
6. Spadoni, M., Abdomoleh, A.: Information Systems Architecture for Business Process Modeling. In: Saha, P. (ed.) Handbook of Enterprise Systems Architecture in Practice, pp. 366–380. IGI Global (2007)
7. Daoudi, F., Nurcan, S.: A Benchmarking Framework for Methods to Design Flexible Business Processes. In: Software Process Improvement and Practice, pp. 51–63 (2007)
8. Goikoetxea, A.: Enterprise Architecture and Digital Administration: Planning, Design and Assessment. World Scientific Publishing Co. Pte. Ltd., Singapore (2007)
9. Zachman, J.: A Framework for Information Systems Architecture. IBM Systems Journal 26(3) (1987)
10. Goethals, F.: An Overview of Enterprise Architecture Deliverables, http://www.cioindex.com/nm/articlefiles/64015-GoethalsOverviewexistingframeworks.pdf
11. Diehl, M.: FEAF Level IV Matrix, http://www.markdiehl.com/FEAF/feaf_matrix.htm
12. Zacarias, M., Caetano, A., Magalhaes, R., Pinto, H.S., Tribolet, J.: Adding a Human Perspective to Enterprise Architectures. In: Proceedings of the 18th International Workshop on Database and Expert Systems Applications, pp. 840–844 (2007)
13. Robinson, P., Gout, F.: Extreme Architecture Framework: A Minimalist Framework for Modern Times. In: Saha, P. (ed.) Handbook of Enterprise Systems Architecture in Practice, pp. 18–36. IGI Global (2007)
14. http://www.ibm.com/developerworks/library/ar-togaf1/#N10096
15. Reference Model of Open Distributed Processing, http://en.wikipedia.org/wiki/RM-ODP
16. Extending the RUP with the Zachman Framework, http://www.enterpriseunifiedprocess.com/essays/zachmanFramework.html
17. DoD Architecture Framework, Version 1.5, vol. 2: Product Description, http://www.defenselink.mil/cio-nii/docs/DoDAF_Volume_II.pdf
18. GERAM: Generalised Enterprise Reference Architecture and Methodology, Version 1.6.3. IFIP – IFAC Task Force on Architectures for Enterprise Integration, http://www.cit.gu.edu.au/~bernus/taskforce/geram/versions/geram1-6-3/v1.6.3.html
19. Nazzal, D.: Reference Architecture for Enterprise Integration: CIMOSA, GRAI/GIM, PERA, http://www2.isye.gatech.edu/~lfm/8851/EIRA.ppt#264,8,CIMOSAEnterprise
20. Theories Used in IS Research Wiki, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Main_Page
21. Theory of Administrative Behavior, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Administrative_behavior%2C_theory_of
22. Organizational Knowledge Creation Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Organizational_knowledge_creation
23. Language Action Perspective, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Language_action_perspective
24. Transactive Memory Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Transactive_memory_theory
25. Knowledge-Based Theory of the Firm, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Knowledge-based_theory_of_the_firm
26. Agency Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Agency_theory
27. Principal Agent Problem, http://en.wikipedia.org/wiki/Principal-agent_problem
28. Media Richness Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Media_richness_theory
29. Argumentation Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Argumentation_theory
30. A Description of Toulmin's Layout of Argumentation, http://www.unl.edu/speech/comm109/Toulmin/layout.htm
31. Cognitive Fit Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Cognitive_fit_theory
32. Transaction Cost Economics, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Transaction_cost_economics
33. Dynamic Capabilities, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Dynamic_capabilities
34. Organizational Information Processing Theory, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Organizational_information_processing_theory
35. The Resource-Based View of the Firm, http://www.fsc.yorku.ca/york/istheory/wiki/index.php/Resource-based_view_of_the_firm
36. The S-Curve and Technology Adoption, http://en.wikipedia.org/wiki/Diffusion_of_innovations
37. Watson, R.T., Pitt, L.F., Berthon, P.R.: Service: The Future. In: Grover, V., Kettinger, W. (eds.) Process Think: Winning Perspectives For Business Change In The Information Age. Idea Group Publishing, Hershey
38. Kien, S.S., Siong, N.B.: Reengineering Effectiveness and the Redesign of Organisational Control: A Case Study of the Inland Revenue Authority of Singapore. In: Grover, V., Kettinger, W. (eds.) Process Think: Winning Perspectives For Business Change In The Information Age. Idea Group Publishing, Hershey
39. Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, vol. 2: Inference and Relationship. Charles Griffin & Company Limited, London
Performance Driven Database Design for Scalable Web Applications

Jozsef Patvarczki, Murali Mani, and Neil Heffernan

Worcester Polytechnic Institute, Department of Computer Science, 100 Institute Road, Worcester, Massachusetts, 01609, US
{patvarcz,mmani,nth}@cs.wpi.edu
Abstract. Scaling up web applications requires distribution of load across multiple application servers and across multiple database servers. Distributing load across multiple application servers is fairly straightforward; however, distributing load (select and UDI queries) across multiple database servers is more complex because of the synchronization requirements for multiple copies of the data. Different techniques have been investigated for data placement across multiple database servers, such as replication, partitioning and de-normalization. In this paper, we describe our architecture that utilizes these data placement techniques for determining the best possible layout of data. Our solution is general, and other data placement techniques can be integrated within our system. Once the data is laid out on the different database servers, our efficient query router routes the queries to the appropriate database server/(s). Our query router maintains multiple connections for a database server so that many queries are executed simultaneously on a database server, thus increasing the utilization of each database server. Our query router also implements a locking mechanism to ensure that the queries on a database server are executed in order. We have implemented our solutions in our system, which we call SIPD (System for Intelligent Placement of Data). Preliminary experimental results illustrate the significant performance benefits achievable by our system.

Keywords: Scalability, Web application, database design.
1 Introduction
There are thousands of web applications, and these systems need to figure out how to scale up their performance. Web applications typically have a 3-tier architecture consisting of clients, application servers, and database servers that work together. Significant work has been done on load balancers to solve possible scalability issues and to distribute requests equally among multiple application servers. However, issues related to increased database server usage and to distributing requests among multiple database servers have not been adequately addressed. The increasing load on the database layer can lead to slow response times, application errors, and, in the worst case, to different types of system crashes.
Our work is motivated by the ASSISTment Intelligent Tutoring System [6]. In the ASSISTment system the increasing number of sessions can easily be balanced among application servers, but the continuous database read (select) queries and update, delete and insert (UDI) queries decrease the system response time significantly. Currently, the ASSISTment system supports 3000 users, including 50 teachers from 15 public schools across Massachusetts. It consists of multiple application servers, a load balancer, and a database server. A characteristic of web applications such as our ASSISTment system is that we know all the incoming query templates beforehand, as the users typically interact with the system through a web interface such as web forms [5]. Traditional solutions for distributing load across multiple database servers, on the other hand, do not have this property [10]. This allows us to propose additional solutions for balancing load across multiple servers in the scenario of web applications, above and beyond what is supported for traditional applications.
1.1 Current Techniques for Distributing Load across Multiple Database Servers
Several techniques are known for distributing load across multiple database servers; one of them is replication [10]. In replication, a table is placed on more than one database server. In such a case, a select query on that table can be executed by any one of the database servers that have a replica of that table. An UDI query on that table, however, needs to be executed on all the database servers that have a replica of that table. If we do not know all the queries that the application may need to process beforehand, then one of the database servers must hold the entire data (all the tables) of that application. Such a layout of the data is needed to answer a query that needs to access all the tables. A drawback of this technique is that every UDI query needs to be executed against the node/(s) that hold the entire data, and thus these nodes become the bottleneck for performance. Such an architecture is supported by Oracle and is referred to as a master-slave architecture. In this case, the master node holds the entire data; every UDI query is executed against the master node and propagated to the slave nodes as necessary using log files. In the case of web applications, we no longer need a node that holds the entire data (assuming that none of the queries access all the data). We can therefore do a more intelligent placement of the data such that there is no node that must execute all UDI queries; thus we can remove the bottleneck node for UDI queries that is inherent in non-web applications. This improves the performance of read queries while not significantly impacting the performance of UDI queries. We discussed a simple master-slave architecture above, where there is a single master node. However, other master-slave architectures are possible where there is more than one master node. If there is more than one master node, there is no single point of failure, but there is a higher synchronization (or update propagation) cost. In case of full replication (all nodes are effectively master nodes), any node can act as a master when the original master fails, and the
routing of queries to the nodes is straightforward as any node can answer any query, but the updates have to be propagated to all the nodes. Another technique for distributing load across multiple database servers in web applications is partitioning of data, which includes both horizontal and vertical partitioning. Horizontal partitioning splits a table into multiple smaller tables containing the same number of columns but fewer rows. This can speed up query performance if data needs to be accessed from only one of the partitions. However, horizontal partitioning cannot be done in all circumstances if we want a query to be answered by one of the nodes (a good assumption for such systems). For instance, if there are two queries in the workload that access the same table, one which selects based on a column, say C1, and another which selects based on a column C2, then if we do horizontal partitioning based on the values in C1, this partitioning cannot be used to answer queries based on C2. Vertical partitioning splits a table into smaller ones with the same number of rows but fewer columns. It is a reasonable approach when the system does not need to combine records between the partitions. However, just like horizontal partitioning, vertical partitioning cannot be done in all scenarios either. For instance, suppose the query workload consists of a select query on a table, and there is also an insert statement which inserts values into all columns of the same table. In this case, performing inserts after vertical partitioning is cumbersome. Another big disadvantage of both partitioning schemes is that the system needs to maintain the partitions and balance the amount of data with built-in application logic. De-normalization [13] can optimize the performance of database systems as well. In de-normalization, one moves from higher to lower normal forms in the database model and adds redundant data. The performance improvement is achieved because some joins are already pre-computed. However, there are disadvantages; for instance, handling UDI queries is cumbersome when they are performed against de-normalized data, as we need to synchronize between duplicates.
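As a small, hypothetical illustration of the horizontal-partitioning limitation described above (not taken from the paper), the following Python sketch hash-partitions a table on column C1 across three nodes; a selection on C1 touches a single partition, while a selection on C2 must be sent to every partition:

    # Illustrative only: a table hash-partitioned on column C1 across three nodes.
    rows = [{"C1": i, "C2": i % 5} for i in range(12)]

    partitions = [[], [], []]
    for r in rows:
        partitions[hash(r["C1"]) % 3].append(r)

    # A select on the partitioning column C1 needs exactly one partition ...
    target = hash(7) % 3
    hits_on_c1 = [r for r in partitions[target] if r["C1"] == 7]

    # ... whereas a select on C2 must scan every partition.
    hits_on_c2 = [r for part in partitions for r in part if r["C2"] == 2]

    print(hits_on_c1, hits_on_c2)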
1.2 Proposed Solution
In this paper, we propose a generic architecture for balancing load across multiple database servers for web applications. There are two core parts of our system: (a) the data placement algorithm produces a layout structure, describing how the data needs to be laid out across multiple database servers for best possible performance and performs the actual layout; this algorithm is independent of the placement techniques considered, and (b) the query router that utilizes the layout structure produced by the data placement algorithm for routing queries and ensuring that the database servers are being utilized effectively. For determining the best possible layout structure, the data placement algorithm uses the given query workload (the percentage of queries for each template), the time it takes to execute a select/UDI query type (this time is measured as will be described in Section 4), and the time it takes to execute a select/UDI query if the table/(s) are partitioned, replicated or de-normalized (this time can
be either measured or estimated as will be described in Section 4). After determining the best possible layout structure, the data is laid out across the different database servers, and the system is ready to start processing the incoming requests from the applications. For determining on which node/(s) a query must be executed, we have developed a query router. Our query router is efficient and manages multiple connections per database server so that any database server is executing multiple queries simultaneously; this is to ensure that each database server is utilized as efficiently as possible. The query router also implements a simple locking mechanism to handle conflicting requests. We have integrated our placement algorithm and query router into our prototype system that we call SIPD (System for Intelligent Placement of Data). Our system is quite general: it can be used by any web application, and new placement techniques can be integrated as needed.
1.3 Contributions
Our contributions in this paper include the following:

– We propose a data placement algorithm that is general. Our placement algorithm considers the given query workload (consisting of select and UDI queries) and the time for each query, and determines the best possible placement of data across multiple database server nodes. Our placement algorithm is general in that other techniques for placement can be integrated into our algorithm.
– We propose an efficient distributed query router architecture that routes the queries to the different database servers, while ensuring that all the database servers are utilized efficiently. For ensuring that each database server is utilized efficiently, our query router maintains multiple connections to each database server; thus any database server is executing multiple queries simultaneously.
– We have integrated our data placement algorithm and our query router into a prototype system that we call SIPD (System for Intelligent Placement of Data). Our system is general in that it can be used by any web application.
– We have performed an initial performance evaluation of our system. As an illustration, we describe the performance benefits observed by one of the placement techniques: horizontal partitioning. We also illustrate the overall performance benefit for a web application, the ASSISTment system.

Outline of the Paper: In Section 2 we define the data placement problem. Our solutions for data placement and for routing the queries are described in Section 3. Our prototype system implementation (SIPD) is discussed in Section 4. Experimental results are discussed in Section 5, and in Section 6 we discuss other aspects of improving performance. Section 7 describes related work; Section 8 concludes the work and discusses future directions.
2 The Data Placement Problem
Our general architecture for a web-based application is shown in Figure 1. First, the data is placed on different database servers. Different clients connect and issue requests, which are distributed across different application servers by the load balancer. Balancing the load across different application servers can be done effectively by scheduling the requests using simple schemes such as round-robin, or by scheduling the next request on the currently least loaded server; these are not discussed further in this paper. A request may need to access data in the database server, in which case a query is issued to the query router. The query router has the logic to route the queries to the appropriate database server/(s). In short, the query router maintains the information about how the data is placed across the different database servers. Let us motivate the data placement problem using a very thinned-down schema of the ASSISTment system. The portion of the schema that we consider includes users (students), schools, user roles (which maintains the school that a user attends), problems, and logged action (which maintains all the actions of every user, including logins of a user and problems that a user has attempted). We collected 16 query templates for illustration, as shown in Table 1. Note that for illustration purposes we used only simple queries that do not perform a join. This data was collected over the duration of one week from our real application, and we counted the number of queries for each template. The total number of queries for these 16 templates over the week was about 360,000. We also show the number of rows of each table at the end of the week over which the data was collected. Before we describe our data placement algorithm, let us examine Table 1 closely and study what issues the placement algorithm may have to tackle.
Fig. 1. General Architecture for a web application. The requests are distributed among the different application servers by the load balancer. Requests that need to access the data are sent to the query router, which routes the query to the appropriate database server/(s).
As there are many updates against the logged action table, if logged action is replicated, the costs of performing these updates will be very high. Instead, it might be better to perform a horizontal partitioning of the logged action table and place the different partitions on the different database server nodes. We notice that there are a lot of updates against the problems table as well (the ratio of UDI queries to select queries is roughly 1:14). However, Q8, Q9 and Q10 all access the problems table but perform selects on different columns (Q11 and Q12 use the same column as Q9). In this case, we may want to consider maintaining only one copy of the problems table (rather than replicating or horizontally partitioning it). Once a table is placed on only some of the database server nodes, the load on the different database servers may become skewed. For instance, suppose the problems table is placed on node 1; there is then additional load on node 1 as compared to the other nodes. This may impact the horizontal partitioning: for instance, when logged action is partitioned across nodes 1 and 2, a smaller partition may now be kept on node 1 as opposed to node 2. Let us now define the data placement problem as follows: we are given a query workload that describes all the query templates for an application and the percentage of queries of each template that the application typically processes; determine the best possible placement of the tables on the different database server nodes. One can optimize based on different criteria: for instance, we can minimize response time, maximize total throughput, minimize latency, minimize the maximum load on the database servers, etc.
3 Balancing the Load across Multiple Database Servers
As described before, our solution to balance the load across multiple database servers consists of two core parts: (a) the data placement algorithm that produces the best possible layout structure and distributes the tables across the multiple database servers according to this layout structure, and (b) the query router that utilizes this layout structure for routing queries while ensuring that the database servers are utilized efficiently.
3.1 Data Placement Solution
Given the query workload, we want to determine a possible placement, such as the one shown in Figure 3. Figure 3 shows that the users, schools and user roles tables are fully replicated across all nodes. The problems table is placed on node 1, and logged action is horizontally partitioned uniformly across nodes 2-5. In this section, we describe our algorithm, which uses a cost-based approach and, given any query workload, determines the best possible placement of the tables. Our data placement algorithm is shown in Figure 2. Let us examine this data placement algorithm in detail. The dataLayout is the data structure that returns the best possible placement as determined by our algorithm.
Step0. Determine the cost for each query template by running each template on a lightly loaded node.
Step1. Initialize an array dataLayout that maintains the current data placed on each database server. The initial data on each database server is set to empty.
Step2. Initialize an array, currLoad, that maintains the current load on each database server. The initial load for each database server is set to 0.
Step3. For each pair, initialize setOfOptions to all possible options.
  // for instance setOfOptions = {replication, horizontal partition, vertical partition, de-normalization}
Step4. For every query in a template, remove invalid options from the setOfOptions.
  // vertical partition and de-normalization are invalid if there is an update on the table
Step5. Sort the query templates according to the cost, from the most expensive to the least expensive.
Step6. Iterate through the sorted list of query templates in a greedy fashion, and for each query template,
  Step 6.1. "Search" for the best possible placement for every table in the query.
  Step 6.2. Update the dataLayout array to indicate the data on each database server after this placement.
  Step 6.3. Update the currLoad array to indicate the load on each database server after this placement.
    // The currLoad array will reflect the cost for updates on these tables as well.
Step7. Lay out the tables across the different database servers according to the dataLayout array.
Fig. 2. Data Placement Algorithm. The dataLayout array returns the best possible layout of the tables across the different database servers.
First, a pair (described in Step 3) consists of a query template and a table that is accessed by the template. For instance, for Q1 in Table 1 we consider the pair (Q1, schools), whereas for Q4 we consider (Q4, users). For a join query, say Qi that joins tables T1 and T2, we consider both (Qi, T1) and (Qi, T2). Also, the set of options described in Steps 3 and 4 can be modified based on what options are suitable for a specific application. One could perform an exhaustive search for determining the best possible placement of the tables, but such an exhaustive search would be exponential in both the number of query templates and the number of nodes, which is not reasonable. Therefore our solution uses a greedy algorithm, considering the most expensive query first. This ensures that the algorithm is polynomial in the number of query templates. Step 6 is the crux of the algorithm. Step 6.1 searches for the best possible placement of the tables for a specific query. Here again, the options considered significantly impact the performance of the algorithm. For instance, what different ratios of placement do we consider for horizontal partitioning? On which database servers do we replicate a table? If we consider k options for placement of a table per database server, the number of options to be considered is k^n (exponential in the number of database servers, n). In our implementation (discussed in Section 4), we decrease the number of options considered by several means. For instance, for horizontal partitioning of a table, we consider only one option: partition the table based on the currLoad on the different database servers. This ensures that our algorithm is polynomial in the number of nodes as well. Once the layout of the tables for a query template has been determined, Step 6.3 of our placement algorithm updates the load on the different database servers. For determining the load on the different database servers, there are multiple options: we can actually perform the layout and empirically measure the cost, or we can estimate it by other means. Step 7 performs the actual layout of the data across the database servers. For the example in Table 1, our placement algorithm determined that the final best possible placement is as shown in Figure 3.

Table 1. Example illustrating Query Templates and Workload. # of rows denotes the number of rows in the tables accessed by the query.

 #  Query Template                                          Table name     % of queries  # of rows
 1  SELECT * FROM schools WHERE school.id=?                 schools        < 1%          321
 2  SELECT * FROM schools WHERE schools.name=?              schools        < 1%
 3  SELECT * FROM schools                                   schools        < 1%
 4  SELECT * FROM users WHERE users.id=?                    users          19%           30826
 5  SELECT * FROM users WHERE users.login=?                 users          < 1%
 6  UPDATE users WHERE users.id=?                           users          < 1%
 7  INSERT INTO users                                       users          < 1%
 8  SELECT * FROM problems WHERE problem.assignment_id=?    problems       13%           20566
 9  SELECT * FROM problems WHERE problems.id=?              problems       15%
10  SELECT * FROM problems WHERE problems.scaffold_id=?     problems       < 1%
11  UPDATE problems WHERE problems.id=?                     problems       1%
12  DELETE problems WHERE problems.id=?                     problems       1%
13  SELECT * FROM user_roles WHERE user_roles.id=?          user_roles     19%           42248
14  INSERT INTO user_roles                                  user_roles     < 1%
15  UPDATE logged_action WHERE logged_action.user_id=?      logged_action  16%           7274174
16  INSERT INTO logged_action                               logged_action  16%
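The following Python sketch mirrors the structure of the algorithm in Fig. 2 in a highly simplified form; it is an illustration only, not the authors' implementation, and the cost model and the choose_option search (Steps 0, 3, 4 and 6.1) are left as placeholder inputs:

    def greedy_placement(costs, workload, tables_of, n_nodes, choose_option):
        """Greedy placement in the spirit of Fig. 2 (simplified illustration).

        costs:      {template: measured cost}            (Step 0, given)
        workload:   {template: fraction of all queries}
        tables_of:  {template: [tables it accesses]}
        choose_option: callback standing in for Steps 3, 4 and 6.1; it returns
                       (option_name, {node: extra_load}) for a (table, template) pair.
        """
        data_layout = [set() for _ in range(n_nodes)]   # Step 1: tables per server
        curr_load = [0.0] * n_nodes                     # Step 2: load per server

        # Step 5: most expensive templates first (cost weighted by workload share).
        templates = sorted(costs, key=lambda t: costs[t] * workload[t], reverse=True)

        for t in templates:                              # Step 6: greedy iteration
            for table in tables_of[t]:
                option, load_per_node = choose_option(table, t, curr_load)  # 6.1
                for node, load in load_per_node.items():
                    data_layout[node].add((table, option))                  # 6.2
                    curr_load[node] += load                                 # 6.3
        return data_layout  # Step 7 would physically lay out the tables accordingly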
Performance Driven Database Design for Scalable Web Applications Table Node
schools
users
user_roles
problems
replication
replication
replication
placement
replication
replication
replication
horizontal partition Ratio: 25%
replication
replication
replication
horizontal partition Ratio: 25%
replication
replication
replication
horizontal partition Ratio: 25%
replication
replication
replication
horizontal partition Ratio: 25%
51
logged_action
node1
node2
node3
node4
node5
Fig. 3. Optimum Placement as Determined by our Intelligent Placement Algorithm for the Query Workload in Table 1
3.2 Routing the Queries
After the data is laid out across the different database servers, the system is ready to start processing the queries. The query router routes the queries to the appropriate database server/(s): a select query is sent to the appropriate database server, and an UDI query is sent to all the appropriate database servers. For performing the routing, the query router utilizes the dataLayout that is returned by the data placement algorithm. In addition to routing the queries correctly, the query router must also ensure that the database servers are utilized effectively. For this, we need to execute multiple queries on any database server at any instant, while also maintaining the correct in-order semantics specified by the application. Our solution includes an efficient query router that maintains multiple connections for each database server, thus enabling multiple concurrent queries on a database server. When multiple queries are executed on a single database server concurrently, we need to implement a locking mechanism to ensure the correct in-order semantics. Relying on the locking mechanism available at a database server is not sufficient. The locking mechanism provided by the query router must ensure that when there are two UDI queries against the same table, the two updates are performed in the order in which the requests arrived at the query router (and similarly for an UDI query and a select query). Our implementation includes a simple locking mechanism for handling conflicting queries, as will be described in detail in Section 4.
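A minimal sketch of such a routing rule is given below (an illustration under simplifying assumptions, not the actual SIPD code; partitioned tables are ignored, since for them the partition-column value would also have to be consulted):

    import itertools

    class SimpleRouter:
        """Toy routing rule over a dataLayout-style mapping.

        table_to_nodes maps each table to the nodes holding it, e.g.
        {"users": [0, 1, 2, 3, 4], "problems": [0], "logged_action": [1, 2, 3, 4]}.
        """

        def __init__(self, table_to_nodes):
            self.table_to_nodes = table_to_nodes
            # Round-robin iterators for tables replicated on several nodes.
            self.rr = {t: itertools.cycle(nodes) for t, nodes in table_to_nodes.items()}

        def route(self, table, is_udi):
            if is_udi:
                return list(self.table_to_nodes[table])  # UDI: every holding node
            return [next(self.rr[table])]                # select: one node, round-robin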
4 System Implementation
In this section, we describe our SIPD (System for Intelligent Placement of Data) implementation. We describe the choices made for our data placement algorithm, and the details of our query router implementation. Our implementation is based on the Python (http://www.python.org) language and uses the PostgreSQL (http://www.postgresql.org) database server.
4.1 Implementation of Data Placement
One of the first things we need is to determine the cost of executing a query on a node. For this, we have multiple options: we can estimate the cost (using EXPLAIN or EXPLAIN ANALYZE), or we can empirically measure the cost. The technique that we use for determining the cost is orthogonal to the rest of our solution. For our implementation, we follow the approach described in [5], where the authors observed that the costs are more accurately determined by executing the queries on a lightly loaded node and measuring the cost. We have implemented a simplified version of the data placement algorithm shown in Figure 2, where a table is either horizontally partitioned, fully replicated, or placed on exactly one node. Once the best possible placement is determined, the tables are actually laid out onto the different database servers. For performing the placement, suppose a table is determined to be partitioned across k nodes (say node 1 through node k) based on column c, and that pi percent of the data must be placed on node i. We partition the data using the values of the c column with a hash function that results in 100 buckets. On each database server node, we place the appropriate range of buckets. Note that this may result in some skewness in the data placement, and the placement may not exactly obey the percentages determined as optimum; however, if we choose a good hash function, the skewness can be minimized. After the placement is done, we are ready to process the incoming requests, as we describe in the following section.
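A sketch of the bucket-based placement described above follows (illustrative only; the concrete hash function used here, MD5, is an assumption and not necessarily the one used in SIPD):

    import hashlib

    N_BUCKETS = 100

    def bucket_of(value):
        # Stable hash of the partitioning-column value into one of 100 buckets.
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % N_BUCKETS

    def bucket_ranges(percentages):
        # percentages, e.g. [25, 25, 25, 25], give each node's share of the data;
        # each node receives a contiguous range [lo, hi) of buckets.
        ranges, start = [], 0
        for p in percentages:
            end = start + round(N_BUCKETS * p / 100)
            ranges.append((start, end))
            start = end
        ranges[-1] = (ranges[-1][0], N_BUCKETS)  # absorb rounding error in the last node
        return ranges

    def node_for(value, ranges):
        b = bucket_of(value)
        for node, (lo, hi) in enumerate(ranges):
            if lo <= b < hi:
                return node
        raise ValueError("bucket %d not covered" % b)

    # Example: partition a table on column c equally across four nodes.
    ranges = bucket_ranges([25, 25, 25, 25])
    print(node_for(12345, ranges))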
4.2 Query Router Implementation
To process a request, we need a query router that routes the query to the appropriate database server. Our detailed architecture for routing queries is shown in Figure 4.
Fig. 4. Architecture of the Query Router. The query router routes the queries from the different application servers to the appropriate database server/(s). The thread that handles the requests for a database server maintains a queue of requests that the server needs to process, multiple connections to the server for executing multiple queries concurrently, and a lock table for ensuring in-order semantics among requests to the same server.
The queries from all the application servers are sent to the query router, where the requests are queued. The query router also maintains how the tables are placed on the different database server nodes (using the dataLayout structure returned by the placement algorithm); this information is used to route a query to the appropriate database server node/(s). In our system, how to route a query is determined statically and does not vary based on the current load on the database servers. A select query is routed by the query router to one database server, whereas an update query is routed to all the appropriate database servers. For example, a query of type Q1 may be routed to node 1; a query of type Q2 may be routed to node 5; a query of type Q6 has to be routed to all five nodes. For replicated tables, when a query can be answered by more than one node, our system routes the queries in a simple round-robin fashion. This ensures that the database servers are equally loaded. Note that we have made several assumptions: all database servers are homogeneous and take the same time to execute a query; the number of hops and the bandwidth from any application server to any database server are equal, thus guaranteeing the same network latency; and if multiple database server nodes have a replica of a table, then the load across these server nodes for this table is distributed uniformly. Each database server is managed by a thread that maintains two data structures: a queue of requests it has received, and a lock table to handle conflicting select and UDI queries. In order to increase the performance of each database server, the thread for the database server maintains multiple connections to that server; thus multiple queries can be executed simultaneously on a single server (see Figure 4). If multiple queries can be scheduled simultaneously on a database server, we need to implement a simple locking mechanism. Let us illustrate how the locking mechanism is implemented in our system using a lock table. Consider queries of type Q4 and Q7, which are conflicting: Q4 reads from the users table while Q7 inserts into the users table. If there is a query of type Q4 and a query of type Q7 both waiting to be serviced, in that order, they cannot be scheduled simultaneously. Rather, we have to wait for Q4 to finish before Q7 is scheduled. We cannot let the database server handle the conflict management, because it will not guarantee the serial order of Q4 and Q7. Such conflicts are handled using the lock table as follows: first, the thread for the database server examines the current query and sees if it can obtain the appropriate locks (read/exclusive lock). If the locks are available, the query is scheduled on one of the available connections; otherwise, the thread waits until the lock is available and then schedules the query on one of the connections. When the query is finished, the locks are updated accordingly. While a query is waiting for a lock to become available, the following queries in the thread queue are not scheduled (even though locks may be available for those queries); this is done to simplify our architecture.
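The lock table behaviour described above could be sketched as follows (a simplified illustration, not the authors' code; only table-level read/exclusive locks are modelled, and strict queue order is assumed to be enforced by the calling thread):

    import threading

    class LockTable:
        """Table-level read/exclusive locks kept by a database-server thread.
        Head-of-line blocking is assumed: the caller acquires locks strictly
        in queue order and does not look past a waiting query."""

        def __init__(self):
            self.cond = threading.Condition()
            self.readers = {}       # table -> number of active readers
            self.writers = set()    # tables currently held exclusively

        def acquire(self, table, exclusive):
            with self.cond:
                # A read lock waits only for an exclusive holder; an exclusive
                # lock waits for all readers and writers on the table.
                while (table in self.writers or
                       (exclusive and self.readers.get(table, 0) > 0)):
                    self.cond.wait()
                if exclusive:
                    self.writers.add(table)
                else:
                    self.readers[table] = self.readers.get(table, 0) + 1

        def release(self, table, exclusive):
            with self.cond:
                if exclusive:
                    self.writers.discard(table)
                else:
                    self.readers[table] -= 1
                self.cond.notify_all()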
5 Experimental Results
Our first set of experiments is aimed at illustrating the performance benefits of some of the data placement techniques. In Figure 5, we show the performance benefits achieved by horizontally partitioning the data for select and UDI queries. We started with a single partition holding 100% of the data, then two partitions each holding 50% of the data, down to 10 partitions each holding 10% of the data. The time to execute a select query decreased as the size of the partition decreased; so did the time to execute an update query. We obtained similar numbers for the other placement techniques as well (vertical partitioning and de-normalization).

Fig. 5. Illustrating the benefits of horizontal partitioning on select and update queries. As the size of the partition decreases, the time to execute a select or update query decreases.

After this initial set of experiments, we evaluated the results obtained from our intelligent placement algorithm. We compared the throughput of this best possible layout with full replication. For our tests, we ran 3262 queries in the same ratio as in the query workload described in Table 1. Our five database server nodes had the following configurations: node 1 is an Intel Pentium 4, 3 GHz machine with 2 GB RAM, running 32-bit Windows XP; nodes 2-5 are Intel Xeon 4-core machines with 8 GB RAM running FreeBSD 7.1 i386. The database software used on all five nodes is Postgres version 8.2.9. Our simulated application server, which issued the 3262 queries, was an Intel Pentium 4, 3 GHz machine with 4 GB RAM running Ubuntu 4.1.2. The code for this application is written in Python version 2.5. The bandwidth between the application server and the different database server nodes is 100 Mbps, and the number of hops from the application server to the database servers is equal. For our layout, the problems table was placed on database server node 5, and the logged action table was horizontally partitioned based on the user id column equally across nodes 1 through 4. Figure 6 illustrates the total time that it took each database server node to finish executing all the queries routed to that server by the query router. For full replication, it took the five nodes a total of around 180 seconds to finish executing all the 3262 queries. For the optimum placement, the five nodes finished executing all the 3262 queries in a total of around 81 seconds. Note that database server node 5 is more heavily loaded under optimum placement because it has to execute all the queries on the problems table (Q8 - Q12).
Fig. 6. Illustrating our Data Placement Algorithm Results. For each node, it shows the time it took to execute the query set for full replication, and for optimum placement, along with the standard deviation.
It is possible to schedule fewer of the queries on the fully replicated tables (Q1 - Q5) on node 5 to make all the nodes more equally loaded.
6 Discussion
The techniques described in this paper assume that the data is distributed on different database servers in such a manner that any select query is answered by one database server. However, there is significant work that has studied scenarios without this constraint. Distributed databases and distributed query processing [10] have long studied how to process queries over data distributed across multiple nodes. However, the constraint that any select query is answered by one database server is applicable to several applications, especially web applications where all the query templates are known beforehand. This constraint also greatly simplifies query processing and optimization, as no data needs to be exchanged between nodes. Therefore such a system has to determine only which database server needs to execute a query, and then the optimization and execution of the query proceed on that server as if it were a non-distributed database. Also, as we examine web applications, we see that the load across application servers can be easily balanced and the application server layer scales up easily. This is because the application server logic can be easily replicated across any number of nodes. One potential opportunity for database scalability is to pull some of the database functionality that can be easily replicated out of the database server. For instance, a selection operation that scans a set of rows and selects rows based on a filter condition can be pulled outside the database server. The selection operation can easily be replicated across multiple servers. However, this comes at a cost: the database server may be able to perform the selection more efficiently, for instance by building an index, whereas these options may not be available in the selection operation outside the database server. Note
that this is different from full-fledged distributed query processing, where different nodes can perform different operations. We believe that this is a promising direction that we plan to investigate in the future. In real systems, we encounter system crashes quite often, and these crashes also need to be handled. In this paper, we did not consider fault tolerance. Incorporating fault tolerance into the problem definition could potentially lead to interesting results. For instance, one way of formulating the problem definition with fault tolerance is to impose a constraint that every data item is present on at least two nodes. This is also a promising research direction, worth investigating in the future. Another aspect of fault tolerance is how to handle the case where an UDI query fails on some nodes and succeeds on other nodes: how do we detect this scenario, and how do we remedy such an inconsistency? One can think of a distributed transaction protocol, but such distributed transactions are very heavyweight and drastically bring down the performance of a system. We therefore need to investigate different semantics as may be applicable for these scenarios, and which can be implemented without drastically impacting the performance of the overall system.
7 Related Work
There has been a considerable amount of work on distributing load across multiple database servers. In [3,8,9,12], the authors study full replication, where all the nodes have an exact replica of the data and where data consistency is achieved using distributed transactions that are heavy-weight. Commercial systems such as Oracle also support full replication, but UDI queries are not performed using distributed transactions; rather, the updates are performed on a master and then the update logs are propagated to the slave nodes. In [4], the application programmers can choose the data replication and distribution strategies, but choosing such strategies efficiently is not easy for an application programmer. Partial replication is studied in [14,5]; in [14], the replication is at a record-level granularity, requiring a node to hold the entire database and thus be the bottleneck; in [5], the replication is at a table-level granularity, and no node needs to hold the entire database. For improving the performance of database systems, de-normalization has been studied in several projects [13,7,16]. One of the main purposes of de-normalization is to decrease the number of tables that must be accessed to answer a query; this is because some joins are already pre-computed during the de-normalization process. Another technique that is critical for improved performance of applications is caching [15,1]. If the results of a query are cached, it is possible for the application server to answer a query directly from the cache without accessing the database server. This can be critical if the network bandwidth between the application and database server is low. Efficiently maintaining the consistency of the cache is studied in [15].
8 Conclusions and Future Work
In this paper, we studied the problem of scalability in web applications; specifically, we considered distributing load across multiple database servers. We proposed a data placement algorithm that can consider multiple data placement techniques and determine the best possible layout of tables across multiple database servers for a given query workload. For routing the queries, we have developed an efficient query router; the query router routes the queries to the appropriate database server/(s). The query router maintains multiple connections for each database server to ensure that the database servers are utilized efficiently; also, a simple locking mechanism is supported to handle conflicting queries. Our solutions are integrated into the SIPD (System for Intelligent Placement of Data) that we have developed. Experimental results indicate the significant performance benefits achieved by our system. There are several issues and approaches that need to be investigated for scalability of database servers. Some of the potential future directions for research include pulling some functionality out of the database server to enable easy replication of this logic, distributed query processing in general, considering fault tolerance as an application constraint, and handling inconsistencies that may result if an operation fails on some nodes and succeeds on other nodes. Also, with respect to our approach, other effective locking mechanisms that operate at a finer granularity and that can achieve better performance need to be investigated. To increase the system performance, we have to investigate different techniques to decentralize our query router and to avoid an inappropriate locking mechanism by means of proper caching [11] or query planning [2] solutions. Evaluating our solution against many different web applications to illustrate the benefits of our approach will also be useful.
References

1. Amiri, K., Park, S., Tewari, R., Padmanabhan, S.: DBProxy: A Dynamic Data Cache for Web Applications. In: IEEE Int'l Conference on Data Engineering (ICDE), Bangalore, India (March 2003)
2. Böhm, K., Mlivoncic, M., Weber, R.: Quality-aware and load-sensitive planning of image similarity queries. In: Proceedings of the 17th International Conference on Data Engineering, Washington, DC, USA, pp. 401–410 (2001)
3. Cecchet, E.: C-JDBC: A Middleware Framework for Database Clustering. IEEE Data Engineering Bulletin 27(2), 19–26 (2004)
4. Gao, L., Dahlin, M., Nayate, A., Zheng, J., Iyengar, A.: Application Specific Data Replication for Edge Services. In: Int'l World Wide Web Conf. (WWW), Budapest, Hungary (May 2003)
5. Groothuyse, T., Sivasubramanian, S., Pierre, G.: GlobeTP: Template-Based Database Replication for Scalable Web Applications. In: Int'l World Wide Web Conf. (WWW), Alberta, Canada (May 2007)
6. Heffernan, N.T., Turner, T.E., Lourenco, A.L.N., Macasek, M.A., Nuzzo-Jones, G., Koedinger, K.R.: The ASSISTment Builder: Towards an Analysis of Cost Effectiveness of ITS Creation. In: FLAIRS, Florida, USA (2006)
7. Inmon, W.H.: Information Engineering for the Practitioner: Putting Theory Into Practice. Prentice Hall, Englewood Cliffs (1988)
8. Kemme, B., Alonso, G.: Don't be Lazy, be Consistent: Postgres-R, a New Way to Implement Database Replication. In: Int'l Conference on Very Large Data Bases (VLDB), Cairo, Egypt (September 2000)
9. Plattner, C., Alonso, G.: Ganymed: Scalable Replication for Transactional Web Applications. In: Jacobsen, H.-A. (ed.) Middleware 2004. LNCS, vol. 3231, pp. 155–174. Springer, Heidelberg (2004)
10. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)
11. Röhm, U., Böhm, K., Schek, H.-J.: Cache-aware query routing in a cluster of databases. In: Proceedings of the 17th International Conference on Data Engineering, Washington, DC, USA, pp. 641–650 (2001)
12. Ronstrom, M., Thalmann, L.: MySQL Cluster Architecture Overview. MySQL Technical White Paper (April 2004)
13. Schkolnick, M., Sorenson, P.: Denormalization: A Performance Oriented Database Design Technique. In: AICA Congress, Bologna, Italy (1980)
14. Sivasubramanian, S., Pierre, G., van Steen, M.: GlobeDB: Autonomic Data Replication for Web Applications. In: Int'l World Wide Web Conf. (WWW), Chiba, Japan (May 2005)
15. Tolia, N., Satyanarayanan, M.: Consistency-Preserving Caching of Dynamic Database Content. In: Int'l World Wide Web Conf. (WWW), Alberta, Canada (May 2007)
16. Westland, J.C.: Economic Incentives for Database Normalization. Information Processing and Management 28(5), 647–662 (1992)
Generic Entity Resolution in Relational Databases

Csaba István Sidló

Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, 1111 Budapest, Hungary
[email protected]
Abstract. Entity Resolution (ER) covers the problem of identifying distinct representations of real-world entities in heterogeneous databases. We consider the generic formulation of ER problems (GER) with exact outcome. In practice, input data usually resides in relational databases and can grow to huge volumes. Yet, typical solutions described in the literature employ standalone memory resident algorithms. In this paper we utilize facilities of standard, unmodified relational database management systems (RDBMS) to enhance the efficiency of GER algorithms. We study and revise the problem formulation, and propose practical and efficient algorithms optimized for RDBMS external memory processing. We outline a real-world scenario and demonstrate the advantage of algorithms by performing experiments on insurance customer data.
1 Introduction
Entity Resolution (ER) is an important problem of data cleansing and information integration, with the main goal of identifying and grouping all data elements of heterogeneous data sources that refer to the same underlying conceptual entity. Duplicated entity representations raise severe data quality issues leading to corrupted aggregations that may eventually mislead management decisions or operational processes. Several areas would profit from an efficient solution of ER problems. Search engines could identify and group together web pages dealing with the same entity, such as a person or a product. Web services could identify duplicated registrations. Stores or auction web sites could group together different items of products. Various ER solutions can be classified as either attribute based or link based. Attribute-based approaches consider input data as a set of records made up of attributes, with a resolution process based on record similarities. Link-based methods handle input data as a reference graph, with nodes as entity records and edges as links between these nodes.
This work was supported by grants OTKA NK 72845 and NKFP-07-A2 TEXTREND.
J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 59–73, 2009. c Springer-Verlag Berlin Heidelberg 2009
60
C.I. Sidl´ o
entity records together. These methods can be considered as graph clustering algorithms. The main focus of this paper is to develop efficient, industry scale attributebased methods. We build on the generic ER (GER) formulation ([3]) of ER that uses general black-box match and merge functions on record pairs. The goal is to produce a merge closure of the original record set, the smallest set of records that cannot be extended by adding new merged records. Existing algorithms for GER are in-memory algorithms and keep the whole closure set in memory, therefore the scalability of these methods is limited. Although the algorithms are optimal in that the number of required match operations is kept minimal, pairwise search for matching pairs would require more efficient data structures than those in existing implementations in order to scale to practical applications. In this paper we give a set based formulation of GER to enable using efficient external memory data structures and algorithms. Since standard relational databases already offer general, well-tuned algorithms on relations for batch processing, our algorithms will use tables as main data structures with relational operations expressed as SQL statements. Our methods will hence suit a uniform architecture with efficient storage, memory management, caching, searching and indexing facilities. Our algorithms are tightly coupled to the database, a beneficial property in practice since input data usually resides in a relational database. We demonstrate the advantage of our approach when huge amounts of data has to be handled. Our motivating application is client data integration of several insurance source systems by the AEGON Hungary Insurance Ltd.1 The ER problem comes into sight during the construction of a client data mart over legacy systems that remained independent of each other for operational reasons during mergers and ownership changes. Data integration begins with cleaning and loading data into a unified schema by massive ETL tools. Then a slowly-changing, versioned client dimension is built up that includes all available attributes, with additional fact tables providing relations between clients and other dimensions such as contract or postal address. Despite the exhaustive pre-processing, several duplicates remained due to different attribute sets in the source systems, different data recording and storage policies as well as variation of the attributes over time. The AEGON data mart used in our experiments has tens of millions of source records, which makes the use of in-memory algorithms difficult. However, the Generic ER approach seems adequate for the requirements: domain experts define exact rules on client attributes for constructing match and merge functions of client records. Merging is required: a simple record has to be produced containing as much information of the underlying matching records as possible. Finally, an automated ER process with exact results has to be produced that can be used for data mart updates. We believe that similar tasks and 1
The AEGON Hungary has been a member of the AEGON Group since 1992, one of the world’s largest life insurance and pension groups, and a strong provider of investment products.
Generic Entity Resolution in Relational Databases
61
requirements commonly appear in practice and require the revision of existing GER formulations.
2
Generic ER
Next we examine the GER model of [3] briefly, including the match and merge (partial) relations for matching entities and the domination relation that, given a matching pair, points to the entity that contains more information, newer or better quality data. Let us assume a set of records I = {r1 , r2 , ...rn } ⊂ R, which we call an instance (R is a domain of records). Note that records are arbitrary elements, and do not necessary share the same structure. A match function is an R × R → {true, f alse} Boolean function, denoted as r1 ≈ r2 and r1 ≈ r2 . The merge : R × R → R partial function is defined on matching pairs of records, denoted as r1 , r2 (for every r1 ≈ r2 ). Finally we define a partial order on records, called domination: r1 r2 for every r1 ≈ r2 , if r2 gives a higher quality description of the underlying entity. Given an instance I, let the merge closure of I be a set of records that can be reached by recursively adding merged matching records to I. ER(I) denotes the resolved entity set : Let ER(I) be the smallest subset of the merge closure that could only be extended by records that are dominated by other records within ER(I). Up to this point ER(I) is a well-defined, but not necessary finite set (for example the merge function concatenating string records and r1 r2 meaning r2 is longer than r1 ). However, if we restrict the class of merge and match functions, then we can make ER(I) finite and independent of the record processing order. In [3] the following so-called ICAR (idempotent, commutative, associative and representative) properties are required: – idempotence: ∀ r : r ≈ r and r, r = r, – commutativity: ∀ r1 , r2 : r1 ≈ r2 ⇔ r2 ≈ r1 , and r1 ≈ r2 ⇒ r1 , r2 = r2 , r1 , – associativity: ∀ r1 , r2 , r3 where r1 , r2 , r3 and r1 , r2 , r3 exists: r1 , r2 , r3 = r1 , r2 , r3 , – representativity: ∀ r4 , r1 ≈ r4 : r3 = r1 , r2 ⇒ r3 ≈ r4 . If we use functions corresponding to the ICAR properties above, then we can use a natural partial ordering called merge domination: r1 is merge dominated by r2 , if r1 ≈ r2 and r1 , r2 = r2 . ICAR properties and merge domination reduce the computational complexity of the problem. In most cases domain knowledge can be translated to functions according to ICAR and merge domination. 2.1
Swoosh Algorithms
G-Swoosh and R-Swoosh [3] are basic algorithms for computing ER(I). GSwoosh solves the general ER problem, while R-Swoosh assumes ICAR properties and merge domination. Both algorithms are optimal in the sense of the required pairwise match operations.
62
C.I. Sidl´ o
Swoosh algorithms maintain two sets: I is the set of records which have to be processed, and I is the set of records which form the closure of the previously processed elements. G-Swoosh gets an element from I, matches against all elements of I , and adds the merged element to I. At the end of each round the selected element is moved to I . G-Swoosh eliminates dominated records after producing the whole closure. R-Swoosh enhances the process by dropping source tuples right after merging, which makes it unnecessary to eliminate dominated records at the end; besides, it keeps the size of I smaller. F-Swoosh [3] is the most efficient Swoosh algorithm, an extension of R-Swoosh, defining features on attributes to support matching and maintaining index-like structures to speed up searching for a matching pair.
3
Database GER Algorithms
The aforesaid data model is too general for RDBMS-based implementations: we would only like to deal with uniform relational instances. Let A1 , A2 ...An be attributes, and let a relational instance be Ir ⊆ ×ni=1 DOM (Ai ) = Rr . ER(Ir ) is also a relational instance. The relational instance is less general than the original concept, but still practical and flexible enough. We are going to use such instances, and tuples (records) of these instances, denoted as t ∈ Ir . We can adapt Swoosh algorithms to RDBMS environment using tables for I and I . Since data modification languages and APIs built around standard SQL do not enable implementing general algorithms, we have to use an embedding language. The implementation itself can be a standalone unit implemented using any programming language able to connect to relational databases, or it can be an embedded stored procedure. However, the space and time consuming operations can be formalized using SQL, which makes the role of the embedding language insignificant. 3.1
Relational GER
Pairwise match functions on relations can be expressed as filtering operations in the where clause of SQL queries. Next we will give examples arising in the insurance industry. We are dealing with identities, with match functions such as “two identities cover the same person, if they have the same tax number or social security number, or if the birth date and birth name attributes are both equal”. For example we can find matching pairs in R-Swoosh in the following way (supposing that t is an arbitrary record): select I .* from I where ( t.birth name id = I .birth name id and t.birth date = I .birth date ) or t.tax number = I .tax number or t.ss number = I .ss number
Generic Entity Resolution in Relational Databases
63
Merging two records can be expressed using functions and operators applied to the result set, in the select list of a query. The next example depicts a merge of t and t , using functions of the SQL-92 [1] specification: select coalesce(t.birth date, t .birth date) as birth date, ( case when length(t.name) ≥ length(t .name) then t.name else t .name end) as name, ... Regulations of our current SQL environment give a new set of constraints on expressing match and merge functions, as SQL is not a Turing-complete language (although using UDFs adds some more versatility). These new constraints are orthogonal to the ICAR properties: we can easily implement functions violating ICAR. As a simple example, the SQL merge expression “t.premium + t .premium as premium” violates ICAR. RDBMS allows us to carry out batched operations on relations efficiently. Next we re-define match and merge to fit the relational environment better. Let the relational match function be matchr : Rr × 2Rr → 2Rr , where 2Rr is the power set of Rr , the set of Ir instances. The matchr function compares a single record to an instance. Let the relational merge function be the merger : 2Rr → Rr partial function that is defined on instances, whose tuples match a single arbitrary tuple. The relational merge closure of an Ir relational instance is then defined as the smallest Ir subset of Ir , which satisfies ∀ S ⊆ Ir , ∀ t ∈ Ir : merger (matchr (t, S)) ⊆ Ir . Applying merges on the closure does not lead us out of the closure. The definition of domination stays the same as by the general model. The relational entity resolution of an Ir instance, denoted as RER(Ir ), is defined as the smallest subset of the relational merge closure that does not contain dominated records. We can derive the semantics of the new functions defined on tuple sets from the pairwise functions: the new match function should produce the set of all matching tuples of Ir . However, pairwise merge semantics can not always be easily translated to the new form. If we deal with ICAR pairwise functions, the semantics of the corresponding set-styled merge can be understood as applying pairwise merges in some arbitrary order to the original tuple. We can assume that matchr and merger are derived from pairwise functions having the ICAR properties the following way: matchr (t, Ir ) = {t ∈ Ir |t ≈ t } merger (Ir ) = t ∈ R, where Ir = {r1 , . . . rn }, t = . . . r1 , r2 , r3 . . . rn . We can use the merge domination for relational instances if match and merge functions can be derived from ICAR pairwise functions.
64
C.I. Sidl´ o
Algorithm 1. DB-G-GER input: I output: I = RER(I) 1: 2: 3: 4: 5: 6: 7: 8: 9:
I ← ∅ for all t ∈ I do add t to I merged ← merger (matchr (t, I )) if merged = t then add merged to I end if end for remove dominated elements from I
Now, instead of derived merger , we define a more general function class. We consider the relational match and merge functions, only if matchr can be derived from a pairwise function: matchr (t, Ir ) = {t ∈ Ir |t ≈ t }, and for all t, t tuples and I1 , I2 ⊆ Ir instances, properties t ≈ t ⇒ t ≈ t, t ≈ t, t = merger (matchr (t, Ir )) ⇒ merger (matchr (t, Ir ∪ {t })) = t , if exists: merger (I1 ∪ I2 ) = merger (merger (I1 ) ∪ merger (I2 ))
(1)
hold (a sort of idempotency and associativy). The properties reduce the complexity of computing RER(Ir ), guarantee that RER(Ir ) is finite, and the construction does not depend on the order of operations. In practice most of the useful functions can be formulated to meet these criteria. SQL implementation of matchr is parallel with pairwise match functions. When implementing merger functions we would like to formalize the semantics in a single select clause. We use grouping selects to collect matching records, and aggregate functions to implement semantics. For example a simple merge function that chooses an arbitrary not-null value can be formalized as follows: select max(birth date) as birth date, max(birth name) as birth name, . . . Aggregate functions of our preferred RDBMS can limit the choice of possible set-style merge functions. Windowing analytic aggregate functions of Oracle or other interesting extensions of SQL-92 aggregate functions in other RDBMSs may give us sufficient versatility. We can express complex merge functions such as “the longest name’s id” or “the passport id that occurs most often”. 3.2
DB-GER Algorithms
DB-G-GER algorithm (Alg. 1) computes RER(Ir ) when all the properties of (1) except for merge domination hold. DB-G-GER iterates through the input relational instance I, and maintains an instance I with the previously processed
Generic Entity Resolution in Relational Databases
65
Algorithm 2. DB-GER input: I output: I = RER(I) 1: I ← ∅ 2: for all t ∈ I do 3: add t to I 4: merged ← merger (matchr (t, I )) 5: if merged = t then 6: remove matchr (t, I ) from I 7: add merged to I 8: end if 9: end for
and merged elements. In every iteration step I is the resolved entity set of the previously processed elements. The main step is line 4, which can be expressed as a single SQL statement using aggregate functions, as the next example shows: select count (*), max(birth name), max(birth date), . . . from I where ( t.birth name id = I .birth name id and t.birth date = I .birth date ) or t.tax number = I .tax number or t.ss number = I .ss number Since t is already in I , we merge at least one tuple. If the merge query groups only one tuple together, we can be sure that in line 5 the merged element is the same as t: this follows by the properties of (1). We do not presume merge domination, therefore we have to eliminate dominated records in a separate step (line 9). We can build up a batched SQL statement to select dominated records in the following fashion: select i2 .* from I as i1 , I as i2 where i1 .rowid = i2 .rowid and i1 .tax number = i2 .tax number or . . . ( case when i1 .birth date is null then 0 else 1 end ) + . . . < ( case when i2 .birth date is null then 0 else 1 end ) + . . . or . . . Here we formalized a simple domination relation: a tuple dominates another matching tuple if it contains more non-null attributes. The next algorithm, DB-GER (Alg. 2) presumes merge domination. It eliminates dominated records right after merging, therefore shrinks I in every round. Line 6 can be implemented on relations as follows: delete from I where
i1 .tax number = i2 .tax number or . . .
Booth DB-G-GER and DB-GER produce RER(Ir ), and can be implemented using efficient batched database operations.
66
3.3
C.I. Sidl´ o
Strong Merge Domination
Merge domination is a useful construct for reducing the size of RER(I), while retaining all the information in RER(I). Yet, ICAR properties of pairwise functions are sometimes too strict in practice. Consider the next example: a match function of identities uses conditions based on a tax number equality subcondition and a combined sub-condition of birth name, current name and birth date attributes. We would like to implement a merge function that collects the more accurate birth date, the longest name and one of the tax numbers if more tax numbers are present. If we collect and merge matching tuples of a given record, the merged tuple can be a new one that does not match the original one: we overwrite the matching features. We define a new domination relation called strong merge domination that assumes only the properties of (1). The goal is to retain source records containing information needed to find merged records. Strong merge domination defines a partial ordering of a given instance I and for tuples t1 and t2 in I: t2 is strong merge dominated by t1 if t1 ≈ t2 and merger (matchr (t1 , I \ {t2 })) = t1 . Strong merge domination enables dropping source records that are similar to the merged record instantly (but not all source records). If we use properties of (1) and strong merge domination, algorithm DB-GER (Alg. 2) have to be modified: line 6 changes to “remove matchr (merged, I ) from I ”. 3.4
Indices and Features
An advantage of using functions defined on sets is that we can search for matching tuples using indices instead of going through all elements of a set and making pairwise matches. When DB-GER merges matching records in line 4, the indices suggest records that satisfy at least one part of the match criteria. If table I is sparse enough, an index and then a directed table access can be a lot less costly than a full table scan. The time cost of searching in a regular B-tree index depends on the depth of the search tree, which grows much more slowly than the number of elements. The idea of shaping features on attributes and making feature-level decisions in [3] has the same motivation as indexing. A feature is a subset of attributes, and the match criteria is a combination of feature-based conditions. Two records match if at least one feature-pair indicates matching. F-Swoosh [3], the featurelevel ER algorithm stores positive feature-comparisons in a linear space hash table. Another set is also maintained for storing features that gave only negative matches before. These structures can also be interpreted as indices. Available types of indices are RDBMS-dependent. Besides the basic B-tree variants we may use bitmap, spatial (GIS), multimedia indices or indices for text similarity search. Multidimensional indices such as general R-trees can be very useful.
Generic Entity Resolution in Relational Databases
67
We may expect major performance improvement with adequate indexing. However, greedy indexing can harm performance if index updates cost more than the search time improvement. As a basic index selection strategy we can build an index for the feature with the least selectivity. We will examine some observations related to indexing in Section 4. 3.5
Pre-filtering
In practice there may exist records that do not contain enough information to meet the match criteria. We can determine whether none of the features allows matching. For example when we use the (birth name, birth date) and tax number features, if both birth date and tax number are unknown, then it is needless to search matching tuples. It may be profitable to sort out these tuples from the input, or to extend DB-GER with an extra condition in line 2. We define matchable as an Rr → {true, f alse} function, that, if t ∈ Rr , satisfies true if ∃ t ∈ Rr : t ≈ t , matchable(t) = f alse else. We can use the same domain knowledge as for the match function to construct matchable. 3.6
Uncertainty
GER produces exact results, yet, if a domain expert constructs a match criteria, there are hidden confidences. For example two identities could describe the same person, if the birth name and birth date attributes are equal. While this rule is satisfactory in in most cases, corrupted records can still emerge after preprocessing. There may be exceptional cases that we do not handle and these kinds of errors cannot be eliminated perfectly. Models can be built with confidences on records as in [18], leading to a computationally harder problem. But we can also benefit from dealing with probabilities. We can construct conditions that match records according to a probability threshold, and we can make preliminary statistics of how a match function performs. Common RDBMSs provide us useful attribute types and indices supporting probability feature matches. For example, in PostgreSQL we can build GIS indices on geospatial locations. We can then efficiently evaluate match conditions such as “two buildings can be considered the same if the distance of their central point are in a range of 10 meters”. Supposing that b1 and b2 are such location attributes, the match condition can be expressed as b1 && Expand(b2 , 10) and distance sphere(b1 , b2 ) < 10. Here the && operator pre-filters the result based on an efficient GIS index. Other important examples of uncertain conditions with thresholds are string similarity searches such as matching very similar names. Most of the RDMBSs support string similarity searches with indices.
68
C.I. Sidl´ o
Approximate results in the insurance scenario can also be used to identify households or company hierarchies. We would like to find entities not explicitly present in the source data, but GER algorithms can still be applied easily. 3.7
Incremental Processing
The agglomerative style of R-Swoosh and DB-GER algorithms fits to the regular data warehouse refreshment policies. We can build an agglomerative delta-load process where only new records are processed in every refreshment cycle. I always contains RER(I) of the preceding records. This way we do not have to face huge data volumes in every refreshment round. As a special case, on-line event-driven refresh is also possible. 3.8
Mapping Source and Resolved Records
We would often like to store all input records and define the mapping between source and resolved records. For example after preprocessing we may store all source client records without merging as client versions. We build up RER(I) to compute exact aggregations, or to stream back resolved information to ERP systems. The RER(I) set contains exactly one matching record for an original source record in case of ICAR and merge domination: we select the single matching record from RER(I) for the original source record. In case of strong merge domination we can have more matching tuples in RER(I) for a given tuple. To find the dominant one we have to use all the information, we have to merge all matching tuples. The merged tuple is guaranteed to be in RER(I).
4
Experiments
All experiments were performed on a commodity PC with Intel Celeron 3.2 GHz CPU, 1 GB RAM and a 7200 RPM disk without RAID. We used Oracle 10g with data warehousing configuration set up to use 400 MB SGA memory. The logic of the DB algorithms was implemented in PL/SQL. We used only regular SQL functionality and regular B-tree indexes. No physical level or other special optimization was done. We implemented F-Swoosh [3] using Java 1.5, with hash set and hash table data structures from the standard library. F-Swoosh measurements were performed on a separate but identical hardware with Windows XP. Input data was not stored locally: input records were coming from the separate Oracle database, and results were written back. The execution times do not contain the cost of initial and final data transfer. Experimental real world dataset is provided by AEGON Hungary containing approximately 12 million distinct identity records of clients. Identities contain common attributes such as name, birth name, mother’s name, sex, birth date and place, external identifiers such as social security number or tax number. Attributes are cleaned and uniformized using the ETL facilities of the client data mart. Preliminary data cleansing included standardization and correction
Generic Entity Resolution in Relational Databases
69
džĞĐƵƚŝŽŶƚŝŵĞ;ŚŽƵƌƐͿ
ϱ ϭϬ
ϰ
ϭ ϯ
Ϭ͕ϭ
Ϯ
&Ͳ^ǁŽŽƐŚ Ͳ'Ͳ^ǁŽŽƐŚ ͲZͲ^ǁŽŽƐŚ Ͳ'Ͳ'Z Ͳ'Z
Ϭ͕Ϭϭ
ϭ
Ϭ͕ϬϬϭ Ϭ͕ϬϬϬϭ
Ϭ ϭ
ϭϬ
ϭϬϬ ϭϬϬϬ /ŶƉƵƚ^ŝnjĞ;<Ϳ
ϭϬϬϬϬ
ϭ
ϭϬ
ϭϬϬ ϭϬϬϬ /ŶƉƵƚ^ŝnjĞ;<Ϳ
ϭϬϬϬϬ
Fig. 1. Scalability of the algorithms (left: time in linear scale; right: time in log scale)
of attributes joining external databases, such as first name databases. We have chosen uniform match and merge functions verified by domain experts. We used the properties of (1) and strong merge domination. Yet, on our database only a few records conflicted with ICAR. We implemented G-Swoosh and R-Swoosh on relations (as DB-G-Swoosh and DB-R-Swoosh), DB-G-GER and DB-GER. Booth DB-G-Swoosh and DB-GGER employed a one-round duplicate elimination step. All algorithms used the same input and output schema. We measured execution times without the operations required to produce input data. The experiments were averaged from multiple executions in different orders to overcome caching and other performance issues beyond our control. Fig. 1 shows execution times of the algorithms against the size of input data. Naive database implementations of G-Swoosh and R-Swoosh scale with the input data poorly, Java F-Swoosh implementation performed worst. The main cause is that DB-Swoosh variants search for matching records more efficiently than the original linear search, and they use batched set-styled operations. Interestingly, DB-GER and DB-G-GER, DB-G-Swoosh and DB-R-Swoosh perform similar. This means that the role of the domination is not significant. When using instant dominated record removal, the cost of the required delete operations balances the cost of handling a larger I when eliminating dominated records at the end. The aggregated costs of duplicate elimination is depicted in Fig. 2. We also examined the impact of match selectivity on execution times. We fixed the input size at 50,000 and measured execution times against merges. We can run experiments with different match functions, but different functions have different evaluation times. Instead, we change the data set, and the match function stays the same: selectivity depends on the match function and both on the data set. With heuristics knowing how the match function works, we can select subsets containing more or less matching pairs. For example, we can increase the number of merges by selecting identities with birth dates of a given year. Figure 3 shows the execution times against the count of merged records
C.I. Sidl´ o
ŽŵŝŶĂƚĞĚ ĚͲĞůŝŵŝŶĂƚŝŽŶƚŝŵĞй
70
ϭϴ ϭϲ
Ͳ'Ͳ'Z
Ͳ'Z
ϭϰ ϭϮ ϭϬ ϴ ϲ ϰ Ϯ ϭ
ϭϬ
ϭϬϬ /ŶƉƵƚ^ŝnjĞ;<Ϳ
ϭϬϬϬ
ϭϬϬϬϬ
džĞĐƵƚŝŽŶƚŝŵĞ;ƐĞĐ͘Ϳ
Fig. 2. Percent of execution time needed to eliminate dominated records ϲϬ
ϲϬ
ϱϬ
ϱϬ
ϰϬ
ϰϬ
ϯϬ
ϯϬ
ϮϬ
ϮϬ
ϭϬ
ϭϬ
Ϭ
Ͳ'Ͳ'Z
Ͳ'Z
Ϭ Ϭ
ϮϬ ϰϬ DĞƌŐĞŽƵŶƚ;<Ϳ
ϲϬ
Ϭ
ϭϬ ϮϬ ϯϬ ŽƵŶƚŽĨůŝŵŝŶĂƚĞĚZĞĐŽƌĚƐ;<Ϳ
džĞĐƵƚŝŽŶƚƚŝŵĞ;ŚŽƵƌƐͿ
Fig. 3. Impact of match selectivity ϭϬ ϭ Ϭ͕ϭ Ͳ'ZͲŶŽŝŶĚĞdž Ͳ'ZͲĨƵůůŝŶĚĞdž Ͳ'ZͲŽƉƚŝŵŝnjĞƌ
Ϭ͕Ϭϭ Ϭ͕ϬϬϭ Ϭ͕ϬϬϬϭ ϭ
ϭϬ
ϭϬϬ /ŶƉƵƚ^ŝnjĞ;<Ϳ
ϭϬϬϬ
ϭϬϬϬϬ
Fig. 4. Performance of DB-GER with different indexing strategies
(the count of distinct records engaged in a merge operation), and against the count of eliminated records (the difference between input and output size). DBG-GER algorithm performs better for the interval under survey, caused by the high deletion costs of dominated records in every round of DB-GER.
Generic Entity Resolution in Relational Databases
71
We measured execution times of DB-GER with different indexing strategies (Fig. 4). Without indices we do not have to maintain additional structures, but we have to perform full-table scans. The other two variants utilized standard indices over the features. The ‘fullindex’ version was ordered to always use indices, the ‘optimizer’ version relies on the query optimizer to select an appropriate plan. The overall space cost of the indices (note that some feature indices could be omitted) were about 1.9 - 2.0 times the size of the table, with a composite index being the largest. This is a significant space cost, yet maintaining these indices may be a good tradeoff. The version without indices outperforms DB-Swoosh and F-Swoosh variations because of the new set-styled batched operations.
5
Related Work
Entity resolution problems have been studied in many different disciplines and names such as deduplication, record linkage, coreference resolution, merge/purge, duplicate record detection etc. The traditional approach uses similarity measures for attributes, and learns when two records can be resolved to the same entity. A survey of string similarity functions can be found in [10], along with a survey of basic duplicate detection algorithms. [13] presents a nice solution for implementing approximate string joins using q-grams in RDBMS environment. If we have training data, statistical or supervised learning methods can be applied, for example Bayes methods [15,11], decision trees [17] or SVM [7,9]. Unsupervised learning methods such as latent Dirichlet allocation [4] or clustering methods can be used. An interesting approach lying between the previous two is called active learning; we have some small set of training data, and the algorithm decides the points it could use the best to extend the training set ([19]). An automated training data selection method is described in [16]. The generic ER approach was formalized and solved with Swoosh-variations in [3]. The model and the algorithms are extended in [18] for handling approximate results as records with confidences. [2] adapts the algorithms to a distributed environment. ER is formalized many times as generating clusters of linked records. In the citation database problem where the goal is to identify authors we do not really have author attributes other than their name. We can however link these records by joint publications. This way ER can be seen as a special problem of linkmining; a survey containing link based entity resolution can be found in [12]. The approach is called relational ER, based on the relations between records, or collective ER, because we would like to resolve records based on the link graph as a whole. Other interesting approaches to ER includes utilizing aggregate constraints [8], or giving methods for query time ER [5]. More recently, [6] suggests a unified model for entity identification and document categorization. [20] widens the coreference problem with schema matching and canonicalization, and provides a unified model. The role of cross-field dependencies is described in detail in [14].
72
6
C.I. Sidl´ o
Conclusion and Future Work
Based on the generic ER formulation we developed new, practical postulates, enabling our DB-GER variations to perform significantly better than previous Swoosh algorithms. DB-GER also proved to be useful in practice. We did not examine whether we can express matching and merging of two sets of records. The main reason is that it could be expressed as a join between I and I , but only very simple, and thus fairly purposeless functions can be used that way, since these functions have to satisfy a property set stricter than ICAR. We think that standard SQL is flexible enough to build practical match and merge functions. The formal capture of the SQL match and merge function class remains however unclear. Our model enables the construction of match functions utilizing linkage information. Use of links is usually rewarding only when we deal with approximate results. The gap between probabilistic unsupervised learning methods and GER is small and in future work we plan to examine if these two separate approaches can be unified. DB-GER methods can also be parallelized as the RDBMS can provide efficient parallelization options. It would be interesting to see how the GER model can be modified to meet the criteria of database parallelization.
Acknowledgments To Andr´ as Vereczky and Zolt´ an Hans as domain experts on the AEGON Hungary side. To my colleagues P´eter Neumark and Csaba P´anc´elos. To Andr´as A. Bencz´ ur and Attila Kiss for offering valuable advices.
References 1. ISO-ANSI SQL-2 Database Language Standard, X3H2-92-154 (1992) 2. Benjelloun, O., Garcia-Molina, H., Gong, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-swoosh: A family of algorithms for generic, distributed entity resolution. In: ICDCS 2007: Proceedings of the 27th International Conference on Distributed Computing Systems, Washington, DC, USA, p. 37. IEEE Computer Society, Los Alamitos (2007) 3. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009) 4. Bhattacharya, I., Getoor, L.: A Latent dirichlet model for unsupervised entity resolution. In: SIAM International Conference on Data Mining, pp. 47–58 (2006) 5. Bhattacharya, I., Getoor, L., Licamele, L.: Query-time entity resolution. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 529–534 (2006) 6. Bhattacharya, I., Godbole, S., Joshi, S.: Structured entity identification and document categorization: two tasks with one joint model. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 25–33. ACM, New York (2008)
Generic Entity Resolution in Relational Databases
73
7. Bilenko, M., Mooney, R.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 39–48 (2003) 8. Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD 2007: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 437–448. ACM, New York (2007) 9. Christen, P.: Automatic record linkage using seeded nearest neighbour and support vector machine classification. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 151–159. ACM, New York (2008) 10. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 1–16 (2007) 11. Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969) 12. Getoor, L., Diehl, C.: Link mining: a survey. ACM SIGKDD Explorations Newsletter 7(2), 3–12 (2005) 13. Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 491–500 (2001) 14. Hall, R., Sutton, C., McCallum, A.: Unsupervised deduplication using cross-field dependencies. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 310–317. ACM, New York (2008) 15. Han, H., Xu, W., Zha, H., Giles, C.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the 2005 ACM symposium on Applied computing, pp. 1065–1069 (2005) 16. K¨ opcke, H., Rahm, E.: Training selection for tuning entity matching. In: QDB/MUD, pp. 3–12 (2008) 17. McCarthy, J., Lehnert, W.: Using decision trees for coreference resolution. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence, pp. 1050–1055 (1995) 18. Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB Workshop, pp. 25–32 (2006) 19. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–278 (2002) 20. Wick, M.L., Rohanimanesh, K., Schultz, K., McCallum, A.: A unified approach for schema matching, coreference and canonicalization. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 722–730. ACM, New York (2008)
Tool Support for the Design and Management of Spatial Context Models Nazario Cipriani1 , Matthias Wieland2 , Matthias Grossmann1, and Daniela Nicklas3 1
3
Universität Stuttgart, Institute of Parallel and Distributed Systems, Universitätsstraße 38, 70569 Stuttgart, Germany {cipriani,grossmann}@ipvs.uni-stuttgart.de 2 Universität Stuttgart, Institute of Architecture of Application Systems, Universitätsstraße 38, 70569 Stuttgart, Germany [email protected] Carl von Ossietzky Universität Oldenburg, Department for Computer Science, Escherweg 2, 26121 Oldenburg, Germany [email protected]
Abstract. A central task in the development of context-aware applications is the modeling and management of complex context information. In this paper, we present the NexusEditor, which eases this task by providing a graphical user interface to design schemas for spatial context models, interactively create queries, send them to a server and visualize the results. One main contribution is to show how schema awareness can improve such a tool: the NexusEditor dynamically parses the underlying data model and provides additional syntactic checks, semantic checks, and short-cuts based on the schema information. Furthermore, the tool helps to design new schema definitions based on the existing ones, which is crucial for an iterative and user-centric development of context-aware applications. Finally, it provides interfaces to existing information spaces and visualization tools for spatial data like GoogleEarth. Keywords: Data modeling and database design, Advanced database applications, XML and databases, Personalization in databases and information systems.
1 Introduction In the domain of context-aware computing, applications need information about their user’s situation to adapt their presentation, actions, or computation. Examples are location-based services [1], such as tourist guides [2,3], indoor and outdoor information systems [4,5] and smart environments [6], which present spatially selected information and services on mobile devices, often adapted to the user’s current situations. Adaption based on context is also done in smart environments, where everyday things, enhanced with embedded sensors and computing power, interact with the inhabitants, e.g. in smart homes. Furthermore, adaption on technical layers can be done based on context information, e.g., dynamically switch to the best available wireless network within mobile communication services. To ease the development of context-aware applications, the management of context information should not be done within the application, but J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 74–87, 2009. c Springer-Verlag Berlin Heidelberg 2009
Tool Support for the Design and Management of Spatial Context Models
75
within so-called context models [7], which can also be shared between different applications to further decrease the development costs [8]. In general, context models may contain geographical data (e.g., maps), dynamic sensor data (e.g., the position of a moving object), infrastructure data (e.g., the extent and bandwidth of wireless networks), and context-referenced digital information (e.g., documents or web sites that are relevant in a certain context). The creation of such context models is a tedious task. First, we have to model the context model schema, which specifies entities relevant for the application. Secondly, we have to provide the context model data, which represents the concrete instances of specified entities. The context model data can either be static, which means the data is entered once into the system and changed very seldom. Moreover, the data can be dynamic, which means it is sensed and inserted by sensors. Finally, the context of interest for the application (higher-level context) has often to be derived by rules or probabilistic learning approaches from other context model data (so-called lower level context). To reduce the burden of obtaining and maintaining such context models, it is beneficial to share the information between applications. The Nexus project1 aims at providing shared context models in an open, federated environment: autonomous data servers host different local context models, and a federation component provides an integrated view over those context models for the applications (the Augmented World Model—AWM) [9]. For this, we developed a spatial query language (Augmented World Query Language—AWQL) and a serialization and modeling language (Augmented World Modeling Language— AWML), which are used as interchange formats both by applications and by platform components. In this paper, we show how the NexusEditor can help in creating and maintaining both the context model schemas and the context model data. We already demonstrated a first prototype of the NexusEditor as demo at the MDM08 [10]. This first version only supported the management of context information and not the design of schemas. Now, we completed the NexusEditor, which now forms an integrated tool for the whole context data handling process. For better understanding of the ideas and concepts behind the NexusEditor, we present in this paper the overall architecture and usage of NexusEditor together with all other system components (Section 4.1). Furthermore, we present the newly introduced features of the NexusEditor: support for data schema modeling and schema extension for context models based on existing schemas (Section 4.2). The NexusEditor provides a graphical user interface to maintain spatial context models, interactively create queries and send them to a server, and visualize the results. A main feature of the NexusEditor is its schema awareness: it dynamically parses the underlying data model and provides additional syntactical checks, semantical checks, and shortcuts based on the context model schema. In addition, it shows features specific to the application domain of spatial data management, and how external systems like GoogleEarth can be integrated in such an environment. Beyond the data level, the NexusEditor also supports creating, editing, and validating context model schemas. 
This paper is structured as follows: after a discussion of the related work in Section 2, we shortly describe the Nexus platform and the Augmented World Model in Section 3, which serves as a central integration schema for the shared context models of 1
http://www.nexus.uni-stuttgart.de
76
N. Cipriani et al.
applications. In Section 4, we illustrate the main features of the NexusEditor and how it can ease the development of context models. Finally, Section 5 concludes the paper.
2 Related Work One of the most comprehensive approaches for developing context-aware applications is based on the Context Modeling Language (CML) and its associated software engineering framework [7]. It extends Object-Role Modeling (ORM) [11]—developed for conceptual modeling of databases—by features needed for context modeling. CML provides a graphical notation designed to support the software engineer in analyzing and formally specifying the context requirements of a context-aware application. It also extends the Rmap procedure of [11] to transform a conceptual schema automatically to a relational schema to manage the context information data in a database. While it is generally possible for applications to share parts of the CML context models, the approach was not intended to scale up for a large number of different applications, so the model does not offer direct support for schema evolution and name spaces. In addition, to our knowledge, there are no integrated tools available that allow the exploration of existing context schemas and context information on distributed servers as NexusEditor does. There are many tools on the market that could be used for the development of contextaware applications. For creating the context data model a UML designer like Borland Together2 or IBM Rational Rose3 can be used. For the XML editing, e.g., for creating concrete context data objects and inserting them into the system an XML editor like XMLSpy4 could be used. Testing the functionality of the system and the defined queries could be done using a Web Services testing tool like soapUI 5 . All those tools are generic solutions and hence are not adapted to the development of spatial, context-aware applications. By using the NexusEditor, the overhead of learning the usage of different tools is avoided. Furthermore, data exchange between different tools is often difficult or not possible at all. In contrast, the NexusEditor is an integrated solution supporting all tasks described above. The user only has to use one tool that integrates all context models, data instances, and interfaces to the context data management system.
3 Nexus Platform and the Augmented World Model The goal of the Nexus platform is to support a large variety of context-aware applications, such as location based information systems [12], mobile games [13], or even factory management applications [14], by providing a shared, global context model (the Augmented World Model). To achieve this, the platform federates local models, which are stored on Context Servers. The local context models typically contain some selected context information in a spatially restricted area (the service area). Sensors keep local context models up to date. Depending on the type of context data, providers can use 2 3 4 5
http://www.borland.com/us/products/together/ http://www.ibm.com/software/awdtools/developer/rose/ http://www.altova.com/products/xmlspy/xml_editor.html http://www.soapui.org/
Tool Support for the Design and Management of Spatial Context Models
77
different server implementations, e.g. RDBMS for static data like buildings or main memory systems for position data [8]. The Augmented World Model is based on data objects that are formed by attributes. In contrast to object-oriented programming, these data objects do not have methods or behavior. Objects are instances of one or more object types that define which attributes the object must have (mandatory attributes) and which attributes it can have (optional attributes). An object can have multiple attribute instances of the same type with different values, which, in conjunction with meta data like valid time, allows, e.g., the representation of value patterns like trajectories of moving objects [15]. The name, structure and basic data type of the attributes are defined in an attribute schema. A class schema imports an attribute schema and groups these attributes to object types. The object types in a class schema form an is-a hierarchy (inheritance): if an object type B inherits from an object type A, B has all attributes of A and can add new attributes. Required attributes of A must be required in B; optional attributes can stay optional in B or be defined as required by B. To integrate context data from different sources, we defined a so-called Standard Class Schema (consisting of a class schema and a corresponding attribute schema) that contains common object types needed by most context-aware applications. Furthermore, it is possible to extend attribute schemas and class schemas. An extended attribute schema can define new attribute names and structures. As long as it uses the basic data types, the components of the Nexus platform can still process attributes that are compliant to the extended attribute schema. An extended class schema can define new object types as long as they inherit directly or transitively from object types from the base class schema. With this, the Nexus platform can transform objects compliant to the extended class schema to objects of the base schema by omitting the additional attributes (Liskovs substitution principle [16]). This allows applications to use an object of any arbitrary extended type by its type in the base schema. For accessing the Augmented World Model, a spatial query language named Augmented World Query Language (AWQL) was developed. AWQL is defined in an XMLbased syntax and allows to select object by standard operators, such as equal, lesser, greater, like, spatial operators, such as within and overlaps, or temporal operators, such as temporalBefore or temporalIntersects. The selected objects can be filtered using the Nearest Neighbor statement (which selects the k nearest objects to a given position) and by attribute filters (to select only the attributes the application is interested in). AQWL exploits the object type hierarchy: if an application queries for objects of a certain type, objects that inherit from this type are also returned. An AWQL query also includes a list of extended class schemas the application is aware of. The Nexus platforms converts objects of types not contained in this list into types known to the application by upcasting them. For serializing Augmented World objects, the Augmented World Modeling Language (AWML) was defined. It is also based on XML, which allows the flexible serialization and deserialization of data by using of existing tools and parsers. 
However, the modeling requirements of the Augmented World Model show the limitations of current XML technology: since Augmented World objects can have multiple object types, the type information cannot be given in the meta data (i.e., the element names in the
78
N. Cipriani et al.
XML document), but we have to use generic objects that contain the type information in the data (i.e., the attribute named type). Hence, existing XML parsers cannot validate AWML regarding the class schema but only regarding the attribute schema.
4 NexusEditor: Support for the Design of Context-Models The NexusEditor supports the design of data schemas for context-aware applications in four different phases: schema modeling, query design, data provisioning, and testing. The goal of the NexusEditor is to provide an easy and intuitive way for creating domain specific schema extensions for context-aware applications, create the context data, formulate queries, and visualize the results. Additionally, it provides export functions to existing tools, e.g., GoogleEarth. To improve the usability, it exploits information about the data schema, i.e., the object and type definitions. This helps the user to create complex and still correct queries by providing immediate type checks and syntax checks. Fig. 1 shows a screenshot of the tool: in the left part A , a hierarchical representation of a Result Set is provided, which is the result of the inquired query send to the Nexus system (queryOpera_srs.awql). A Result Set consists of many Nexus Objects, which are composed of attributes and attribute parts. Depending on the selected entry on the left side, the right part B shows type specific features of the selection, e.g., in this case the geometrical data values defining the selected attribute extent belonging to the Nexus Object of type nscs:Building with name 15, etc.
Fig. 1. Screenshot of the NexusEditor (Result Set View)
Tool Support for the Design and Management of Spatial Context Models
79
A special feature of the NexusEditor is the presentation of the polygon of geometrical attributes in a preview area C rather than just show the raw values representing the polygon D . This facilitates the identification of the geometries and helps picking up the right one. A tool such as the NexusEditor facilitates and accelerates the application development, since developers and users do not need to learn different technologies to accomplish the task. E.g., users do not need to know about XML and XML schema to create a Nexus Object. Furthermore, such tools offer instant validation and at the same time reduce sources of error. If we compare the NexusEditor interface to a fragment of the XML representation of the result set in Listing 1.1, it is obvious that the usage of a highly integrated tool, such as the NexusEditor, disburdens the user from unhandy data structures and additionally reduces the probability of error. [ ..... ] <nsas:nol> <nsas:value> nexus:http://nemesis.informatik.uni-stuttgart.de:8082/ erspase/seife||0x834bc98b6d5511d9bc4c080020a23633/0 xaee598ebb5fb11d7be0c080020a23633 <nsas:type> <nsas:value> nscs:Building <nsas:extent> <nsas:value> 48.743668,9.097413 48.743664,9.097495 48.743645,9.0974929 48.743642,9.0975469 48.743676,9.0975499 48.743672,9.097532 48.7436851,9.097527 48.7436831,9.0975141 48.743733,9.097497 48.74373,9.097482 48.743762,9.097472 48.7437701,9.097528 48.74378,9.097525 48.74379,9.097589 48.743752,9.0976031 48.743754,9.097617 48.7437111,9.0976329 48.743707,9.0976049 48.7436909,9.097611 48.743535,9.097538 48.743533,9.097585 48.7435719,9.097588 48.7435671,9.097709 48.743474,9.097701 ... some coordinates omitted ... 48.743552,9.097154 48.7435429,9.097363 48.7436049,9.0973701 48.743609,9.0972841 48.74359,9.0972821 48.743594,9.097195 48.7435739,9.097193 48.7435769,9.0971411 48.7436049,9.0971439 48.7436049,9.097153 48.743637,9.097156 48.743636,9.0971749 48.7436551,9.0971769 48.7436531,9.0972059 48.743667,9.097208 48.743664,9.0972581 48.743672,9.0972581 48.74367,9.0972899 48.7436629,9.0972899 48.7436609,9.0973199 48.743622,9.0973161 48.74362,9.097362 48.743651,9.097364 48.743649,9.09741 48.743668,9.097413
80
N. Cipriani et al.
<nsas:kind> <nsas:value> real <nsas:name> <nsas:value> 15 <nsas:pos> <nsas:value> 48.743668,9.097413 [ ..... ]
Listing 1.1. XML Fragment for Result Set in AWML Format
4.1 Architecture The embedding of the NexusEditor in the overall scenario is depicted in Fig. 2. Three main components can be identified. First, the Nexus Platform provides a global contextmodel. Secondly, the context-aware applications users can access the previously modeled and integrated context-data through the Nexus platform. And thirdly, the NexusEditor is used to integrate domain specific data within the context-model. It facilitates the process of modeling, creating and integrating the data in the Nexus platform. To do so, it provides in a first stage the possibility to evolve the existing context-model by developing so-called Extended Schemas. In a second stage, the developer is able to model the context model data and compose queries to test the created context data. This is very useful to check whether the integration process was successful. In addition, it helps on developing the context-aware application afterwards by providing an easy way to test the context queries to be integrated in the application. Actors: The Nexus expert develops, maintains and provides the Nexus platform 1 . He also develops and provides the NexusEditor 2 that facilitates the integration of domain specific data, which is then used by the Domain Experts to facilitate the integration of their own context-data 3 and development of their context-aware applications 6 . The domain expert has special knowledge regarding a certain domain. A domain expert can be a data provider who wants to integrate the data within the Nexus platform. To do so, she has to first model the extended schema for the data to be integrated in the Schema Modeling Phase 4 , and request the Nexus expert for an update of the AWM. In the data provisioning phase 5 , the domain expert can model context-data she wants to integrate and compose queries and even display the result to check whether errors occurred or not. Finally, this domain specific context-data is used by application users 7 employing previously developed context-aware applications.
Tool Support for the Design and Management of Spatial Context Models
81
Fig. 2. NexusEditor Embedding
Systems: The context-aware application is installed by the user used on her mobile device. The application helps the user by adapting its behavior to the context the user is in, e.g., a tourist guide that tells the user about the places of interest nearby. For doing so, the context-aware application has to exchange data and interact with the Nexus Platform for retrieving the context data and context events. The Nexus Platform manages the context data. For that, it has to know the context model schemas. Then, the context model data has to be inserted. Afterwards, it can answer context queries to external systems. The NexusEditor is used at development time by the domain expert to create queries for the context-aware application, and new schemes and static context data for the Nexus Platform. 4.2 NexusEditor Functions In the following, the main features the NexusEditor are explained more in detail. The NexusEditor is an integrated tool that features functions like schema-aware data creation, schema evolution and extension, query formulation and invocation, and result presentation with the built-in functions or by using advanced visualization tools like GoogleEarth. Schema Awareness: As already mentioned, the NexusEditor reads the schema of the underlying context model and therefore knows (and checks), which data objects are valid, which attributes they may have, and whether they are required or optional. When
82
N. Cipriani et al.
a user creates a new data object, the NexusEditor offers a list of available object types or the current schema. The same holds for adding or changing attributes or attribute values within objects. In addition, the correct data type of the attributes (e.g., string, polygon, or number) is checked, and appropriate editing modes are offered. For example, for inserting and changing spatial information, a graphical tool exists, which also allows to display a satellite picture as visual reference for the geometries. The NexusEditor can read both the AWM Standard Class Schema and additional extended class schemas, and distinguishes between those definitions by using the correct namespaces of the XML Schemas. Schema Modeling and Integration of Geo-Spatial Data: Extending the existing schema information using the NexusEditor is very easy. As can be seen in Fig. 3, the schema browser is activated. That function is used to browse for already existing ob ject models (classes) 1 . In this example, we browse a class schema containing infor mation about a smart factory environment. The Tool class is selected 2 . A Tool is geo-referenced and has a certain position at a certain point in time. The advantage of geo-referenced tools is that overstocking of tools can be avoided and they can be localized and thereby found easily by the workers [14]. A more detailed view (telling the arbitrary and optional attributes it has) of the selected class is provided in the bottom right corner of the GUI 3 . By simply clicking Create subclass 4 , a new class ScrewTool is created, inheriting all superclass attributes. A new tabbed window opens as can
Fig. 3. NexusEditor Class Browser
Fig. 4. NexusEditor Class Composer
be seen in Fig. 4. Now, the newly created class can be enriched with additional optional and required attributes (5). Here, only non-conflicting attributes can be added, i.e., attributes that are not already assigned to the ScrewTool class to be created or to one of its superclasses. This increases productivity, since potential conflicts are recognized and can be eliminated at design time. Furthermore, it is possible to add already existing classes to the list of superclasses of the ScrewTool class (6). In this case, the attributes from these classes are also inherited by the current class. Afterwards, the ScrewTool class can be assigned to an existing schema, or a new one can be created (7). Once the schema information is created, a request to update the schema information in the Nexus Platform is generated. The schema is sent to the Nexus expert, who makes the information available within the context model. Now, the context data can be modeled by the domain expert, using the same view as for result set browsing (Fig. 1). Once modeled, the context data can be sent directly to the Nexus platform and inserted on a context server with just one click. Then the data can immediately be used by context-aware applications.
Interaction with Context Servers: The Nexus platform provides specialized context server implementations for different types of data [8]. For static data, such as buildings or roads, we use a relational database server. The database schema for this server is based on the decomposed storage model approach [17]. Fig. 5 shows the
Fig. 5. ER Diagram for the Database Schema of a Context Server
simplified ER diagram for the schema. The three entities at the top partially represent the AWM schema, the remaining entities the data. The database schema reflects some characteristic features of the AWM schema, particularly the representation of object types as attribute values. On the other hand, the database schema is independent of the concrete class schemas and attribute schemas. Consequently, deploying a new extended class or attribute schema on a server, e.g., to support a factory management application, only requires new entries in the ObjectType, AttributeType, and AttributePartType tables, but no changes to the database schema. However, since there are no relationships between ObjectType, AttributeType, and AttributePartType, information on the type hierarchy and on optional attributes cannot be represented in the database. The database schema could be extended accordingly without problems, but the corresponding checks are actually done in Java before inserting objects into the database, so representing that information in the database is not necessary.
Visualizing Geo-spatial Data: To visualize geo-spatial data (e.g., the results of a context query), the NexusEditor offers two different ways: internal data visualization and external data visualization.
Internal data visualization: Once a query is created, it can be sent over a SOAP interface to a given server endpoint (e.g., a context server or a federation node in the Nexus platform). The NexusEditor first displays the response in plain XML format to the user. The response can then be parsed by the NexusEditor, which displays it in a tree view (see Fig. 1 (A)). The user can now browse through this representation. When an object or attribute value is selected by the user, the right side of the window shows appropriate information about the selected part (e.g., a graphical representation), based on its type from the corresponding schema. Data objects can be added, removed, or modified in this view. Additionally, the whole result set can be viewed in a map representation that draws the extent (if available) of all objects of the result set. To visualize these objects in their spatial environment, a background picture can be loaded and mapped to the coordinates
Fig. 6. Visualization with GoogleEarth
of the result set objects. With this, the user can match a given spatial data set to a satellite image or digitized map, and use this to correct errors or insert missing objects.
External data visualization: The NexusEditor also offers a bridge to existing tools like GoogleEarth (http://earth.google.com): a predefined data set that comes from the GoogleEarth server can be augmented by user-defined data using the Keyhole Markup Language (KML). KML is an XML-derived format that allows exchanging a list of points of interest (so-called placemarks). These placemarks can be structured into folders and documents, but normally only have a name, a description, and geometric and style information. The schema for placemarks can be extended with user elements. Since GoogleEarth is used as a visualization tool and not for storing or querying information, the NexusEditor puts all attribute values in a human-readable format into the description element of the placemark, so that the complete information is shown when the user selects an object in GoogleEarth. As the Nexus platform and GoogleEarth support the same basic geometric elements (points, lines, polygons, and collections of these elements), the spatial attribute values can easily be exported from AWML to KML. However, to improve the visual impression of buildings that are only stored with their outline information, all polygon elements contained as attribute values in the exported objects are given a user-defined extrusion height to appear as 3D objects. This shows how context information from the Augmented World Model can be embedded in the globally available spatial data sets of GoogleEarth, which provides a nice environment with a powerful visualization tool (see Fig. 6).
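To make the export concrete, the following Python sketch builds a KML placemark of the kind described above. The object layout (name, attributes, outline) and the default extrusion height are assumptions made for this illustration; this is not the actual AWML structure or NexusEditor code, and XML escaping of attribute values is omitted.

```python
def awm_object_to_kml_placemark(obj, extrude_height=20.0):
    """obj is assumed to look like:
    {"name": "...", "attributes": {...}, "outline": [(lon, lat), ...]}"""
    # all attribute values go into the human-readable description element
    description = "\n".join(f"{k}: {v}" for k, v in obj["attributes"].items())
    # close the outline ring and lift it to the extrusion height for a 3D look
    ring = obj["outline"] + [obj["outline"][0]]
    coords = " ".join(f"{lon},{lat},{extrude_height}" for lon, lat in ring)
    return (
        "<Placemark>"
        f"<name>{obj['name']}</name>"
        f"<description>{description}</description>"
        "<Polygon><extrude>1</extrude>"
        "<altitudeMode>relativeToGround</altitudeMode>"
        "<outerBoundaryIs><LinearRing>"
        f"<coordinates>{coords}</coordinates>"
        "</LinearRing></outerBoundaryIs></Polygon>"
        "</Placemark>"
    )
```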
Testing with Ad-hoc Queries or Queries by Example: Basically, there are two ways to create an AWQL query: ad hoc queries (created directly by the user) and queries by example (created from a given data object). Additionally, queries can be loaded from a file.
Ad Hoc Queries: The NexusEditor supports all features of AWQL described in Section 3. When the user inserts an operator, the tool checks for the correct reference values: for example, when inserting an overlap operator to restrict the result set to a given area (also known as a window query), the reference value has to be a valid geometry (e.g., a polygon). For this, the NexusEditor again offers a graphical input tool.
Query by Example: To create a query by example, the user can select or create a data object as a template. The NexusEditor can then create a query that contains all attribute values of that template object as restriction parameters. The user can now easily modify this generated query by removing some of these restrictions to select objects that are similar to the template object in the remaining aspects.
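As a rough illustration of the query-by-example idea only, the Python sketch below turns a template object into a list of attribute restrictions that the user can then prune; the object layout and the function name are assumptions, since the concrete AWQL syntax is not reproduced here.

```python
def restrictions_from_template(template, drop=()):
    """Turn the template object's attribute values into equality restrictions;
    attributes listed in `drop` are removed to broaden the query."""
    return [(name, value)
            for name, value in template["attributes"].items()
            if name not in set(drop)]

# hypothetical usage
template = {"type": "ScrewTool", "attributes": {"manufacturer": "ACME", "weight": 1.2}}
print(restrictions_from_template(template, drop=["weight"]))  # [('manufacturer', 'ACME')]
```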
5 Conclusion
In this paper, we presented the NexusEditor, which provides the complete spectrum of support needed for the design of spatial databases and the management of context models: support for schema design of newly created context models, for the extension of existing schemas for use in new domains, and for the deployment of the schemas to a spatial database. After this setup procedure, the NexusEditor supports the usage of the spatial database by providing functions for modeling context data and inserting it into the spatial database. Furthermore, the NexusEditor allows testing and visualizing queries needed for the development of context-aware applications. The main advantage of the NexusEditor is that all support functions are integrated in one tool, which helps to avoid errors and to speed up the work. Additionally, the schema awareness of the NexusEditor helps to avoid syntactic errors in the modeled context data. A first version of the NexusEditor was demonstrated at MDM 2008 [10]. The new functions described in this paper are now implemented and frequently used by students and researchers in the Nexus project for the development of context models and context-aware applications. Compared to more generic, conventional XML editors or modeling tools, the NexusEditor supports additional features specific to the Nexus platform, e.g., geometries can be created and edited, and multi-types can be handled. Domain experts are relieved of directly editing XML files. In addition, the NexusEditor includes an interface to the Nexus platform, so that context queries can be sent directly to the platform and the result can be evaluated in the NexusEditor. Finally, the tool has an interface to GoogleEarth in order to visualize Augmented World objects in a larger geographical context.
Acknowledgments This work was partially funded by the Collaborative Research Center Nexus: Spatial World Models for Mobile Context-Aware Applications (grant SFB 627).
References 1. Schilit, B.N., Adams, N.I., Want, R.: Context-aware computing applications. In: 1st IEEE Workshop on Mobile Computing Systems and Applications, pp. 85–90. IEEE Computer Society Press, Los Alamitos (1994) 2. Cheverst, K., Davies, N., Mitchell, K., Friday, A., Efstratiou, C.: Developing a context-aware electronic tourist guide: some issues and experiences. In: CHI, pp. 17–24 (2000) 3. Abowd, G.D., Atkeson, C.G., Hong, J.I., Long, S., Kooper, R., Pinkerton, M.: Cyberguide: A mobile context-aware tour guide. Wireless Networks 3(5), 421–433 (1997) 4. Conner, W.S., Krishnamurthy, L., Want, R.: Making everyday life easier using dense sensor networks. In: Abowd, G.D., Brumitt, B., Shafer, S.A. (eds.) UbiComp 2001. LNCS, vol. 2201, pp. 49–55. Springer, Heidelberg (2001) 5. Leonhardi, A., Kubach, U., Rothermel, K., Fritz, A.: Virtual information towers-a metaphor for intuitive, location-aware information access in a mobile environment. In: ISWC, pp. 15– 20 (1999) 6. Kidd, C.D., Orr, R., Abowd, G.D., Atkeson, C.G., Essa, I.A., MacIntyre, B., Mynatt, E.D., Starner, T., Newstetter, W.: The Aware Home: A living laboratory for ubiquitous computing research. In: Streitz, N.A., Siegel, J., Hartkopf, V., Konomi, S. (eds.) CoBuild 1999. LNCS, vol. 1670, pp. 191–198. Springer, Heidelberg (1999) 7. Henricksen, K., Indulska, J.: A software engineering framework for context-aware pervasive computing. In: PerCom, pp. 77–86. IEEE Computer Society, Los Alamitos (2004) 8. Großmann, M., Bauer, M., Hönle, N., Käppeler, U.P., Nicklas, D., Schwarz, T.: Efficiently managing context information for large-scale scenarios. In: PerCom, pp. 331–340. IEEE Computer Society, Los Alamitos (2005) 9. Nicklas, D., Großmann, M., Schwarz, T., Volz, S., Mitschang, B.: A model-based, open architecture for mobile, spatially aware applications. In: Jensen, C.S., Schneider, M., Seeger, B., Tsotras, V.J. (eds.) SSTD 2001. LNCS, vol. 2121, pp. 117–135. Springer, Heidelberg (2001) 10. Nicklas, D., Neumann, C.: NexusEditor: A schema-aware graphical user interface for managing spatial context models. In: Meng, X., Lei, H., Grumbach, S., Leong, H.V. (eds.) MDM, pp. 213–214. IEEE, Los Alamitos (2008) 11. Halpin, T.A.: Information modeling and relational databases: from conceptual analysis to logical design. Morgan Kaufman, San Francisco (2001) 12. Nicklas, D., Großmann, M., Schwarz, T.: NexusScout: An advanced location-based application on a distributed, open mediation platform. In: VLDB, pp. 1089–1092 (2003) 13. Nicklas, D., Hönle, N., Moltenbrey, M., Mitschang, B.: Design and implementation issues for explorative location-based applications: The NexusRallye. In: Iochpe, C., Câmara, G. (eds.) GeoInfo, INPE, pp. 167–181 (2004) 14. Westkaemper, E., Jendoubi, L., Eissele, M., Ertl, T.: Smart Factory—bridging the gap between digital planning and reality. In: Proceedings of the 38th CIRP International Seminar on Manufacturing Systems, pp. 44–44. CIRP (2005) 15. Hönle, N., Käppeler, U.P., Nicklas, D., Schwarz, T., Großmann, M.: Benefits of integrating meta data into a context model. In: PerCom Workshops, pp. 25–29. IEEE Computer Society, Los Alamitos (2005) 16. Liskov, B., Wing, J.M.: A behavioral notion of subtyping. ACM Trans. Program. Lang. Syst. 16(6), 1811–1841 (1994) 17. Copeland, G.P., Khoshafian, S.: A decomposition storage model. In: Navathe, S.B. (ed.) SIGMOD Conference, pp. 268–279. ACM Press, New York (1985)
Efficient Set Similarity Joins Using Min-prefixes
Leonardo A. Ribeiro and Theo Härder
AG DBIS, Department of Computer Science, University of Kaiserslautern, Germany
{ribeiro,haerder}@informatik.uni-kl.de
Abstract. Identification of all objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile to represent a variety of similarity notions. Most set similarity join methods proposed so far present two main phases at a high level of abstraction: candidate generation producing a set of candidate pairs and verification applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose an opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups as compared to previous algorithms.
1 Introduction
Similarity joins pair objects from a dataset whose similarity is not less than a specified threshold; the notion of similarity is mathematically approximated by a similarity function defined on the collection of relevant features representing two objects. This is a core operation for many important application areas including data cleaning [1,5], text data support in relational databases [6], Web indexing [3], social networks [10], and information extraction [4]. Several issues make the realization of similarity joins challenging. First, the objects to be matched are commonly sparsely represented in very high dimensions; text data is a prominent example. It is well known that indexing techniques based on data-space partitioning achieve only little improvement over a simple sequential scan at high dimensionality. Moreover, many domains involve very large datasets; therefore, scalability is a prime requirement. Finally, the concept of similarity is intrinsically application-dependent. Thus, a general-purpose similarity join realization has to support a variety of similarity functions [5]. Recently, set similarity joins gained popularity as a means to tackle the issues mentioned above [9,5,1,13]. The main idea behind this special class of similarity joins is to view operands as sets of features and employ a set similarity function
to assess their similarity. An important property is that predicates containing set similarity functions can be expressed by set overlap [9,5]. Several popular measures belong to the general class of set similarity functions, including Jaccard, Hamming, and Cosine. Moreover, even when not representing a similarity function on its own, set overlap constraints can still be used as an effective filter for metric distances such as the string edit distance [6,11]. Most set similarity join algorithms are composed of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the generated candidates and returns the correct answer. Recently, Xiao et al. [13] improved the previous state-of-the-art similarity join algorithm due to Bayardo et al. [2] by pushing the overlap constraint checking into the candidate generation phase. To reduce the number of candidates even more, the authors proposed the suffix filtering technique, where a relatively expensive operation is carried out before qualifying a pair as a candidate. As a result, the number of candidates is substantially reduced, often to the same order of magnitude as the result set size. In this paper, we propose a new index-based algorithm for set similarity joins. Our work builds upon the previous work of [2] and [13]; however, we follow an opposite approach to that of [13]. Our focus is on decreasing the computational cost of candidate generation instead of on reducing the number of candidates. For this, we introduce the concept of min-prefix, a generalization of the prefix filtering concept [5]. Min-prefix allows us to dynamically keep the length of the inverted lists at a minimum, and therefore the candidate generation time is drastically decreased. We address the increase in the workload of the verification phase, a side effect of our approach, by interrupting the computation of candidate pairs that will not meet the overlap constraint as early as possible. Finally, we improve the overlap score accumulation by avoiding the overhead of dedicated data structures. We experimentally demonstrate that our algorithm consistently outperforms previous ones for unweighted and weighted sets.
2 Preliminaries
Given a finite universe U of features and a database D of sets, where every set consists of a number of features from U 1, let sim(x1, x2) be a set similarity function that maps a pair of sets x1 and x2 to a number in [0, 1]. We assume the similarity function is commutative, i.e., sim(x1, x2) = sim(x2, x1). Given a threshold γ, 0 ≤ γ ≤ 1, our goal is to identify all pairs (x1, x2), x1, x2 ∈ D, which satisfy the similarity predicate sim(x1, x2) ≥ γ. We focus on a general class of set similarity functions, for which the similarity predicate can be equivalently represented as a set overlap constraint of the form |x1 ∩ x2| ≥ minoverlap(x1, x2), where minoverlap(x1, x2) is a function that maps the constant γ and the sizes of x1 and x2 to an overlap lower bound (overlap bound, for short). Hence, the similarity join problem is reduced to a set overlap problem, where all pairs whose overlap is not less than minoverlap(x1, x2) are returned. 1
In Sect. 5, we consider weighted sets where features have associated weights.
Table 1. Set similarity functions

Function   Definition                          minoverlap(x1, x2)                 [minsize(x), maxsize(xc)]
Jaccard    |x1 ∩ x2| / |x1 ∪ x2|               (γ / (1 + γ)) · (|x1| + |x2|)      [γ · |x|, |x| / γ]
Dice       2 · |x1 ∩ x2| / (|x1| + |x2|)       (γ / 2) · (|x1| + |x2|)            [(γ / (2 − γ)) · |x|, ((2 − γ) / γ) · |x|]
Cosine     |x1 ∩ x2| / √(|x1| · |x2|)          γ · √(|x1| · |x2|)                 [γ² · |x|, |x| / γ²]
This set overlap formulation gives rise to several optimizations. First, it is possible to derive size bounds. Intuitively, observe that |x1 ∩ x2 | ≤ |x1 | for |x2 | ≥ |x1 |, i.e., set overlap and, therefore, similarity are trivially bounded by |x1 |. By carefully exploiting the similarity function definition, it is possible to derive tighter bounds allowing immediate pruning of candidate pairs whose sizes differ enough. Table 1 shows the overlap constraint and the size bounds of the following widely-used similarity functions [9,1,8,12,13]: Jaccard, Dice, and Cosine. An important observation is that, for all similarity functions, minoverlap (x1 , x2 ) increases monotonically with one or both set sizes. Another optimization technique instigated by the set overlap abstraction is the prefix filtering concept [5]. The idea is to derive a new overlap constraint to be applied on subsets of the operand sets. More specifically, for any two sets x1 and x2 under a same total order O, if |x1 ∩ x2 | ≥ α, the subsets consisting of the first |x1 |−α+1 elements of x1 and the first |x2 |−α+1 elements of x2 must share at least one element [9,5]. We refer to such subsets as prefix filtering subsets, or simply prefixes, when the context is clear; further, let pref (x) denote the prefix of a set x, i.e., pref (x) is the subset of x containing the first |pref (x)| elements according to the ordering O. It is easy to see that, for α = minoverlap (x1 , x2 ), the set of all pairs (x1 , x2 ) sharing a common prefix element is a superset of the correct result. Thus, one can identify matching candidates by examining only a fraction of the original sets. The exact prefix size is determined by minoverlap (x1 , x2 ), which varies according to each matching pair. Given a set x1 , a question is how to determine |pref (x1 )| such that it suffices to identify all x2 , such that |x1 ∩ x2 | ≥ minoverlap (x1 , x2 ). Clearly, we have to take the largest prefix in relation to all x2 . Because the prefix size varies inversely with minoverlap (x1 , x2 ), |pref (x1 )| is largest when |x2 | is smallest (recall that minoverlap (x1 , x2 ) increases monotonically with |x2 |). The smallest possible size of x2 , such that the overlap constraint can be satisfied, is minsize (x1 ). Let maxpref (x) denote the largest prefix of x; thus, |maxpref (x)| = |x| − minsize (x) + 1. A specific feature ordering can be exploited to improve performance in two ways. First, we rearrange the sets in D according to a feature frequency ordering, Of , to obtain sets ordered by increasing frequencies. The idea is to minimize the number of sets agreeing on prefix elements and, in turn, candidate pairs
Algorithm 1. The ppjoin algorithm
Input: A set collection D sorted in increasing order of the set size; each set is sorted according to the total order Of; a threshold γ
Output: A set S containing all pairs (xp, xc) such that sim(xp, xc) ≥ γ
 1  I1, I2, . . . , I|U| ← ∅, S ← ∅
 2  foreach xp ∈ D do
 3      M ← empty map from set id to (os, i, j)                       // os = overlap score
 4      foreach fi ∈ maxpref(xp) do                                   // candidate generation phase
 5          Remove all (xc, j) from If s.t. |xc| < minsize(xp)
 6          foreach (xc, j) ∈ If do
 7              M(xc) ← (M(xc).os + 1, i, j)
 8              if M(xc).os + min(rem(xp, i), rem(xc, j)) < minoverlap(xp, xc)
 9                  M(xc).os ← −∞                                     // do not consider xc anymore
10      S ← S ∪ Verify(xp, M, γ)                                      // verification phase
11      foreach fi ∈ midpref(xp) do
12          If ← If ∪ {(xp, i)}
13  return S
by shifting lower frequency features to the prefix positions. Second, because Of imposes an ordering on the elements of a set x, we can use the positional information of a common feature between two sets to quickly verify whether or not there are enough remaining features in both sets to meet a given threshold (see [13], Lemma 1). Given a set x = {f1, . . . , f|x|}, let rem(x, i) denote the number of features following the feature fi in x; thus, rem(x, i) = |x| − i. A further optimization consists of sorting the database D in increasing order of the set sizes. By exploiting this ordering, one can ensure that x1 is only matched against x2 such that |x1| ≤ |x2|. As a result, the prefix size of x can be reduced: instead of maxpref(x), we obtain a shorter prefix by using minoverlap(x, x) to calculate the prefix size. Let midpref(x) denote the prefix of x for sorted input; therefore |midpref(x)| = |x| − minoverlap(x, x) + 1. We are now ready to present a "baseline" algorithm for set similarity joins. Algorithm 1 shows ppjoin [13], a state-of-the-art, index-based algorithm that comprises all optimizations previously described. The top-level loop of ppjoin scans the dataset D, where, for each set xp, a candidate generation phase delivers a set of candidates by probing the index with the feature elements of maxpref(xp) (lines 4–9). We call the set xp, whose features are used to probe the index, a probing set; any set xc that appears in the scanned inverted lists is a candidate set of xp. Besides the accumulated overlap score, the hash-based map M also stores the feature positional information of xp and xc (line 7). In the verification phase, the probing set and its candidates are checked against the similarity predicate and those pairs satisfying the predicate are added to the result set (line 10); we defer details about the Verify procedure to Sect. 4.1. Finally, a pointer to set xp is appended to each inverted list If associated with the features of midpref(xp) (lines 11–12). Note that the algorithm also indexes the
feature positional information, which is needed for checking the overlap bound (line 8). Additionally, the algorithm employs the lower bound of the set size to dynamically remove sets from inverted lists (line 5).
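To make the quantities introduced in this section concrete, the following Python sketch implements the Jaccard bounds of Table 1 and a stripped-down prefix-filtering self-join in the spirit of the baseline just described; it omits ppjoin's positional filtering and dynamic size-based list pruning, and rounding the bounds up to the next integer is our assumption (safe, since overlaps are integral). It is an illustration, not the authors' implementation.

```python
import math
from collections import defaultdict

def minoverlap(t, s1, s2):
    # Jaccard overlap bound from Table 1
    return math.ceil(t / (1 + t) * (s1 + s2))

def maxpref_len(t, s):
    # probing prefix: |x| - minsize(x) + 1 with minsize(x) = t * |x|
    return s - math.ceil(t * s) + 1

def midpref_len(t, s):
    # indexing prefix for size-sorted input: |x| - minoverlap(x, x) + 1
    return s - minoverlap(t, s, s) + 1

def jaccard_self_join(sets, t):
    """sets: token lists, each sorted by a global token order, and sorted by length.
    Returns all id pairs whose Jaccard similarity is at least t."""
    index, result = defaultdict(list), []
    for pid, x in enumerate(sets):
        candidates = set()
        for i in range(maxpref_len(t, len(x))):          # probe with maxpref features
            candidates.update(cid for cid, _ in index[x[i]])
        for cid in candidates:
            c = sets[cid]
            if len(set(x) & set(c)) >= minoverlap(t, len(x), len(c)):  # naive verification
                result.append((cid, pid))
        for i in range(midpref_len(t, len(x))):          # index only the midpref features
            index[x[i]].append((pid, i))
    return result
```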
3 Min-prefix Concept
In this section, we first empirically show that the number of generated candidates can be highly misleading as a measure of runtime efficiency. Motivated by this observation, we introduce the min-prefix concept and propose a new algorithm that focuses on minimizing the computational cost of candidate generation.
3.1 Candidate Reduction vs. Runtime Efficiency
Most set similarity join algorithms operate on shorter set representations in the candidate generation phase (e.g., prefixes and signatures), followed by a potentially more expensive stage where a thorough verification is conducted on each candidate. Accordingly, previous work has primarily focused on candidate reduction, where increased effort is dedicated to candidate generation to achieve stronger filtering effectiveness. In this vein, an intuitive approach consists of moving part of the verification into candidate generation. For example, we can generalize the prefix filtering concept to subsets of any size: (|x| − α + c)-sized prefixes must share at least c features. This idea has already been used for related similarity operations, but in different algorithmic frameworks [8,4]. Let us examine this approach applied to ppjoin. We can easily swap part of the workload between verification and candidate generation by increasing feature indexing from midpref(x) to maxpref(x) (Alg. 1, line 11). We call this version u-ppjoin, because it exactly corresponds to a variant of ppjoin for unordered datasets. Although u-ppjoin considers more sets for candidate generation, a larger number of candidate sets are pruned by the overlap constraint (Alg. 1, lines 8–9). Figure 1 shows the results of both algorithms w.r.t. the number of candidates and runtime for varying Jaccard thresholds on a 100K sample taken from the DBLP dataset (details about the datasets are given in Sect. 6). As we see in Fig. 1a, u-ppjoin indeed reduces the number of candidates, especially for lower similarity thresholds, thereby reducing the verification workload 2. However, the runtime results shown in Fig. 1b are reversed: u-ppjoin is considerably slower than ppjoin. Similar results were reported by Bayardo et al. [2] for the unordered version of their All-pairs algorithm. We also observed identical trends on several other real-world datasets as well as for different growth patterns of feature indexing. These results reveal that, at least for inverted-list-based algorithms, candidate set reduction alone is a poor measure of the overall efficiency. Moreover, they suggest that the trade-off of workload shift between candidate generation and verification can be exploited in the opposite way.
Actually, the verification workload is even more reduced than suggested by number of candidates. Due to the increased overlap score accumulation in the candidate generation, many more candidates are discarded at the very beginning of Verify.
(a) No. of candidates: Jaccard on DBLP
(b) Runtime efficiency: Jaccard on DBLP
Fig. 1. Number of candidates vs. runtime efficiency
3.2 Min-prefix Concept
A set xc is indexed by appending a pointer to the inverted lists associated with the features fj ∈ midpref(xc), which results in an indexed set, denoted by I(xc); accordingly, let I(xc, fj) denote a feature fj ∈ xc whose associated list has a pointer to xc. A list holds a reference to xc until being accessed by a probing set with size |xp| > maxsize(xc), when this reference is eventually removed by size bound checking (Alg. 1, line 5). We call the interval between the processing of the set following xc in DB sort order and the last set with size less than or equal to maxsize(xc) the validity window of I(xc). Within its validity window, any appearance of I(xc) in lists accessed by a probing set either elects I(xc) as a new candidate (if it is the first appearance thereof) or accumulates its overlap score. As previously mentioned, the exact (and minimal) size of pref(xc) is determined by the lower bound of pairwise overlaps between xc and a reference set xp. As our key observation, the minimal size of pref(xc) monotonically decreases along the validity window of I(xc) due to dataset pre-sorting. Hence, as the validity window of xc is processed, an increasing number of the indexed features in midpref(xc) no longer alone suffices to elect xc as a candidate. More specifically, we introduce the concept of min-prefix, formally stated as follows.
Definition 1 (Min-prefix). Let xc be a set and let pref(xc) be a prefix of xc. Let xp be a reference set. Then pref(xc) is a min-prefix of xc relative to xp, denoted as minpref(xc, xp), iff 1 + rem(xc, j) ≥ minoverlap(xp, xc) holds for all fj ∈ pref(xc).
When processing a probing set xp, the following fact is obvious: if xc first appears in an inverted list associated with a feature fj ∉ minpref(xc, xp), then (xc, xp) cannot meet the overlap bound. We call a feature I(xc, fj) that is not an element of minpref(xc, xp) a stale feature relative to xp.
Example 1. Fig. 2a shows an example with an indexed set I(x1) of size 10 and two probing sets x2 and x3 of size 10 and 16, respectively. Given Jaccard as similarity function and a threshold of 0.6, we have |midpref(x1)| = 3, which corresponds to the number of indexed features of I(x1). For x2, we have |minpref(x1, x2)| = 3; thus, no stale features are present. On the other hand, for x3 as reference set, we have |minpref(x1, x3)| = 1. Hence, I(x1, f2) and I(x1, f3) are stale features.
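As a hedged numeric cross-check of Example 1, the short Python snippet below computes min-prefix sizes for the Jaccard function; rounding the overlap bound up to the next integer is an assumption made here (overlaps are integral), not notation from the paper.

```python
import math

def minoverlap_jaccard(t, s_c, s_p):
    return math.ceil(t / (1 + t) * (s_c + s_p))      # overlap bound (Table 1)

def minpref_size(t, s_c, s_p):
    # number of positions j (1-based) in xc with 1 + rem(xc, j) >= minoverlap(xp, xc),
    # where rem(xc, j) = |xc| - j  (Definition 1)
    return max(0, s_c - minoverlap_jaccard(t, s_c, s_p) + 1)

print(minpref_size(0.6, 10, 10))  # 3 -> no stale features w.r.t. x2
print(minpref_size(0.6, 10, 16))  # 1 -> indexed features at positions 2 and 3 are stale w.r.t. x3
```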
[Fig. 2 shows: (a) an indexed set I(x1) with |x1| = 10 under Jaccard and threshold 0.6, compared against the probing sets x2 (|x2| = 10) and x3 (|x3| = 16); (b) the relationship between |minsize(x)|, midpref(x), maxpref(x), and minpref(x, xp) over the validity window of I(x), up to |maxsize(x)|.]
Fig. 2. Min-prefix example
The relationship between the prefix types is shown in Fig. 2b. The three prefixes are minimal in different stages of an index-based set similarity join by exploiting different kinds of information. In the candidate generation phase, the size lower bound of a probing set x defines maxpref(x), which is used to find candidates among the (shorter) indexed sets. To index x, the database sort order allows reducing the prefix to midpref(x). Finally, the min-prefix determines the minimum amount of information that needs to remain indexed to identify x as a candidate. Differently from the previous prefixes, minpref(x, xp) is defined in terms of a reference set xp, which corresponds to the current probing set within the validity window of x. The following lemma states important properties of stale features according to the database and the feature ordering.
Lemma 1. Let D be a database of sets of features; each set is sorted according to a total order Of. Let I(xc) be an indexed set and xp be a probing set. If a feature I(xc, fj) is stale in relation to xp, then I(xc, fj) is stale for any x'p such that |x'p| ≥ |xp|. Moreover, if I(xc, fj) is stale, then any I(xc, fj') such that j' > j is also stale.
3.3 The mpjoin Algorithm
Algorithm ppjoin only uses stale features for score accumulation. Candidate pairs whose first common element is a stale feature are pruned by the overlap constraint. Because set references are only removed from lists due to size bound checking, repeated processing of stale features is likely to occur very often along the validity window of indexed sets. As strongly suggested by the results reported in Sect. 3.1, such overhead in candidate generation can have a negative impact on the overall runtime efficiency. We now present algorithm mpjoin, listed in Alg. 2, which builds upon the previous algorithms All-pairs and ppjoin. However, it adopts a novel strategy in the candidate generation phase. The main idea behind mpjoin is to exploit the concept of min-prefixes to dynamically reduce the lengths of the inverted lists to a minimum. As a result, a larger number of irrelevant candidate sets are never accessed and the processing costs for inverted lists are drastically reduced. To employ min-prefixes in an index-based similarity join, we need to keep track of the min-prefix size of each indexed set in relation to the current probing
Algorithm 2. The mpjoin algorithm
Input: A set collection D sorted in increasing order of the set size; each set is sorted according to the total order Of; a threshold γ
Output: A set S containing all pairs (xp, xc) such that sim(xp, xc) ≥ γ
 1  I1, I2, . . . , I|U| ← ∅, S ← ∅
 2  foreach xp ∈ D do
 3      M ← empty map from set id to (os, i, j)                       // os = overlap score
 4      foreach fi ∈ maxpref(xp) do                                   // candidate generation phase
 5          Remove all (xc, j) from If s.t. |xc| < minsize(xp)
 6          foreach (xc, j) ∈ If do
 7              if xc.prefsize < j
 8                  Remove (xc, j) from If                            // I(xc, j) is stale
 9                  continue
10              M(xc) ← (M(xc).os + 1, i, j)
11              if M(xc).os + min(rem(xp, i), rem(xc, j)) < minoverlap(xp, xc)
12                  M(xc).os ← −∞                                     // do not consider xc anymore
13              if M(xc).os + rem(xc, j) < minoverlap(xp, xc)
14                  Remove (xc, j) from If                            // I(xc, j) is stale
15              xc.prefsize ← |xc| − minoverlap(xp, xc) + 1           // update prefix size
16      S ← S ∪ Verify(xp, M, γ)                                      // verification phase
17      xp.prefsize ← |midpref(xp)|                                   // set initial prefix size information
18      foreach fi ∈ midpref(xp) do
19          If ← If ∪ {(xp, i)}
20  return S
set. For this reason, we define the min-prefix size information as an attribute of indexed sets, which is named prefsize in the algorithm. At indexing time, prefsize is initialized with the size of midpref (line 17). Further, whenever a particular inverted list is scanned during candidate generation, the prefsize of all related indexed sets is updated using the overlap bound relative to the current probing set (line 15). Stale features can be easily identified by verifying whether the prefsize attribute is smaller than the feature positional information in a given indexed set. This verification is done for each set as soon as it is encountered in a list; set references in lists associated with stale features are promptly removed and the algorithm moves to the next list element (lines 07–09). Additionally, for a given indexed set, stale features may be probed before its prefsize is updated. Because the features of an indexed set are accessed as per the feature order by a probing set (they can be accessed in any order by different probing sets though), a stale feature can only appear as the first common element. In this case, it follows from Definition 1 that the overlap constraint cannot be met and the set reference can be removed from the list (lines 13–14). The correctness of mpjoin partially follows from Lemma 1: it can be trivially shown that the inverted-list reduction strategy of mpjoin does not lead to missing any valid result. Another important property of mpjoin is that score
Algorithm 3. The Verify algorithm
Input: A probing set xp; a map of candidate sets M; a threshold γ
Output: A set S containing all pairs (xp, xc) such that sim(xp, xc) ≥ γ
 1  S ← ∅
 2  foreach xc ∈ M s.t. (overlap ← M(xc).os) ≠ −∞ do
 3      if (fc ← featureAt(xc, xc.prefpos)) < (fp ← featureAt(xp, |maxpref(xp)|))
 4          fp ← featureAt(xp, M(xc).i + 1), fc++
 5      else
 6          fc ← featureAt(xc, M(xc).j + 1), fp++
 7      while fp ≠ end and fc ≠ end do                                // merge-join-based overlap calc.
 8          if fp = fc then overlap++, fp++, fc++
 9          else
10              if rem(min(fp, fc)) + overlap < minoverlap(xp, xc) then break
11              min(fp, fc)++                                         // advance cursor of lesser feature
12      if overlap ≥ minoverlap(xp, xc)
13          S ← S ∪ {(xp, xc)}
14  return S
accumulation is done exclusively on min-prefix elements. This property ensures the correctness of the Verify procedure, which is described in the next section.
4 Further Optimizations
4.1 Verification Phase
A side effect of the index-minimization strategy is the growth of the set of candidates. Besides that, as overlap score accumulation is performed only on min-prefixes, larger subsets have to be examined to calculate the complete overlap score. Thus, high performance is a crucial demand for the verification phase. In [13], feature positional information is used to leverage prior overlap accumulation during candidate generation. We can further optimize the overlap calculation by exploiting the feature order to design a merge-join-based algorithm and the overlap bound to define break conditions. In Alg. 3, we present the algorithm corresponding to the Verify procedure of mpjoin, which applies the optimizations mentioned above. (Note that we have switched to a slightly simplified notation.) The algorithm iterates over each candidate set xc, evaluating its overlap with the probing set xp. First, the starting point for scanning both sets is located (lines 03–06). The approach used here is similar to ppjoin (see [13] for more details). Note that for both sets, the algorithm starts scanning from the feature following either the last match of candidate generation, i.e., i+1 or j+1, or the respective prefixes. No common feature between xp and xc is missed, because only min-prefix elements were used for score accumulation during candidate generation. Otherwise, we could miss a match on a stale feature at a position j with xc.prefpos < j, whose reference to xc in the associated inverted list had been previously removed.
The merge-join-based overlap computation takes place thereafter (lines 7–11). Feature matches increment the overlap accordingly; for each mismatch, the break condition is tested, which consists of verifying whether there are enough remaining features in the set relative to the currently tested feature (line 10). Finally, the overlap constraint is checked and the candidate pair is added to the result if there is enough overlap (lines 12–13).
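The Python sketch below mirrors the merge-join loop with its early-break condition in a simplified form (0-based positions, no prefix bookkeeping); it is an illustration of the idea rather than the actual Verify implementation.

```python
def merge_overlap(x, c, px, pc, acc, bound):
    """x, c: token lists under the global order; px, pc: 0-based resume positions;
    acc: overlap already accumulated on the prefixes; bound: minoverlap(x, c)."""
    while px < len(x) and pc < len(c):
        if x[px] == c[pc]:
            acc += 1
            px += 1
            pc += 1
        elif x[px] < c[pc]:
            if (len(x) - px - 1) + acc < bound:   # tokens after x[px] cannot close the gap
                break
            px += 1
        else:
            if (len(c) - pc - 1) + acc < bound:
                break
            pc += 1
    return acc                                    # the pair qualifies if acc >= bound
```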
4.2 Optimizing Overlap Score Accumulation
Reference [2] argues that hash-based score accumulators and sequential list processing provide superior performance compared to the heap-based merging approach of other algorithms (e.g., [9]). We now propose a simpler approach by eliminating dedicated data structures and corresponding operations for score accumulation altogether: overlap scores (and the matching positional information) can be stored in the indexed set itself as attributes, in the same way as the min-prefix size information. Therefore, the overlap score can be directly updated as indexed sets are encountered in inverted lists. We just have to maintain a (re-sizable) array to store the candidate sets, which will be passed to the Verify procedure. Finally, after verifying each candidate, we clear its overlap score and matching positional information.
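A minimal Python sketch of this idea follows; the record layout and field names are assumptions for illustration. The accumulator lives in the indexed-set record itself, and only a plain list of touched candidates is kept for the verification phase.

```python
class IndexedSet:
    __slots__ = ("tokens", "prefsize", "os", "pi", "pj")
    def __init__(self, tokens, prefsize):
        self.tokens = tokens        # globally ordered feature list
        self.prefsize = prefsize    # current (min-)prefix size
        self.os = 0                 # accumulated overlap score
        self.pi = self.pj = 0       # last matching positions in probing / indexed set

def accumulate(candidates, xc, i, j):
    # xc is the IndexedSet record found in an inverted list
    if xc.os == 0:                  # first match for this probing set: remember as candidate
        candidates.append(xc)
    xc.os += 1
    xc.pi, xc.pj = i, j

def clear(candidates):
    # after Verify: reset the per-set accumulators so the next probing set starts clean
    for xc in candidates:
        xc.os = 0
    candidates.clear()
```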
5 The Weighted Case
We now consider the weighted version of the set similarity join problem. In this version, sets are drawn from a universe of features Uw, where each feature f is associated with a weight w(f). All concepts presented in Sect. 2 can be easily modified to accord with weighted sets. The weighted size of a set x, denoted as w(x), is given by the summation of the weights of its elements, i.e., w(x) = Σf∈x w(f). Correspondingly, the weighted Jaccard similarity (WJS), for example, is defined as WJS(x1, x2) = w(x1 ∩ x2) / w(x1 ∪ x2). The prefix definition has to be slightly modified as well. Given an overlap bound α, the weighted prefix of a set x, denoted as pref(x), is the shortest subset of x such that w(pref(x)) > w(x) − α. We now present the weighted version of mpjoin, called w-mpjoin. The most relevant modifications are listed in Alg. 4. As the main difference to mpjoin, w-mpjoin uses the sum of all feature weights up to a given feature instead of feature positional information. For this reason, we define the cumulative weight of a feature fi ∈ x as c(fi) = Σ1≤j≤i w(fj). We then index c(fi) for each fi ∈ midpref(x) and set the prefsize to the cumulative weight of the last feature in midpref(x) (lines 17–21). Note that feature positional information is still necessary to find the starting point of scanning in the Verify procedure. The utility of the cumulative weight in the candidate generation is twofold. First, it is used for overlap bound checking. Given c(fi), the cumulative weight of the features following fi in x is crem(x, i) = w(x) − c(fi). Hence, crem can be used to verify whether or not there are enough remaining cumulative weights
Algorithm 4. The w-mpjoin algorithm (only the lines that differ from mpjoin are shown)
        ...
        foreach fi ∈ maxpref(xp) do                                   // candidate generation phase
 5          Remove all (xc, c(fj), j) from If s.t. w(xc) < minsize(xp)
 6          foreach (xc, c(fj), j) ∈ If do
 7              if xc.prefsize + w(fj) < c(fj)
 8                  Remove (xc, c(fj), j) from If                     // I(xc, c(fj), j) is stale
 9                  continue
10              M(xc) ← (M(xc).os + w(fj), i, j)
11              if M(xc).os + min(crem(xp, i), crem(xc, j)) < minoverlap(xp, xc)
12                  M(xc).os ← −∞                                     // do not consider xc anymore
13              if M(xc).os + crem(xc, j) < minoverlap(xp, xc)
14                  Remove (xc, c(fj), j) from If                     // I(xc, c(fj), j) is stale
15              xc.prefsize ← w(xc) − minoverlap(xp, xc)              // update prefix size
16          S ← S ∪ Verify(xp, M, γ)                                  // verification phase
17          cweight ← 0
18          foreach fi ∈ midpref(xp) do
19              cweight ← cweight + w(fi)
20              If ← If ∪ {(xp, cweight, i)}
21          xp.prefsize ← cweight
22      ...
to reach the overlap bound (lines 11 and 13). Second, the cumulative weight is used to identify stale features by comparing it with prefsize (line 07). Note that the cumulative weight of the last feature in minpref (xc , xp ) is always greater than w (xc ) − α, for α = minoverlap (xp , xc ). Hence, to be sure that a given feature is stale, we have to add the weight of the current feature to prefsize before comparing it to the cumulative weight. Due to space constraints, we do not discuss the weighted version of the Verify procedure, but the modifications needed are straightforward.
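For illustration, a small Python sketch of the weighted quantities used above (weighted size, weighted Jaccard, cumulative weights, and crem), assuming feature weights are given as a plain dict:

```python
def wsize(x, weights):
    # weighted set size w(x): sum of the weights of the elements
    return sum(weights[f] for f in x)

def weighted_jaccard(x1, x2, weights):
    inter = wsize(set(x1) & set(x2), weights)
    union = wsize(set(x1) | set(x2), weights)
    return inter / union

def cumulative_weights(x, weights):
    # c(f_i) = w(f_1) + ... + w(f_i); indexed together with each prefix feature
    c, total = [], 0.0
    for f in x:
        total += weights[f]
        c.append(total)
    return c

def crem(i, c):
    # cumulative weight of the features following f_i (1-based position i): w(x) - c(f_i)
    return c[-1] - c[i - 1]
```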
6 Experiments
6.1 Experimental Setup
The main goal of our experiments is to measure the runtime performance of our algorithms, mpjoin and w-mpjoin, and compare them against previous, state-of-the-art set similarity join algorithms. All tests were run on an Intel Xeon Quad Core 3350 2.66 GHz Intel Pentium IV computer (two 3.2 GHz CPUs, about 2.5 GB main memory, Java Sun JDK 1.6.0).
Algorithms. We focused on index-based algorithms, because they consistently outperform competing signature-based algorithms [2] (see the discussion in Sect. 7), and implemented the best known index-based algorithms due to Xiao et al. [13]. For unweighted sets, we used ppjoin+, an improved version of ppjoin, which applies a suffix filtering technique in the candidate generation phase to
Fig. 3. Feature frequency and set size distributions
substantially reduce the number of candidates. This algorithm constitutes an interesting counterpoint to mpjoin. We also investigated a hybrid version, which combines mpjoin and ppjoin+ by adding the suffix filtering procedure to mpjoin (Alg. 2, inside the loop of line 6 and after line 15). As recommended by the authors, we performed suffix filtering only once for each candidate pair and limited the recursion level to 2. For weighted sets, however, it is not clear how to adapt the suffix filtering technique, because the underlying algorithm largely employs set partitioning based on subset size. In contrast, when working with weighted sets, cumulative weights have to be used, which requires subset scanning to calculate them also for unseen elements. For this reason, this approach is likely to result in poor performance. Therefore, we refrained from using ppjoin+ and instead employed our adaptation of ppjoin for weighted sets, denoted w-ppjoin. We only considered the in-memory version of all algorithms. Reference [2] presented a simple nested-loop algorithm fetching entire blocks of disk-resident data, which could easily be adapted for the algorithms evaluated here. Because the same in-memory algorithm is used in each outer-loop iteration and the IO overhead is similar for all algorithms, the relative difference in the results reported for the in-memory algorithms should also hold for their external-memory versions. For the evaluation of weighted sets, we used the well-known IDF weighting scheme. Finally, due to space constraints, we only report results for the Jaccard similarity. The corresponding results for other similarity functions follow identical trends.
Datasets. We used two well-known real datasets: DBLP (dblp.uni-trier.de/xml), containing computer science publications, and IMDB (www.imdb.com), storing information about movies. We extracted 0.5M strings from each dataset; each string is a concatenation of authors and title for DBLP, and of movie title together with actor and actress names for IMDB. We converted all strings to upper-case letters and eliminated repeated white spaces. We then generated 4 additional "dirty" copies of each string, i.e., duplicates into which we injected 1–5 character-level modifications (insertions, deletions, and substitutions). Finally, we tokenized all strings into sets of 3-grams, ordered the tokens as described in Sect. 2, and stored the sets in ascending size order. With this procedure, we simulated typical duplicate elimination scenarios [1,5]. Figure 3 shows the feature frequency and set size distributions. The feature frequency distributions of both datasets follow a similar power law, and therefore only the distribution of IMDB is shown. In contrast, the set size distributions are quite different: DBLP has a clear average around 100, whereas IMDB does not seem to cluster around any particular value.
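For illustration, a Python sketch of this preprocessing (3-gram tokenization and random character-level edits); the exact normalization rules and edit policy here are assumptions, not the precise procedure used in the experiments.

```python
import random
import string

def qgrams(s, q=3):
    s = " ".join(s.upper().split())                 # upper-case, collapse repeated whitespace
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def dirty_copy(s, n_edits):
    # inject n_edits random character-level insertions, deletions, or substitutions
    chars = list(s)
    for _ in range(n_edits):
        op, pos = random.choice("ids"), random.randrange(len(chars))
        if op == "i":
            chars.insert(pos, random.choice(string.ascii_uppercase))
        elif op == "d" and len(chars) > 1:
            del chars[pos]
        else:
            chars[pos] = random.choice(string.ascii_uppercase)
    return "".join(chars)
```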
(a) DBLP using unweighted sets
(b) IMDB using unweighted sets
(c) DBLP using weighted sets
(d) IMDB using weighted sets
Fig. 4. Runtime experiments
6.2 Results
Figures 4a and 4b illustrate the performance results for the unweighted version of the algorithms with varying Jaccard similarity threshold. In all settings, mpjoin clearly exhibits the best performance. For the DBLP dataset, mpjoin achieves more than twofold speed-ups over ppjoin+ for thresholds lower than 0.85, whereas the performance gains are up to a factor of 3 for the IMDB dataset. Note that the performance advantage of mpjoin is more prominent at lower thresholds. In such cases, more stale features are present in the inverted lists, for which a larger number of unqualified candidate sets is generated. As a result, the performance of ppjoin+ degrades to a substantially stronger degree. Note that even the hybrid version is slower than mpjoin: the candidate reduction does not pay off the extra effort of suffix filtering. To highlight this observation, Fig. 5 shows the filtering behavior of the algorithms. In the charts, we show the number of candidates eliminated by suffix filtering (SUFF) and by overlap bounds (O BOUND) during candidate generation (see Alg. 1, line 8), and the number of candidate pairs considered in the verification phase (CAND). Note that ppjoin+ eliminates more candidates using O BOUND than mpjoin and hybrid, but a large part of them are candidates related to stale features, i.e., irrelevant candidates that are repeatedly considered along their validity window.
Fig. 5. Filtering behavior, IMDB, threshold 0.8
Finally, we observe that all algorithms are about two times faster on the IMDB dataset. This is due to the wider set size distribution in the IMDB dataset, which results in shorter validity windows for indexed sets. The results for weighted sets are shown in Fig. 4c and 4d. Again, w-mpjoin is the most efficient algorithm: it consistently achieves about twofold speed-ups compared to w-ppjoin. In general, the results for weighted sets show the same trends as those for the unweighted sets. As expected, all algorithms are considerably faster than in the unweighted case, because the weighting scheme results in shorter prefixes.
7 Related Work
A rich variety of techniques has been proposed to improve the time efficiency of set similarity joins. Some examples of such techniques are probabilistic dimension reduction methods [3], signature schemes [1,5], derivation of bounds (e.g., size bounds [9,1,2,13,7]), and exploitation of a specific dataset order [9,2,13]. Additionally, there are two main query processing models. The first uses an unnested representation of sets in which each set element is represented together with the corresponding object identifier. Here, query processing is based on signature schemes and commonly relies on relational database machinery: equi-joins supported by clustered indexes are used to identify all pairs sharing signatures, whereas grouping and aggregation operators together with UDFs are used for verification [5,1]. In the second model, an index is built that maps features to the list of objects containing that feature [9,2,13]; for self-joins, this can be done dynamically as the query is processed. The index is then probed for each object to generate the set of candidates, which will later be evaluated against the overlap constraint. Previous work has shown that approaches based on indexes consistently outperform signature-based approaches [2] (see also [7] for selection queries). The primary reason is that a query processing model based on indexes provides superior optimization opportunities. A major method in this respect is index reduction [2,13], which minimizes the number of features to be indexed. Furthermore, most signature schemes are binary, i.e., a single shared signature suffices to elect a pair of sets as candidates. Also, signatures are solely used to find candidates; matching signatures are not leveraged in the verification phase. As a result, each set in a candidate pair must be scanned from the beginning to compute their similarity. In contrast, approaches based on indexes accumulate overlap scores already during candidate generation. Hence, the set elements accessed in this phase can be ignored in the verification.
8 Conclusion
In this paper, we proposed a new index-based algorithm for set similarity joins. Following a completely different approach from previous work, we focused on a reduction of the computational cost for candidate generation as opposed to a lower number of candidates. For this reason, we introduced the concept of
min-prefix, a generalization of the prefix filtering concept, which allows us to dynamically and safely minimize the length of the inverted lists; hence, a larger number of irrelevant candidate pairs is never considered and, in turn, a drastic decrease of the candidate generation time is achieved. As a side effect of our approach, the workload of the verification phase is increased. Therefore, we optimized this phase by stopping the computation of candidate pairs that do not meet the overlap constraint as early as possible. Finally, we improved the overlap score accumulation by storing scores and auxiliary information within the indexed set itself instead of using a hash-based map. Our experimental results on real datasets confirm that the proposed algorithm consistently outperforms previous ones for both unweighted and weighted sets. Acknowledgement. Work supported by CAPES/Brazil; grant BEX1129/04-0.
References 1. Arasu, A., Ganti, V., Kaushik, R.: Efficient Exact Set-Similarity Joins. In: Proc. VLDB, pp. 918–929 (2006) 2. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling Up All Pairs Similarity Search. In: Proc. WWW, pp. 131–140 (2007) 3. Broder, A.Z.: On the Resemblance and Containment of Documents. In: Proc. Compression and Complexity of Sequences, p. 21 (1997) 4. Chakrabarti, K., Chaudhuri, S., Ganti, V., Xin, D.: An Efficient Filter for Approximate Membership Checking. In: Proc. SIGMOD, pp. 805–818 (2008) 5. Chaudhuri, S., Ganjam, K., Kaushik, R.: A Primitive Operator for Similarity Joins in Data Cleaning. In: Proc. ICDE, p. 5 (2006) 6. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., et al.: Approximate String Joins in a Database (Almost) for Free. In: Proc. VLDB, pp. 491–500 (2001) 7. Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast Indexes and Algorithms for Set Selection Queries. In: Proc. ICDE, pp. 267–276 (2008) 8. Li, C., Lu, J., Lu, Y.: Efficient Merging and Filtering Algorithms for Approximate String Searches. In: Proc. ICDE, pp. 257–266 (2008) 9. Sarawagi, S., Kirpal, A.: Efficient Set Joins on Similarity Predicates. In: Proc. SIGMOD, pp. 743–754 (2004) 10. Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating Similarity Measures: A Large Scale Study in the Orkut Social Network. In: Proc. KDD, pp. 678–684 (2005) 11. Xiao, C., Wang, W., Lin, X.: Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. In: PVLDB, vol. 1(1), pp. 933–944 (2008) 12. Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k Set Similarity Joins. In: Proc. ICDE, pp. 916–927 (2009) 13. Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient Similarity Joins for Near Duplicate Detection. In: Proc. WWW, pp. 131–140 (2008)
Probabilistic Granule-Based Inside and Nearest Neighbor Queries
Sergio Ilarri 1, Antonio Corral 2, Carlos Bobed 1, and Eduardo Mena 1
1 Dept. of Computer Science and Systems Engineering, University of Zaragoza, 50018 Zaragoza, Spain {silarri,cbobed,emena}@unizar.es
2 Dept. of Languages and Computing, University of Almeria, 04120 Almeria, Spain [email protected]
Abstract. The development of location-based services and advances in the field of mobile computing have motivated an intensive research effort devoted to the efficient processing of location-dependent queries. In this context, it is usually assumed that location data are expressed at a fine geographic precision. Adding support for location granules means that the user is able to use his/her own terminology for locations (e.g., GPS, cities, states, provinces, etc.), which may have an impact on the semantics of the query, the way the results are presented, and the performance of the query processing. Along with its advantages, the management of the so-called location granules introduces new challenges for query processing. In this paper, we analyze two popular location-dependent constraints, inside and nearest neighbors, and enhance them with the possibility of specifying location granules. In this context, we study the problem that arises when the locations of the objects are subject to some imprecision.
1 Introduction
Nowadays, there is a great interest in mobile computing, motivated by an ever-increasing use of mobile devices, that aims at providing data access anywhere and at any time. In the mobile computing field, there has been an intensive research effort in location-based services (LBS). These services provide added value by considering the locations of the mobile users in order to offer more customized information. How to efficiently process continuous location-dependent queries (e.g., tracking the available taxi cabs near a moving user) is one of the greatest challenges in location-based services. Thus, these queries require a continuous monitoring of the locations of relevant moving objects in order to keep the answer up to date efficiently. Moreover, even if the set of objects satisfying the query condition does not change, their locations and distances to the user do change continuously, and therefore the answer to the query must be updated with the new location data (e.g., to update the locations of the objects on a map). Existing works on location-dependent query processing implicitly assume GPS locations for the objects in a scenario (e.g., [1,2]). However, precise locations may
104
S. Ilarri et al.
be unavailable or even be inconvenient for the user. Thus, for example, providing the user with the latitude/longitude of the objects in an answer is probably of little use unless this information is combined with some kind of map. For instance, a train tracking application could just need to consider in which city the train currently is, not its exact coordinates. For such applications, it is useful to define the concept of location granule (similar to the concept of place in [3]) as a set of physical locations. In the previous example, every city would correspond to a location granule. Other examples of location granules could be: freeways, buildings, offices in a building, etc. Notice that managing location granules instead of precise geographic locations could also be interesting for privacy reasons. The use of location granules to enhance the expressivity of location-dependent queries was first proposed in [4], which also considered the basic aspects of inside constraints with location granules. The idea is that the user should be able to express queries and retrieve results according to the concept of “location” that he/she requires, whether he/she needs to talk in terms of GPS locations (the finest type of location granule possible) or locations at a different resolution. As described in that work, the use of location granules can have an impact on: 1) the presentation of results (location granules can be represented by using graphics, text, sounds, etc., depending on the requirements of the user), 2) the semantics of the queries (the user expresses the queries according to his/her own location terminology, and therefore the answers to those queries will depend on the interpretation of location granules), and 3) the performance of the query processing (the location tracking overload is alleviated when coarse location granules, instead of precise GPS locations, are used). However, the use of location granules calls for the definition of new query processing approaches. The fact that most locations are inherently uncertain increases the difficulty of this task. In this paper, we focus on query processing issues and study in detail how inside constraints (i.e., constraints that are satisfied by objects located within a certain circular range around a given moving object) and nearest neighbor constraints can be processed by taking into account the possible uncertainty of the locations managed for the objects involved. Dealing with uncertainty leads us to consider probabilistic granule-based inside queries and probabilistic granule-based nearest neighbor queries that, as far as we know, have not been considered in the literature. As an example of the first type of query, imagine that we want to monitor police units (e.g., police cars and policemen) that are, with a probability of at least 80%, within a radius r of the building where a certain suspect is currently located; alternatively, we may want to monitor all the police units that may be within that radius and obtain the probability that they are actually within the radius. As an example of the second type of query, let us suppose a group of tourists arriving at an airport who need five taxi cabs to reach their hotels: they could query for the five taxi cabs that are (most probably) the nearest to their terminal. As opposed to a classical nearest neighbor query without location granules, this query will return taxis outside the boundaries of the terminal only if there are fewer than five taxis available in the terminal (e.g., maybe calling a taxi from a different terminal is more expensive).
The structure of the rest of the paper is as follows. In Section 2, we briefly describe the datatypes and the basic architecture that we consider. In Section 3, we explain the mechanism proposed to process inside constraints with location granules. In Section 4, we focus on nearest constraints. In Section 5, we describe our approach to deal with uncertainty. In Section 6, we present some related work. Finally, some conclusions and plans for future work appear in Section 7.
2 Datatypes and Basic Architecture
A location granule is composed of one or more geographic areas which identify a set of GPS locations under a common name. For example, Madrid is a location granule of type city, such that it can be said that a certain car is in Madrid or in the location (x,y), depending on the location granularity required (city or GPS granularity, respectively). The datatypes managed in the proposed system are summarized in Table 1. Objects are characterized by an identifier, a location (loc.x and loc.y), a class, and possibly other attributes specific to their class. A location granule has an identifier and is represented by a set of figures (Fs). It provides three main operators: inGr (short for inGranule) returns a boolean indicating whether a certain GPS location is within the granule, distGr (distanceGranule) computes the limits-distance between the GPS location provided and the granule (defined as the minimum distance to the boundaries of the areas composing the granule), and distBtwGrs (distanceBetweenGranules) computes the distance between two granules. A location granule mapping has an identifier and is composed of a set of granules. The following main operators are defined for granule mappings: getGrs (getGranules) returns the subset of granules that contain the given GPS location, getNGr (getNearestGranule) obtains the nearest granule to the specified GPS location, and finally getGrsObj (getGranulesObject) and getNGrObj (getNearestGranuleObject) are similar to the previous two operators but considering an object instead of a GPS location. For operators that need to return a single answer (e.g., getNGrObj), if there are several results satisfying the operator (e.g., two granules at the same distance from the object) then one is returned randomly. For brevity, we will use gr instead of getNGrObj in the rest of the paper. It should be noted that several disconnected areas can define a location granule; for example, Spain could be seen as a location granule that consists of its peninsular provinces and its islands. Different granule mappings could be defined over the same geographic area, and the user can choose the most appropriate one for his/her context. Moreover, a granule can belong to several mappings at the same time. In this way, granule definitions can be re-used to compose different granule mappings. For instance, if we have a granule mapping M1 composed of granules corresponding to the different provinces of Spain and another mapping M2 with granules defining regions in Spain, we can build a new granule mapping
Table 1. Datatypes and main operators
– Object (O): representObject
– Location Granule (G): inGr: G x GPS → Boolean; distGr: GPS x G → Real; distBtwGrs: G x G → Real; representGranule
– Granule Mapping (M): getGrs: M x GPS → ℘(G); getNGr: M x GPS → G; getGrsObj: M x O → ℘(G); getNGrObj (gr): M x O → G
where some regions in M2 are replaced by the corresponding province granules in M1 (because we could want a finer location granularity within those regions).
We focus on the query processing of location-dependent queries with location granules; for example, “retrieve the cars that are within 100 miles of the city where car38 is, showing their locations with city granularity” (i.e., indicating the city where each retrieved car is). To process these types of queries, two main architectural elements are considered:
– The Location Server is a module of a Server computer which handles location data about moving objects and is able to answer standard SQL-like queries about them. No assumption is made about the way that this location information is managed (e.g., stored in databases, estimated using predefined trajectories, or pulled on demand from the moving objects themselves).
– The Query Processor is a module of the Server able to process location-dependent queries with location granules by interacting with the Location Server. This module will consider the granule mappings specified by the user in the query constraints. These mappings may be defined by the user himself/herself (User Mappings) or be predefined (Server Mappings).
For illustrative purposes, we use an SQL-like syntax to express the queries and constraints, which allows us to emphasize the use of location granules and state the queries concisely. The structure of location-dependent queries is:

SELECT projections FROM sets-of-objects WHERE boolean-conditions
where sets-of-objects is a list of object classes that identify the kind of objects interesting for the query, boolean-conditions is a boolean expression that selects objects from those included in sets-of-objects by restricting their attribute values and/or demanding the satisfaction of certain location-dependent constraints, and projections is the list of attributes or location granules that must be retrieved from the selected objects.
Specifications of granule mappings can appear in the SELECT and/or in the WHERE clause of a query, depending on whether those location granules must be used for the visualization of results or for the processing of constraints, respectively. If no location granule mappings are specified, GPS locations are assumed. In this paper, we only focus on query processing issues (we refer readers interested in the presentation aspect to [4] and the interactive applet at http://sid.cps.unizar.es/ANTARCTICA/LDQP/granulesRepresentation.html). For clarity, we will consider that granule mappings are specified by using the gr operator, but using getGrsObj is also possible.
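To make the operators of Table 1 more concrete, the following is a minimal Python sketch of how the granule and granule-mapping datatypes could be modeled, using Shapely geometries for the figures composing a granule; the class and method names are our own illustrative assumptions, not part of the system described in the paper.

from shapely.geometry import Point

class Granule:
    def __init__(self, gid, figures):
        self.gid = gid          # granule identifier
        self.figures = figures  # list of (possibly disconnected) Shapely areas

    def in_gr(self, p):
        # inGr: does the GPS location p = (x, y) fall within the granule?
        return any(f.intersects(Point(p)) for f in self.figures)

    def dist_gr(self, p):
        # distGr: minimum distance from p to the areas composing the granule
        # (zero if p falls inside one of them)
        return min(f.distance(Point(p)) for f in self.figures)

    def dist_btw_grs(self, other):
        # distBtwGrs: minimum distance between the figures of two granules
        return min(f.distance(g) for f in self.figures for g in other.figures)

class GranuleMapping:
    def __init__(self, mid, granules):
        self.mid = mid
        self.granules = granules

    def get_grs(self, p):
        # getGrs: the subset of granules that contain the GPS location p
        return [g for g in self.granules if g.in_gr(p)]

    def get_ngr(self, p):
        # getNGr: the nearest granule to the GPS location p
        return min(self.granules, key=lambda g: g.dist_gr(p))

    def gr(self, obj):
        # getNGrObj (gr): the nearest granule to an object's location
        return self.get_ngr((obj.loc.x, obj.loc.y))

A query like the earlier example (“retrieve the cars that are within 100 miles of the city where car38 is, showing their locations with city granularity”) would then resolve each of its granule specifications through such a getNGrObj-style lookup over a city mapping.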
3 Processing Inside Constraints with Location Granules
In this section, we explain how inside constraints are processed. The general syntax of an inside constraint is inside(r, obj, target), which retrieves the objects of a certain class target (such objects are called target objects and their class the target class) within a specific distance r (which is called the relevant radius) of a certain moving object obj (which is called the reference object of the constraint). In this general syntax, the locations of the reference object and the target objects are considered as GPS locations. However, the second and/or the third argument of the inside constraint can specify that the location has to be interpreted according to a certain granule mapping instead, as we will explain in the following. Thus, obj can be replaced by gr(map1, obj) and target can be replaced by gr(map2, target), where map1 and map2 are granule mapping identifiers (not necessarily the same one). As an example, “SELECT Car.id FROM Car WHERE inside(130 miles, car38, Car)” is a query that retrieves the cars within 130 miles of car38. In this example, the reference object is car38, the target class is Car, and the relevant radius is 130 miles. If granules are associated with the inside constraint of that query, three cases (with different semantics) can be distinguished, as we will explain in the rest of this section (an interactive demonstration showing these different cases is available as a Java applet at http://sid.cps.unizar.es/ANTARCTICA/LDQP/granules.html).

3.1 Inside Constraint with a Granule for the Reference Object
In this case, the corresponding inside constraint is interpreted as follows:

inside(r, gr(map, obj), target) = {oi | (oi ∈ target) ∧ (∃ p ∈ GPS | inGr(gr(map, obj), p) ∧ distance(p, (oi.loc.x, oi.loc.y)) ≤ r)}
where distance represents the Euclidean distance between two geographic locations. This constraint retrieves the target objects (instances of the class target) whose distance from the granule of the reference object obj (according to the granule mapping map) is not greater than the relevant radius r. As an example, inside(130 miles, gr(province, car38), Car) is satisfied by the cars within 130 miles of the province of car38 (the reference object of the inside constraint).
To obtain the objects that satisfy an inside constraint with a granule for the reference object, the following operations are performed: 1) the granule of the reference object is obtained; 2) the area/s corresponding to such a granule is/are enlarged by the relevant radius in order to obtain the relevant area/s; and 3) the target objects within that area are retrieved. The operation corresponding to the second step, which implies computing the Minkowski sum [5] of the area/s composing the granule and a disk whose radius is the relevant radius, is called buffering in the context of Geographic Information Systems [6]:

buffer(r, granule) = granule' | (|granule.Fs| = |granule'.Fs|) ∧ (∀ Fi' ∈ granule'.Fs: ∃ Fi ∈ granule.Fs | Fi' = buffer(r, Fi)), where:
– buffer(r, A) = A' | (∀ p ∈ GPS: contains(A, p) =⇒ containsF(A', circle(p, r)))
– containsF(area2, area1) ⇐⇒ (∀ p ∈ GPS: contains(area1, p) =⇒ contains(area2, p))
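A possible rendering of these three steps, under the same Shapely-based assumptions as the sketch in Section 2 (the function name and object attributes are ours), is the following:

from shapely.geometry import Point
from shapely.ops import unary_union

def inside_reference_granule(r, mapping, ref_obj, target_objects):
    # inside(r, gr(mapping, ref_obj), target): target objects within
    # distance r of the granule of the reference object.
    g = mapping.gr(ref_obj)                                   # 1) granule of the reference object
    relevant = unary_union([f.buffer(r) for f in g.figures])  # 2) enlarge by r (buffering)
    return [o for o in target_objects                         # 3) targets inside the relevant area/s
            if relevant.intersects(Point(o.loc.x, o.loc.y))]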
3.2 Inside Constraint with a Granule for the Target Class
An inside constraint can include a location granule for the target class, in which case the constraint is interpreted as follows:

inside(r, obj, gr(map, target)) = {oi | (oi ∈ target) ∧ (∃ p ∈ GPS | inGr(gr(map, oi), p) ∧ distance(p, (obj.loc.x, obj.loc.y)) ≤ r)}
That is, the constraint is satisfied by the target objects (instances of the class target) located in location granules (defined by the granule mapping map) whose boundaries intersect a circle of the relevant radius r centered on the reference object obj. As an example, the constraint inside(130 miles, car38, gr(province, Car)) is satisfied by the cars located in provinces whose boundaries are (totally or partially) within 130 miles of car38. To obtain the objects that satisfy an inside constraint with a location granule for the target class, the following operations are performed: 1) a circular area whose radius is the relevant radius, centered on the current GPS location of the reference object, is computed; 2) the granules intersected by such an area are determined; and 3) the target objects within any of those granules are obtained.
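Analogously, a hedged sketch of this second case (again assuming the illustrative Granule/GranuleMapping classes from Section 2) could be:

from shapely.geometry import Point

def inside_target_granule(r, mapping, ref_obj, target_objects):
    # inside(r, ref_obj, gr(mapping, target)): target objects located in
    # granules intersected by a circle of radius r around the reference object.
    circle = Point(ref_obj.loc.x, ref_obj.loc.y).buffer(r)   # 1) circular relevant area
    hit = [g for g in mapping.granules                        # 2) granules intersecting it
           if any(circle.intersects(f) for f in g.figures)]
    return [o for o in target_objects                         # 3) targets within any such granule
            if any(g.in_gr((o.loc.x, o.loc.y)) for g in hit)]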
3.3 Inside Constraint with Granules for the Reference and Target
This final situation is a mixture of the two previous cases. The corresponding inside constraint is interpreted as follows:

inside(r, gr(map1, obj), gr(map2, target)) = {oi | (oi ∈ target) ∧ (∃ p1 ∈ GPS, ∃ p2 ∈ GPS | (distance(p1, p2) ≤ r) ∧ inGr(gr(map2, oi), p1) ∧ inGr(gr(map1, obj), p2))}
That is, this constraint is satisfied by the target objects (instances of the class target) located in location granules (defined by the granule mapping map2) whose boundaries are within the relevant radius r from the granule of the reference object obj (determined according to the granule mapping map1). As an
Fig. 1. Inside with granules for the reference object and the target class: steps — (a) Step 1, (b) Step 2, (c) Step 3
example, the constraint inside(130 miles, gr(province, car38), gr(province, Car)) is satisfied by the cars located in provinces whose borders are (totally or partially) within 130 miles of the province where car38 is. In this example, the same type of granule province is specified for both the reference object and the target class. In this case, to obtain an answer, an area is computed first by enlarging the granule of the reference object by the relevant radius (see Figure 1.a), exactly as is done in steps 1 and 2 of the case explained in Section 3.1 (case 1). Then, the set of granules intersected by such an area is determined (see Figure 1.b), similarly to step 2 of the case considered in Section 3.2 (case 2). Finally, the objects within those granules are retrieved (see Figure 1.c).
4 Processing NN Constraints with Location Granules
In this section, we explain how nearest neighbor constraints are processed. The general syntax of a nearest neighbor constraint is nearest(N, obj, target), which is satisfied by the N objects of the class target that are the nearest ones to the reference object obj. The argument N is optional (if not provided, N is assumed to be one). Unless specified otherwise, the locations of the reference object and the target objects are considered as GPS locations. As an example, the query “SELECT Car.* FROM Car WHERE nearest(car38, Car)” retrieves the attributes of the nearest car to car38 in terms of geographic distance between GPS coordinates. However, the second and/or the third argument of the nearest constraint can specify that the location has to be interpreted according to a certain granule mapping instead. If granule mappings are specified in a query, three cases can be distinguished, as we will discuss in the rest of this section.

4.1 Nearest Constraint with a Granule for the Reference Object
In this case, the corresponding nearest constraint is interpreted as follows:

nearest(N, gr(map, obj), target) = S | (|S| = N) ∧ (gr(map, obj) = g) ∧ S = {oi | (oi ∈ target) ∧ (∄ oj ∈ target − S | distGr(oj.loc, g) < distGr(oi.loc, g))}
That is, the constraint is satisfied by the N objects of the class target that are the nearest to the granule of obj (according to the granule mapping map). As an
Fig. 2. Nearest neighbor queries in different scenarios: (a) 4-NN and (b) 2-NN
example, the constraint nearest(5, gr(province, car38), Car) is satisfied by the five nearest cars to the province of car38. To obtain the objects that satisfy a nearest constraint with a granule for the reference object, the following operations are performed. First, the granule of the reference object is obtained. Then, the objects of the class target within that granule are retrieved; let us assume that the number of objects retrieved is M. All those objects are closer to the granule than any other object outside the granule. If M ≥ N, then any N of those M objects can be retrieved as satisfying the nearest constraint, as all of them have the same distance to the granule (that is, zero). If M < N, then we must retrieve N' = N − M additional objects. For this, we draw inspiration from [7], which retrieves the objects within a query sphere that increases iteratively until all the nearest objects required are obtained. In our case, to retrieve additional objects, we apply a buffering operation buffer(r', gr(map, obj)) on the granule of the reference object obj, as explained in Section 3.1, and retrieve the objects within that enlarged granule. The best choice for the expansion radius r' would be the smallest value that allows us to retrieve N' additional objects. Unfortunately, we can only try to guess an appropriate value for r'. If fewer than N' objects are retrieved, we need to expand r' and repeat the operation until we have retrieved a number of objects M' ≥ N'. Then, we sort these objects according to their distances to gr(map, obj) and consider the first N' objects to complete the set of objects satisfying the constraint (the order between two objects at the same distance is arbitrary). In practice, the maximum radius r' to consider will be limited, as the user will not be interested in objects further than a certain distance from the reference object [8]. In that case, the number of objects returned may be smaller than N. We show a couple of examples in Figure 2. In the scenario on the left, we consider a 4-NN query, which would retrieve the four objects o1, o2, o3, and o4. One of these objects (o2) is within the granule (and therefore at distance zero from the reference object) and the other three are obtained by using the expansion mechanism described above. In the scenario on the right, we consider a 2-NN query which retrieves the objects o1 and o2. It should be noted that one of the objects retrieved (o1) is clearly further, in terms of Euclidean distance, than one of the objects that is not retrieved (o3). For simplicity, we consider the
same expansion radius in both scenarios, although it could obviously be reduced for the scenario on the right. We would like to emphasize that if there are additional standard (i.e., not location-dependent) SQL constraints in the query that affect the target class target, those constraints must be verified before selecting the N objects. Otherwise, some of those objects could be discarded later if they do not satisfy such constraints, and the processing would end up with fewer than the N objects required. For simplicity, and without loss of generality, we will assume in the rest of the paper that no such constraints exist.
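The iterative expansion just described might be sketched as follows; the initial radius, growth factor and maximum radius are arbitrary illustrative choices, and dist relies on the distGr operator of the assumed Granule class (zero for objects inside the granule):

def nearest_reference_granule(n, mapping, ref_obj, target_objects,
                              r_init=1.0, growth=2.0, r_max=100.0):
    # nearest(n, gr(mapping, ref_obj), target): n objects closest to the
    # granule of the reference object, via iterative expansion of r'.
    g = mapping.gr(ref_obj)
    def dist(o):
        return g.dist_gr((o.loc.x, o.loc.y))   # zero for objects inside the granule
    result = [o for o in target_objects if dist(o) == 0]
    if len(result) >= n:
        return result[:n]                      # any n of them qualify (all at distance zero)
    r, candidates = r_init, []
    while len(result) + len(candidates) < n and r <= r_max:
        # membership in buffer(r', granule) is equivalent to dist(o) <= r'
        candidates = [o for o in target_objects if 0 < dist(o) <= r]
        r *= growth                            # expand r' and retry if not enough objects
    candidates.sort(key=dist)
    return result + candidates[: n - len(result)]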
4.2 Nearest Constraint with a Granule for the Target Class
In this case, the corresponding nearest constraint is interpreted as follows:

nearest(N, obj, gr(map, target)) = S | (|S| = N) ∧ S = {oi | (oi ∈ target) ∧ (gr(map, oi) = gi) ∧ (∄ oj ∈ target − S | (gr(map, oj) = gj) ∧ distGr(obj.loc, gj) < distGr(obj.loc, gi))}
That is, the constraint is satisfied by the N objects of the class target that are in granules that are the closest ones to the geographic location of the reference object obj. As an example, the constraint nearest(5, car38, gr(province, Car)) is satisfied by the five cars that are in the provinces whose boundaries are the closest ones to the current location of car38. To obtain the objects that satisfy a nearest constraint with a granule for the target class, we follow an approach similar to the one described for the case where the location granule affects the reference object. However, in the iterative step, instead of a buffering operation with increasing radius, we need to consider a query sphere (centered on the location of obj) with increasing radius, as in [7]. The granules intersecting the sphere, and the objects of the class target within those granules, are retrieved, similarly to what was explained in Section 3.2 for inside constraints. Once we have performed enough iterations to collect at least N objects, we sort them according to the distances of their granules to the location of obj and retrieve the N objects with the smallest distances.
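In the same illustrative style, the query-sphere expansion for this case could look as follows (the initial radius, growth factor and maximum radius are again assumptions of ours, not values prescribed by the paper):

from shapely.geometry import Point

def nearest_target_granule(n, mapping, ref_obj, target_objects,
                           r_init=1.0, growth=2.0, r_max=100.0):
    # nearest(n, ref_obj, gr(mapping, target)): n objects lying in the
    # granules closest to the GPS location of the reference object.
    loc = (ref_obj.loc.x, ref_obj.loc.y)
    def granule_dist(o):
        return mapping.gr(o).dist_gr(loc)       # distance of o's granule to obj's location
    r, candidates = r_init, []
    while len(candidates) < n and r <= r_max:
        sphere = Point(loc).buffer(r)           # growing query sphere centered on obj
        hit = [g for g in mapping.granules
               if any(sphere.intersects(f) for f in g.figures)]
        candidates = [o for o in target_objects
                      if any(g.in_gr((o.loc.x, o.loc.y)) for g in hit)]
        r *= growth
    candidates.sort(key=granule_dist)
    return candidates[:n]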
4.3 Nearest Constraint with Granules for the Reference and Target
In this case, the corresponding nearest constraint is interpreted as follows:

nearest(N, gr(map1, obj), gr(map2, target)) = S | (|S| = N) ∧ (gr(map1, obj) = g) ∧ S = {oi | (oi ∈ target) ∧ (gr(map2, oi) = gi) ∧ (∄ oj ∈ target − S | (gr(map2, oj) = gj) ∧ distBtwGrs(g, gj) < distBtwGrs(g, gi))}
That is, the constraint is satisfied by the N objects of the class target that are in granules (of the mapping map2) that are the closest ones to the granule of the reference object obj (according to the mapping map1). As an example, the constraint nearest(5, gr(province, car38), gr(province, Car)) is satisfied by the five cars that are in the provinces whose boundaries are the closest ones to the current province
of car38. In this example, the same mapping is used for both the reference object and the target class, but this is not required. To obtain the objects that satisfy a nearest constraint with a granule for the reference object and the target class, we iteratively apply buffering operations on the granule of the reference object (according to the mapping map1) as described in Section 4.1. In each iteration, we compute the granules of the mapping map2 that intersect the obtained buffer and retrieve the objects of the class target within. Once we have collected N objects, we stop iterating. Then, we just need to sort the objects according to the distances between their map2-based granules and the map1-based granule of the reference object, and return the N objects with the smallest distances. The distance between two granules can be computed as the minimum of the distances between the figures composing them (to compute the distance between two polygons, the algorithm proposed at http://cgm.cs.mcgill.ca/~orm/mind2p.html can be used). Notice that some optimizations are possible if map1 = map2. Thus, for example, we could start by retrieving the objects within the granule of the reference object as explained in Section 4.1. If the mappings to consider for the reference object and the target class are different, then obtaining such objects is of no use.
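With Shapely figures, the granule-to-granule distance mentioned above reduces to a minimum over pairwise figure distances, since Shapely's distance already returns the minimum separation between two geometries; the helper below is only a sketch of that computation (equivalent to the dist_btw_grs method assumed earlier):

def dist_btw_grs(granule_a, granule_b):
    # Minimum of the distances between the figures composing the two granules
    # (zero if any pair of figures touches or overlaps).
    return min(fa.distance(fb)
               for fa in granule_a.figures
               for fb in granule_b.figures)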
5 Dealing with Imprecise Locations
In the previous description, there is the implicit assumption that it is possible to obtain the GPS location of an object by querying the corresponding Location Server. However, sometimes this fine-grain location information is not available. This could be due to the imprecision inherent to the location mechanism used to obtain the locations of the moving objects (e.g., if the cell-ID positioning mechanism is used [9], a location can be as imprecise as the size of the cell where it is contained) or due to performance reasons (the tracking cost incurred to keep a GPS location up-to-date is higher than when a coarser location resolution is used). These coarse-grain locations can be considered as special kinds of location granules, which are called uncertainty-based location granules in this paper. Two types of uncertainty-based location granules can be identified:
– Relative uncertainty-based location granules. This occurs when the geographic location of the object is available but there is a certain degree of uncertainty. For example, if the Location Server knows that the precision of the GPS location of an object is within five meters, then the granule is defined as a circle of a five-meter radius centered on that GPS location. Three sources of uncertainty can be considered by the Location Server to compute this imprecision: the inherent imprecision of the positioning technology (e.g., GPS), the latency in communicating the location to the Server, and the threshold used by the location update policy to minimize the tracking cost (e.g., see [10]).
– Absolute uncertainty-based location granules. This means that the actual location is within a certain geographic area. For example, let us consider a
Bluetooth-based (http://www.bluetooth.com) positioning mechanism that makes it possible to determine the room where a person is located within a building. In this example, not only does the precision of the positioning mechanism advise the use of such a granularity level, but a more precise location would probably be useless too.
For clarity, in the following we will explain separately the case of uncertainty-based location granules that affect a reference object and the case of granules that affect the target objects (although, obviously, both situations can co-exist).

5.1 Uncertainty-Based Location Granule for a Reference Object
If a location granule mapping has been specified in the query for the reference object and the retrieved location of the reference object is an uncertainty-based location granule, then obtaining a granule according to the specified mapping can return several granules. This is because the uncertainty-based location granule could intersect several granules in the required granule mapping. The unnamed granule whose area is the union of the areas of all such granules must be used as the granule indicating the location of the reference object. Such a granule will be called a union-based location granule because it is obtained as the union of several granules defined in the granule mapping specified by the user. At this point, as other works do (see Section 6), we assume that the Location Server provides, along with the location of the reference object, a probability density function (pdf) that represents the distribution of the probabilities of the possible locations of the object. In this case, we could use that pdf to assign to each of the intersecting granules the probability that the object is actually within such a granule, by evaluating an integral (if no pdf is available, we could just assume, for example, a uniform distribution and compute the probabilities by considering the percentage of overlapping area). Then, three options are possible:
– If the user is just interested in the most probable answer, we could just consider the most probable granule for the reference object –the granule with the highest probability p computed– and continue processing the constraint as usual. The answer obtained will be the correct one with probability p.
– If the user is interested in all the possible answers, tagged with their probability, or in a may/must interpretation of the query, we could process the constraint several times, once for each granule composing the union-based location granule. In this case, each constraint obtains a set of objects that satisfy the constraint with a certain probability (the probability that the reference object is within the corresponding granule). The union of the answers for each of those constraints would lead to the query interpreted with may semantics, whereas the intersection implies a must semantics [11].
– If the user is interested in a may interpretation of the query, besides the previous approach (less efficient), we can also simply consider the union-based location granule and return all the objects that may be in the answer, independently of the probability that they are part of the answer.
Fig. 3. Uncertainty-based location granules: two sample scenarios — Scenario 1 (granules g1 and g2 for objects o1 and o2, and uncertainty-based granule g for object o); Scenario 2 (GPS locations for objects o1 and o2, and uncertainty-based granule g for object o)
Let us consider now the case where no granule mapping has been specified for the reference object. This implies that the uncertainty-based location granule of the reference object has no semantic meaning (i.e., it is not composed of underlying location granules that are in a granule mapping specified by the user). Instead, the user has expressed a constraint relative to the fine-grained geographic location of the reference object, and the use of the uncertainty-based location granule is simply forced by the lack of precision in the location available. For the case of an inside constraint, we proceed as explained above. However, if we have a nearest neighbor constraint, considering such a granule as if it were determined based on a mapping specified by the user would be misleading. As an example, in Figure 3 object o2 (instead of o1) is more likely to be the closest object to o, both in the left and in the right scenarios. As the uncertainty-based granule of o (or any of its underlying granules) has no meaning for the user, we must weigh the probability that the reference object is at any point within that granule when computing the distance to another object for nearest neighbor processing. Intuitively, if no granule is specified for the target objects either, and the uncertainty-based granule of the reference object o is g, then we can statistically estimate the average distance to a target object ot as follows:

distGr(ot.loc, g) = Σ_{p | inGr(g, p)} distance(p, ot.loc) ∗ Prob(p)

where Prob(p) represents the probability that the reference object is at location p (according to the corresponding probability density function). The previous expression simply tells us that we can compute the expected distance as the average of the possible distances weighted by their probabilities. Similarly, if there is a granule mapping specified for the target class and the target object ot is in granule g′, we could estimate the average distance as:

distBtwGrs(g, g′) = Σ_{p | inGr(g, p)} distGr(p, g′) ∗ Prob(p)
There are several methods available to compute the previous integrals, as explained in [12]. We have proposed here the computation of an expected distance, as advocated in works such as [12,13]. An alternative would be to compute the probability that an object ot is the kth nearest object to the reference object o, by considering the uncertain locations of the other objects as well (e.g., see [14]).
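As a rough illustration of how such probabilities and expected distances could be approximated without closed-form integration, the following Monte Carlo sketch reuses the illustrative GranuleMapping class from Section 2 and assumes a user-supplied sample_pdf() function that draws candidate locations of the reference object from its pdf (the function name and sample size are our own assumptions, not part of the paper):

from collections import defaultdict
from math import dist

def granule_probabilities(mapping, sample_pdf, samples=10000):
    # Approximate, for each granule of the mapping, the probability that the
    # reference object actually lies within it, by sampling its location pdf.
    counts = defaultdict(int)
    for _ in range(samples):
        p = sample_pdf()                  # one candidate GPS location (x, y)
        for g in mapping.get_grs(p):      # granules containing that location
            counts[g.gid] += 1
    return {gid: c / samples for gid, c in counts.items()}

def expected_distance_to_object(sample_pdf, target_obj, samples=10000):
    # Approximate E[distance] between the uncertain reference location and a
    # target object, i.e., the weighted average discussed in Section 5.1.
    total = 0.0
    for _ in range(samples):
        total += dist(sample_pdf(), (target_obj.loc.x, target_obj.loc.y))
    return total / samples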
5.2 Uncertainty-Based Location Granule for a Target Object
In this case, for inside constraints the mapping should be performed at the level of the Location Servers. Thus, if a Location Server is asked about the moving objects within a certain area and the location of a certain object is an uncertainty-based location granule, then the Location Server will need to compute the intersection of such a location granule and the given area and return that object if the intersection is not empty. In case a granule mapping has been specified for the target class in that constraint, the underlying granules are also returned and can be processed similarly to what was explained in the previous section. If not, as in the previous case, the object can be assigned a probability of being actually within the area indicated. For nearest neighbor constraints, the uncertainty-based location granule for a target object can be managed similarly. First, the Location Server returns the target objects within the area requested (as explained above). Then, if the user is interested in the GPS locations of the target objects (i.e., he/she specified no location granule for the target class), we can compute distances by weighting the probability that the object is at different locations within the uncertainty-based location granule (as explained in Section 5.1 for the reference object). If, on the contrary, the user specified a granule mapping for the target class, we can use the probabilities assigned to the different underlying granules to compute the possible answers.
6 Related Work
Both inside queries [1,2] and nearest neighbor queries [15,16] have been studied in the literature of spatio-temporal and moving object databases (see [17] for a survey of different works in the field). However, existing works on location-dependent query processing implicitly assume GPS locations for the objects in a scenario. Although some works acknowledge the importance of considering different location resolutions (e.g., [3]), the processing of classical constraints such as inside or nearest is not considered in that context. The importance of dealing with the uncertainty of location information is emphasized in different works. Probabilistic queries, even though not in the context of moving objects, were introduced in [18]. For moving objects, probabilistic queries are usually computed by estimating the locations of the objects through a probability density function that models the uncertainty, such that the probability that an object is within a certain region can be computed by integration. As solving these integrals is frequently expensive (numerical methods are usually required), a filter step is introduced to prune the search space. Different relevant proposals exist in the literature. For example, probabilistic range queries are the focus of [19] and probabilistic nearest neighbor queries are studied in works such as [20,14]. As it is difficult to provide a good overview of contributions in this area in a short space, we refer the interested reader to [17]. No existing proposal has considered probabilistic queries with location granules.
7 Conclusions and Future Work
The expressivity of location-dependent queries can be enhanced by allowing the user to specify the use of location granules. This brings the query to the user’s level and may impact not only the query semantics but also the performance and the way the results are presented to the user. In this paper, we have focused on the semantic aspects by studying how inside and nearest neighbor constraints can be processed when location granules are involved. Moreover, we have proposed solutions to deal with uncertainty in the locations available. Thus, besides the processing of constraints with location granules, an important novelty of this work lies in the combination of location granules with probabilistic approaches. We are currently carrying out an exhaustive performance evaluation with different real and synthetic granule mappings and locations within the context of the distributed location-dependent query processing system LOQOMOTION [8]. Our preliminary results are promising. As future work, we plan to study other popular location-dependent constraints (such as closest-pairs and similarity joins [21]) from the perspective of location granules. We will also consider the integration of other approaches to compute the probabilistic distances for nearest neighbor queries (e.g., by considering the proposal in [14]).
Acknowledgements. Work supported by the projects TIN2007-68091-C02-02 and TIN2008-03063.
References 1. Cai, Y., Hua, K.A., Cao, G., Xu, T.: Real-time processing of range-monitoring queries in heterogeneous mobile databases. IEEE Transactions on Mobile Computing 5(7), 931–942 (2006) 2. Gedik, B., Liu, L.: MobiEyes: A distributed location monitoring service using moving location queries. IEEE Transactions on Mobile Computing 5(10), 1384–1402 (2006) 3. Hoareau, C., Satoh, I.: A model checking-based approach for location query processing in pervasive computing environments. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM-WS 2007, Part II. LNCS, vol. 4806, pp. 866–875. Springer, Heidelberg (2007) 4. Ilarri, S., Mena, E., Bobed, C.: Processing location-dependent queries with location granules. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM-WS 2007, Part II. LNCS, vol. 4806, pp. 856–865. Springer, Heidelberg (2007) 5. Skiena, S.S.: The Algorithm Design Manual. Springer, New York (2008) 6. van Kreveld, M.: Computational geometry: Its objectives and relation to GIS. Nederlandse Commissie voor Geodesie (NCG), pp. 1–8 (2006) 7. Jagadish, H., Ooi, B., Tan, K., Yu, C., Zhang, R.: iDistance: An adaptive B+tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems 30(2), 364–397 (2005)
8. Ilarri, S., Mena, E., Illarramendi, A.: Location-dependent queries in mobile contexts: Distributed processing using mobile agents. IEEE Transactions on Mobile Computing 5(8), 1029–1043 (2006) 9. Trevisani, E., Vitaletti, A.: Cell-ID location technique, limits and benefits: An experimental study. In: 6th IEEE Workshop on Mobile Computing Systems and Applications (WMCSA 2004), English Lake District, UK, December 2004, pp. 51– 60 (2004) 10. Wolfson, O., Jiang, L., Sistla, A.P., Chamberlain, S., Rishe, N., Deng, M.: Databases for tracking mobile units in real time. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 169–186. Springer, Heidelberg (1999) 11. Sistla, A., Wolfson, O., Chamberlain, S., Dao, S.: Querying the uncertain position of moving objects. In: Etzion, O., Jajodia, S., Sripada, S. (eds.) Dagstuhl Seminar 1997. LNCS, vol. 1399, pp. 310–337. Springer, Heidelberg (1998) 12. Xiao, L., Hung, E.: An efficient distance calculation method for uncertain objects. In: Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 2007, pp. 10–17 (2007) 13. Ngai, W.K., Kao, B., Chui, C.K., Cheng, R., Chau, M., Yip, K.Y.: Efficient clustering of uncertain data. In: 6th International Conference on Data Mining (ICDM 2006), Hong Kong, pp. 436–445 (2006) 14. Beskales, G., Soliman, M.A., IIyas, I.F.: Efficient search for the top-k probable nearest neighbors in uncertain databases. Proceedings of the VLDB Endowment 1(1), 326–339 (2008) 15. Mouratidis, K., Papadias, D., Bakiras, S., Tao, Y.: A threshold-based algorithm for continuous monitoring of k nearest neighbors. IEEE Transactions on Knowledge and Data Engineering 17(11), 1451–1464 (2005) 16. Zheng, B., Xu, J., Lee, W.C., Lee, L.: Grid-partition index: A hybrid method for nearest-neighbor queries in wireless location-based services. The VLDB Journal 15(1), 21–39 (2006) 17. Ilarri, S., Mena, E., Illarramendi, A.: Location-dependent query processing: Where we are and where we are heading. ACM Computing Surveys (to appear) 18. Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: ACM SIGMOD International Conf. on Management of Data (SIGMOD 2003), San Diego, California, USA, June 2003, pp. 551–562 (2003) 19. Tao, Y., Xiao, X., Cheng, R.: Range search on multidimensional uncertain data. ACM Transactions on Database Systems 32(3), 15 (2007) 20. Kriegel, H.P., Kunath, P., Renz, M.: Probabilistic nearest-neighbor query on uncertain objects. In: Kotagiri, R., Radha Krishna, P., Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, pp. 337–348. Springer, Heidelberg (2007) 21. Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Closest pair queries in spatial databases. In: ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, Texas, USA, May 2000, pp. 189–200 (2000)
Window Update Patterns in Stream Operators

Kostas Patroumpas1 and Timos Sellis1,2

1 School of Electrical and Computer Engineering, National Technical University of Athens, Hellas
2 Institute for the Management of Information Systems, R.C. “Athena”, Hellas
{kpatro,timos}@dbnet.ece.ntua.gr
Abstract. Continuous queries applied over nonterminating data streams usually specify windows in order to obtain an evolving –yet restricted– set of tuples and thus provide timely results. Among other typical variants, sliding windows are mostly employed in stream processing engines and several advanced techniques have been suggested for their incremental evaluation. In this paper, we set out to study the existence of monotonic-related semantics in windowing constructs towards a more efficient maintenance of their changing contents. We investigate update patterns observed in common window variants as well as their impact on windowed adaptations of typical operators (like selection, join or aggregation), offering more insight towards design and implementation of stream processing mechanisms. Finally, to demonstrate its significance, this framework is validated for several windowed operations against streaming datasets with simulations at diverse arrival rates and window sizes.
1 Introduction
Continuous queries over data streams require near real-time responses in many mission-critical applications, such as telecom fraud detection, stock exchange bids or traffic surveillance systems. Stream processing must keep up with the fluctuating arrival rate of high-volume transient items, otherwise dropping unprocessed tuples is inevitable [17]. Therefore, it cannot be expected that fast in-memory computation could be performed over the entire stream, lest available system resources get rapidly exhausted. Windows are abstractions specified through distinctive properties inherent in the incoming data, aiming to provide finite stream portions for efficient query processing. In fact, users typically submit queries that specify a particular period of interest, such as “continuously identify the 10 stocks with the greatest drop during the past hour”. Typically, a sliding window is repetitively applied over the stream and returns its most recent items at any given time, e.g., items received in the past hour or the latest 1000 tuples. Several system prototypes have been implemented, such as AURORA [1], Borealis [2], Gigascope [12], STREAM [6], and TelegraphCQ [7], setting a foundation for Stream Processing Engines [17] and also suggesting certain window types for specifying suitable stream subsets. In this work, we conduct a careful study of the intrinsic window semantics with respect to monotonicity, extending a recent analysis on sliding windows
[10]. Usually, the contents of a window (known as its current state) change continuously over time, as fresh stream items get included while older elements are rejected (e.g., those having arrived more than an hour ago). Stream tuples may participate in many successive window states, but each current state does not necessarily subsume its preceding one, hence it appears that most typical windowing constructs are non-monotonic. Window specification [16] is based on the succession of streaming tuples, often associating suitable time indications to them. Thanks to such timestamps that control admission and expiration of tuples from the current state, windowing constructs show interesting repetitive patterns when refreshing their contents. For instance, an item that has just been inserted into a one-hour-long sliding window will surely be discarded from it after exactly one hour. Although this key observation cannot really eliminate the burden of non-monotonicity in windowing constructs, it may still facilitate their efficient state maintenance. In addition, such patterns prove helpful when evaluating windowed operators. We specifically examine the impact of window updates on the results of respective relational operators, such as joins or aggregations over windows, and we offer insight on their intrinsic semantics. We stress that our focus is on properties of individual operators, and not on entire query execution plans, which raise issues beyond the scope of this paper (such as operator scheduling, data propagation and query plan optimization). In that respect, this work is orthogonal to recent efforts towards the development of stream processing engines. Overall, we consider windows as first-class citizens in stream processing, fully integrated into a powerful set of operators that can express a wide range of continuous queries. In summary, the main contributions of our work are:
– We provide a classification of the most typical window variants with respect to their update patterns.
– We further suggest that the observed update behavior –although not strictly monotonic– may lead to nearly smooth maintenance of window states.
– We investigate the implications of window updates on operators commonly applied in continuous queries.
– We validate this framework against streaming datasets and demonstrate its benefits in the evaluation of windowed operators over evolving streams.
The remainder of this paper is organized as follows. In Section 2, we outline essential notions on data streams and windows. Section 3 develops monotonic-related semantics inherent in typical windowing constructs. Section 4 discusses the effect of window update patterns on relational operators. Results from an experimental validation are reported in Section 5. Section 6 surveys recent work on stream processing using windows, whereas Section 7 concludes the paper.
2 Fundamentals on Windows over Data Streams
In this section, we outline the main principles concerning window specification in continuous queries over data streams. In-depth discussion with algebraic notations and formal semantics can be found in our earlier work [16].
2.1 Abstract Semantics of Streams and Windows
Items of a data stream are usually considered as relational tuples with appropriate signs concerning either the time they were admitted to the system or their sequence number. In either case, such timestamps provide a unique ordering reference for all tuples. Given that multiple recordings arrive for processing continuously, time indication and item succession play a vital role, because data stream elements should be given for query evaluation in accordance with a strict ordering. All timestamps are drawn from a common Time Domain T, which is an ordered, infinite set of discrete time instants τ ∈ T. Apparently, T may be considered similar to the domain of natural numbers N. A typical assumption [3] is that a possibly large, but always bounded, number of tuples from stream S arrive for processing at each timestamp τ ∈ T. Hence, even if the entire stream may be unbounded, its instantiation is considered finite at any single time τ. We also assume that a timestamp attribute Aτ is attached to the schema of tuples and takes its ever-increasing values from T. Since a stream may be considered as an ordered sequence of data items evolving in time, its current contents at time τi ∈ T are all tuples accumulated so far, i.e., S(τi) = {s ∈ S : s.Aτ ≤ τi}. Given that query processing should be carried out in main memory so as to meet real-time requirements, only a restricted number of stream tuples can be maintained each time. It is exactly this finite portion of stream contents that can be considered at each evaluation; such an operation is abstracted through a windowing construct by setting specific constraints over time, number of tuples or other stream properties. A windowing attribute [14] is necessary for establishing order among stream items, and timestamps serve this purpose perfectly. In essence, at each time instant τi ∈ T, the current window state WE(S(τi)) makes up a temporary relation of a finite set of tuples qualifying to the constraints set by the window specification. In [16] we suggested a family of flexible scope functions for determining the exact structure of a window through: (i) its time-varying lower and upper bounds, (ii) the window extent (“size”), and (iii) its adjustment across time. Typically, such a function takes as arguments a time interval of interest or a specific tuple count and returns an ever-changing set of stream items as time evolves (e.g., all tuples received during the past hour or the most recent 100 tuples). More details can be found in [16].
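As a minimal illustration of these semantics (the names below are ours, not the paper's notation), a timestamped stream tuple and the finite instantiation S(τi) can be modeled as follows:

from dataclasses import dataclass

@dataclass
class StreamTuple:
    ts: int        # the timestamp attribute A_tau, drawn from the time domain T
    payload: tuple # the remaining relational attributes

def stream_contents(stream, tau_i):
    # S(tau_i): all tuples of the stream accumulated up to time instant tau_i.
    return [s for s in stream if s.ts <= tau_i]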
2.2 Commonly Used Window Types
Next, we summarize the basic features of typical tuple- and time-based windowing constructs that are mostly utilized in stream processing:
Count-based Windows. A typical count-based window returns, each time, the N most recent tuples of stream S (Fig. 1a). In practice, such a sliding fixed-count extent is accomplished by discarding the most remote item from the current window state so as to accommodate the newly arrived one, provided that each tuple is assigned a unique sequence number upon admission to the system [3].
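For illustration, a count-based sliding window of size N can be maintained with a bounded FIFO buffer; the sketch below is our own and simply relies on Python's deque, which drops the most remote item automatically once N tuples are buffered:

from collections import deque

class CountBasedWindow:
    # Keeps the N most recent tuples of the stream (FIFO eviction).
    def __init__(self, n):
        self.state = deque(maxlen=n)

    def insert(self, tup):
        self.state.append(tup)   # once N tuples are held, the most remote one is discarded

    def current_state(self):
        return list(self.state)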
Fig. 1. Window states for tuple-based and time-based variants at two instants τ and τ′ (or τ + β). (a) Count-based sliding window of size N tuples. (b) Partitioned window with N tuples at each partition. (c) Landmark window with lower bound τ. (d) Time-based sliding window of extent ω with step β. (e) Tumbling window with β = ω.
Partitioned Windows. This demultiplexing operator implies that the entire stream S is subdivided into several substreams S1, S2, . . . according to the values of certain grouping attributes. At each time instant τ ∈ T, the N most recent tuples are taken from each resulting substream Si as its contribution to the overall window state, which is derived as the union of partial results (Fig. 1b). For example, to identify current trends for a financial application, moving aggregates can be computed by obtaining the 1000 most recent stock readings from each particular sector (i.e., industry, banking, telecommunications etc.).
Landmark Windows. Such windows have their lower or upper bound fixed at a specific time instant τ (“landmark”), letting the other bound follow the evolution of time, e.g., “get all recordings collected after 10 p.m.” Thus, newly arriving tuples are simply appended to the window state without discarding existing ones (Fig. 1c). Potentially, this window state will be steadily expanding until either it gets explicitly revoked or all stream items are entirely consumed.
Time-based Sliding Windows. This construct is perhaps the most widely used in continuous queries over streams. The state of such a window is specified by a fixed-size temporal extent ω, usually the most recent time interval (e.g., “continuously return all stream items of the past hour”), by appending fresh tuples and discarding older ones on the basis of their time indications
(Fig. 1d). Stream arrival rate may fluctuate, so a different number of items may be returned at any given time. Typically, the window slides at a unit step β = 1 (i.e., every instant), leading to overlaps between successive states, although a multi-hop progression step 1 < β < ω is also an option.
Time-based Tumbling Windows. When the progression step β of a sliding window is greater than or equal to its temporal extent ω, then each state returns disjoint “batches” of tuples every β time instants (Fig. 1e). In case β = ω, a new window state is created as soon as the previous one gets discarded. Hence, each tuple is allowed to take part in calculations (e.g., aggregates) only once. For instance, average network traffic may be computed every 30 minutes, considering all packets transferred within the past half hour (β = 30 min).
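In the same illustrative spirit (this is not the authors' implementation), a time-based sliding window evicts tuples once their timestamps fall outside the extent ω and discloses its state every β instants; setting β = ω yields the tumbling behavior described above:

from collections import deque

class TimeBasedSlidingWindow:
    # Window of temporal extent omega, sliding with progression step beta.
    def __init__(self, omega, beta=1):
        self.omega, self.beta = omega, beta
        self.state = deque()

    def insert(self, tup, ts):
        self.state.append((ts, tup))

    def report(self, now):
        # Evict tuples that fell out of the extent (FIFO, by timestamp order).
        while self.state and self.state[0][0] <= now - self.omega:
            self.state.popleft()
        # Disclose the current state only at evaluation instants (every beta units).
        if now % self.beta == 0:
            return [tup for _, tup in self.state]
        return None

# With beta == omega the same construct behaves as a tumbling window:
# successive reports contain disjoint batches of tuples.
tumbling = TimeBasedSlidingWindow(omega=30, beta=30)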
3 Examining Monotonic-Related Window Semantics
Next, we consider the notion of monotonicity with respect to the results of continuous queries and examine several issues raised by the usage of windows. We then proceed to investigate the update patterns observed in typical window variants.

3.1 Monotonicity of Continuous Queries
A large class of continuous queries involves append-only results, thus not allowing any deletions or updates of answers already returned. This approach may cover some operators (like projection or selection) that act as simple filters over current stream tuples, just delivering any qualifying items to an output stream. For instance, sensor readings indicating temperatures ≥ 30°C can be immediately identified as they flow into the system; these items are returned as answers without further processing. Such stateless operators do not delete tuples already emitted in their output stream, hence their incremental evaluation is possible: no past items need be retained from the transient stream contents. Formally:

Definition 1. Continuous query Q over data stream S is called monotonic when ∀ τ1, τ2 ∈ T, τ1 ≤ τ2, if S(τ1) ⊆ S(τ2), then Q(S(τ1)) ⊆ Q(S(τ2)), where Q(S(τi)) denotes the results produced at time τi from qualifying tuples of the stream contents S(τi).

Unfortunately, not all continuous queries are monotonic. Some blocking operators, like aggregation or sorting, are unable to produce any results unless they consume the entire input. Besides, stateful operators, like join or intersection, should maintain tuples from their input streams, in order to guarantee that a fresh item from either stream would still be able to match an older tuple from the other one [19]. To remedy such intricacies, windows have been devised as a means of providing bounded datasets to query operators. However, when applying windows over streams, non-monotonic results are generally returned. For instance, when a sliding window is applied over a stream, some new tuples are included in the current state of that window, but its oldest
contents expire due to window movement (Fig. 1d). Expiration timestamps may be attached to operator results, denoting their validity interval. Another interesting technique for evaluating sliding window queries is to introduce negative tuples [9] as a means of cancelling previously emitted, but no longer valid results. In general, non-monotonic queries may have results that expire at unpredictable times; in case of an expiring item p, a negative tuple p− must be generated as an artificial copy of p, so as to signify its removal from the result. Nonetheless, this policy requires reengineering of the entire query evaluation process, since operators should be considerably enhanced in order to consume both positive (regular) and negative (cancellation) tuples. A meticulous study on monotonicity of query operators coupled with sliding windows has been presented in [10], also proposing a classification of such continuous queries according to their update patterns. Besides strictly monotonic and non-monotonic operators, two other categories were suggested in between: – In weakest non-monotonic operators, results get appended to and discarded from the output stream in a FIFO fashion. – Weak non-monotonic operators do not generally show a FIFO pattern in the way their results expire, but expiration times can be determined for all results without emitting negative tuples. Apparently, the actual contents of window states vary according to the inherent semantics of each particular variant. In fact, the actual state depends on whether window’s bounds change, as well as on their progression with time. By examining the current values of window specification at any given time, containment and expiration of timestamped tuples can be easily decided, thus providing the temporary window state. In the following, we develop the monotonic-related characterization even further, extending it to cover all typical window variants. 3.2
3.2 Update Patterns of Sliding Windows
Time-based sliding windows are characterized as weakest non-monotonic in [10]. Indeed, new tuples may be appended to the current window state, while pushing out some older items. But, depending on the progression step β between successive evaluations at times τi−1 and τi, common tuples may be found in the corresponding states (Fig. 1d). The following can be easily proven: Proposition 1. Between successive states of a time-based sliding window with β < ω, assuming that τi = τi−1 + β, it holds that: |WE(S(τi−1)) ∩ WE(S(τi))| ≥ 0. Despite such continuous change in window states, ordering among qualifying tuples is always preserved, since items are included into and excluded from the sliding window in a first-in-first-out (FIFO) fashion. If τ, τ′ respectively denote the original timestamp value and the expiration time of a stream tuple s, then: Proposition 2. For state WE(S(τi)) at time τi ∈ T of a sliding window over data stream S, it holds that: ∀ s1, s2 ∈ WE(S(τi)), s1.τ ≤ s2.τ ⇔ s1.τ′ ≤ s2.τ′.
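To make the FIFO pattern of Propositions 1 and 2 concrete, the following is a minimal C++ sketch (illustrative only, not the implementation of Section 5; the Tuple structure and field names are assumptions) of a time-based sliding window state kept in a double-ended queue: tuples are admitted at the tail and expire from the head, so admission and expiration orders coincide.

    #include <deque>
    #include <cstdint>

    struct Tuple {
        int64_t ts;      // original timestamp tau
        double  value;   // payload (illustrative)
    };

    // Time-based sliding window of extent omega (in time units).
    // The state is a FIFO queue: items enter at the back and expire from the front.
    class SlidingWindow {
    public:
        explicit SlidingWindow(int64_t omega) : omega_(omega) {}

        // Called for every incoming stream tuple, in timestamp order.
        void insert(const Tuple& t) {
            state_.push_back(t);
            expire(t.ts);                    // slide the window bounds up to time t.ts
        }

        // Drop all tuples whose expiration time (ts + omega) has passed.
        void expire(int64_t now) {
            while (!state_.empty() && state_.front().ts + omega_ <= now)
                state_.pop_front();          // FIFO expiration (weakest non-monotonic)
        }

        const std::deque<Tuple>& contents() const { return state_; }

    private:
        int64_t omega_;
        std::deque<Tuple> state_;
    };

A count-based variant would behave analogously, popping the front whenever the deque exceeds N elements, which is why both variants share the same update pattern.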
Similar remarks are valid for count-based variants, given that admission timestamps and expiration times adhere to identical orderings for all tuples. Since the window's extent is fixed (expressed either as N tuples or as ω time units), the ordering of expiration times is actually equivalent to shifting (by N or ω, respectively) the original succession of timestamps attached to streaming items.
3.3 Update Patterns of Tumbling Windows
As prescribed by tumbling window semantics, every state ceases to exist in its entirety upon initiation of its succeeding one (Fig. 1e). Hence: Proposition 3. No common items occur between successive states of a tumbling window at time instants τi−1 and τi: WE(S(τi−1)) ∩ WE(S(τi)) = ∅. Despite appearances, discussion about monotonicity for a tumbling window cannot be ruled out altogether. Indeed, expirations occur every β time units and are known beforehand for each state and all their tuples. Within a state, though, tuples continue to pile up without removing any previous item. Therefore, the contents of a single state grow monotonically until their simultaneous expiration. Meanwhile, in terms of efficient maintenance, a tumbling construct may be loosely considered as a sliding one with the same extent; its bounds shift in unison at each time unit (e.g., every second or at each new timestamp value). The only difference is that a tumbling window discloses its state periodically, as specified by its progression step β. A practical evaluation policy would be to remove a tuple participating in the current state in the same order it was originally inserted, emulating a sliding window pattern with some kind of deferred elimination of tuples in plain FIFO fashion. Hence, as autonomous operators, tumbling windows may be considered weakest non-monotonic, due to their intrastate monotonicity and implicit resemblance to sliding windows.
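The deferred, all-at-once expiration of a tumbling window can be sketched as follows (again an illustrative fragment with assumed tuple names and a time origin at zero, not the paper's code): tuples accumulate monotonically within the current state, and the entire state is disclosed and discarded whenever a β-boundary is crossed.

    #include <vector>
    #include <cstdint>

    struct Tuple { int64_t ts; double value; };

    // Tumbling window with progression step beta: intra-state growth is monotonic,
    // and the whole state expires simultaneously every beta time units.
    class TumblingWindow {
    public:
        explicit TumblingWindow(int64_t beta) : beta_(beta), boundary_(beta) {}

        // Returns the completed state (possibly empty) whenever a boundary is crossed.
        std::vector<Tuple> insert(const Tuple& t) {
            std::vector<Tuple> closed;
            if (t.ts >= boundary_) {              // previous state expires in its entirety
                closed.swap(state_);
                while (t.ts >= boundary_) boundary_ += beta_;
            }
            state_.push_back(t);                  // append-only within the current state
            return closed;
        }

    private:
        int64_t beta_;
        int64_t boundary_;                        // end of the current beta-sized period
        std::vector<Tuple> state_;
    };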
3.4 Update Patterns of Partitioned Windows
The partitioned variant essentially applies a sliding construct against each of the partial substreams S1, S2, . . . into which the original stream S is demultiplexed. Therefore, the contents in each of the N-sized constituent partitions change in FIFO order. Still, it should not be expected that the obtained tuples change at the same regular rate for each partition, because it may occur that some partition Si has much older tuples compared to another partition Sj. Some combinations of values (e.g., those defining Sj) on the grouping attributes may be observed frequently, while some others less often (e.g., for Si), depending on the actual patterns detected in the incoming tuples (Fig. 1b). It is also possible that during some time interval no fresh tuples qualify for a partition, which still retains its state as created from items received long ago. Since the overall window state is always obtained from the union of all partitions, it turns out that the expiration order of items does not generally coincide with their
succession in timestamp values, nor even with their insertion order into the partitioned window. Yet, the expiration time of each tuple can be determined with respect to the partition it belongs to. So, partitioned windows can be considered weak non-monotonic, and do not necessitate generation of negative tuples.
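A partitioned window can be maintained as one FIFO buffer per group, e.g., as in the following sketch (illustrative code with assumed attribute names, not the actual framework): expiration order is known within each partition but not across partitions, which is precisely the weak non-monotonic behavior just described.

    #include <deque>
    #include <string>
    #include <unordered_map>
    #include <cstdint>

    struct Tuple { int64_t ts; std::string key; double value; };

    // Partitioned window: keep the N most recent tuples per value of the grouping
    // attribute; each partition behaves like a count-based sliding window.
    class PartitionedWindow {
    public:
        explicit PartitionedWindow(std::size_t n) : n_(n) {}

        void insert(const Tuple& t) {
            auto& part = partitions_[t.key];
            part.push_back(t);                   // FIFO within the partition
            if (part.size() > n_)
                part.pop_front();                // expiration determined per partition only
        }

        // The overall window state is the union of all partitions.
        const std::unordered_map<std::string, std::deque<Tuple>>& state() const {
            return partitions_;
        }

    private:
        std::size_t n_;
        std::unordered_map<std::string, std::deque<Tuple>> partitions_;
    };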
3.5 Update Patterns of Landmark Windows
Contrary to other types, lower- and upper-bounded landmark windows are strictly monotonic, since no tuple is ever removed from any window state. Hence: Proposition 4. For a landmark window with a lower bound at time instant τl ∈ T, it holds that: ∀ τ1, τ2 ∈ T, τl ≤ τ1 ≤ τ2, WE(S(τ1)) ⊆ WE(S(τ2)). This means that the window state at an arbitrary time instant clearly subsumes its previous ones, practically containing all their tuples. The situation is similar with upper-bounded landmarks. In the trivial case that both bounds remain unchanged over time, the resulting fixed-band windows [16] are also monotonic.
4 Impact on Windowed Operators
When windows are intertwined with typical query operators, each variant presents its own challenges with respect to evaluation. The crucial differentiation from relational operators is that results must be continuously renewed, keeping pace with any changes in window state(s). For example, a windowed join should check for matching tuples between the windows applied over its source streams [9,16]. Next, we investigate the repercussions that arise on typical operators due to the monotonic-related semantics of their associated windowing constructs. We consider the most common windowed operators [16]: projection (πL^W) of attributes listed in L, selection (σF^W) with conjunctive criteria F, duplicate elimination (δ^W), binary join (⋈^W), union (∪^W), difference (−^W), and aggregation (γL^f,W) with function f (e.g., MAX) based on grouping attributes L. Table 1 summarizes the resulting classification, as derived from the following discussion for each window variant.
4.1 Operators over Sliding Windows
Summarizing the argumentation developed in [10], selection, projection, and merge-union over sliding windows are easily proven weakest non-monotonic, since expiration of tuples in the answer is decided according to a FIFO order. When sliding windows are combined with join, aggregation, and duplicate elimination, these operators yield weak non-monotonic results. Indeed, joined tuples do not expire in the same order they were produced, because their validity depends on the original timestamp values of their constituent items. For example, a joined result p could expire prior to another tuple q produced earlier, in case one of the tuples that generated p expires from the window state. In aggregation, some groups may be updated more often than others. As for duplicate elimination, a tuple r may be appended to the answer long after the inclusion of that r in the window state, due to the expiration of an earlier item r′ (a duplicate of r) from the current result.
Table 1. Monotonic-related classification of typical windowed operators

Window variant   Monotonic                            Weakest non-monotonic   Weak non-monotonic                   Non-monotonic
sliding          –                                    σF^W, πL^W, ∪^W         ⋈^W, δ^W, γL^f,W                     −^W
tumbling         –                                    –                       σF^W, πL^W, ∪^W, δ^W, ⋈^W, γL^f,W    −^W
partitioned      –                                    –                       σF^W, πL^W, ∪^W, δ^W, ⋈^W, γL^f,W    −^W
landmark         σF^W, πL^W, ∪^W, δ^W, ⋈^W, γL^f,W    –                       –                                    −^W
Only difference (negation) was shown to be strictly non-monotonic due to explicit deletions. Note that these characterizations are valid for operators associated with either count-based or time-based sliding constructs.
4.2 Operators over Tumbling Windows
As already mentioned, the state of a tumbling window is made periodically available, so operators apply their semantics over this temporarily “frozen” relation. Since no state shares any tuples with its predecessor, it turns out that results produced by such windowed operations will be disjoint for successive evaluations. Nonetheless, a more flexible policy may be applied when evaluating tumbling-windowed operators. Every β time units, a new window state is created and it is initially empty. Subsequently, though, stream tuples are only appended, without deleting any items. As explained in Section 3.3, this pattern is clearly monotonic, but lasts for exactly β time units, when a new state takes its place. Operators like projection, selection, join, and duplicate elimination could take advantage of this pattern and produce results incrementally during each state. No FIFO pattern is observed even for simple filtering (e.g., selection), because when a state terminates all current results are invalidated at once. However, due to the fixed progression step β, the expiration time of any resulting tuple can be easily predicted; such an answer remains valid until the end of the current β-sized period. Aggregation can also be handled eagerly without use of negative tuples, by properly updating groups when a new item arrives in the current state, as a means of defeating its blocking behavior. Thus, most operators on tumbling windows are weak non-monotonic thanks to the periodicity in expiration times. In contrast, difference between two tumbling windows W1 and W2 is proven strictly non-monotonic. Indeed, as soon as a new tuple s arrives in W1 and is not present in W2, it is added to the result. If a similar item s′ later appears in W2 and matches the existing s anywhere in W1, the latter item must be removed from the answer; but the series of such removals does not generally coincide with the insertion order of tuples in the result. Note that the case can be easily generalized to multisets, by properly adjusting tuple multiplicities. In general, this problematic behavior of difference can be attributed to its intrinsic semantics, which makes negative tuples indispensable for all its windowed versions: Proposition 5. Difference between any window variants is non-monotonic.
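To illustrate why difference makes negative tuples indispensable, here is a deliberately simplified set-semantics sketch (not from the paper; multiplicities and window expiration are ignored, and the string key is an assumption): a result emitted when its key was absent from W2 must later be cancelled with an artificial negative tuple once a matching item shows up in W2.

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Result tuples carry a sign: positive on insertion, negative on cancellation.
    struct Result { std::string key; bool positive; };

    // Simplified set-semantics difference W1 - W2 over string-keyed tuples.
    class WindowedDifference {
    public:
        std::vector<Result> insertLeft(const std::string& k) {    // new tuple in W1
            w1_.insert(k);
            if (!w2_.count(k) && emitted_.insert(k).second)
                return { {k, true} };                              // append to the answer
            return {};
        }

        std::vector<Result> insertRight(const std::string& k) {   // new tuple in W2
            w2_.insert(k);
            if (emitted_.erase(k))                                 // was k in the answer?
                return { {k, false} };                             // emit negative tuple k-
            return {};
        }

    private:
        std::unordered_set<std::string> w1_, w2_, emitted_;
    };

The cancellations happen whenever a matching right-hand item arrives, not in the order the positive results were produced, which is exactly the non-monotonic behavior of Proposition 5.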
4.3 Operators over Partitioned Windows
Partitioned windows, which are weak non-monotonic as autonomous operators, trivially retain this property when combined with simple filters like selection and projection, or when unifying states with merge-union. The same behavior is observed for duplicate elimination, since any duplicate tuples will always appear in identical partitions, hence their expiration can be predicted. When joining two partitioned constructs, the answer is weak non-monotonic, since a resulting tuple will be removed as soon as at least one of its constituent items ceases to exist in a partition, a fact that can be clearly determined without generating negative tuples. Aggregation is also weak non-monotonic: of course, some groups may get updated more frequently, exactly like some of their underlying partitions; but when a tuple perishes from a group, it must have been expelled from its partition as well, and this latter expiration order is known. As already mentioned, difference is strictly non-monotonic.
4.4 Operators over Landmark Windows
Since landmark constructs are always monotonic, adjoined operators like selection, projection, merge-union, and duplicate elimination over them can only produce append-only output tuples, since no results can ever expire. The situation is similar for landmark-windowed join and aggregation, as they also prove consistently monotonic. Specifically, in the case of a join, a new tuple x in either window may match an older item y from the other window, but the joined result will remain valid forever. As for aggregation, it is possible that a group could receive more items, but as soon as a tuple gets assigned to a group, it shall never be erased from it; without further complications, function f (MAX, SUM, etc.) gets reapplied over that group. On the contrary, difference between two landmark windows is strictly non-monotonic, because expiration of output tuples can occur at arbitrary times, whenever a match is found between the growing contents of both input states.
4.5 Binary Operators over Diverse Windows
Up to this point, we assumed that both inputs to binary operators (union, join, difference) were of similar window types, although they may differ in their parametric properties (e.g., temporal extent, progression step). However, it may occasionally occur that different window types are specified in a query; for example, consider a join between a landmark and a tuple-based sliding window, such as “compare stock prices received after 10 a.m. [landmark] against the 1000 most recent bond sales [sliding count-based], provided that matching pairs affect the same companies”. Although many combinations may arise, the following can be easily verified: Proposition 6. The more strongly non-monotonic a window is, the more it dominates the behavior of the binary operator. Thus, the query above is weak non-monotonic, since the join involves a sliding window that influences the result more than its monotonic landmark counterpart.
5 Experimental Validation
In this section, we report on our effort to validate the semantics framework developed, with particular emphasis on window update patterns. Of course, such a framework lacks proper optimization and fine-tuning and certainly cannot be compared to full-fledged stream processing engines [2,6,7]. As ours is not a complete stream processing engine and we cannot evaluate the performance of composite continuous queries according to the Linear Road Benchmark [4], we focus on the behavior of individual operators. We stress that our main concern was to take advantage of monotonic-related patterns and maintain efficiently the state of autonomous operators, rather than execute complete query plans involving a series of interconnected operators. Accordingly, windowed adaptations [16] of relational operators were implemented as separate classes in C++ for all variants described in Section 2.2. Operators were abstracted as iterators consuming tuples from their input queue and feeding their output queue with results. Queues can generally serve as a means for inter-operator connectivity, but also for incremental availability of ordered results. Hash tables were maintained for each window state, in order to speed up retrieval and updates. Experiments were performed on a Pentium IV 2.54 GHz CPU with 512 MB of main memory running GNU/Linux. We generated two synthetic datasets, each with a total of 1,000,000 tuples, and we supplied them as separate streams to our processing framework at diverse arrival rates. As a rule in data stream processing, we adhere to online in-memory computation, excluding the use of the hard disk for performance reasons. Input tuples were always received in timestamp order, so stream imperfections such as delayed or out-of-order items cannot occur [17]. A valuable feature of our framework is the practically negligible cost of maintaining autonomous window states, thanks to their intrinsic update patterns. Each such construct is implemented as a virtual queue over the input stream, so no tuples need be copied or deleted when window bounds change. Even partitioned windows, whose state may involve detached tuples rather than cohesive stream chunks (Fig. 1b), incur very low overhead, because each incoming item affects just the head and tail of a single partition. But it is the combination of windows with relational operators that really matters. For brevity, performance results are only shown for the more demanding operators (join, aggregation, duplicate elimination) involving count-based sliding and landmark windows, each one executed in isolation. All other window types demonstrate behavior similar to count-based ones due to their intrinsic sliding nature. As plotted in Fig. 2, sliding window joins are particularly time-consuming, compared to the other two operators applied over similar-sized windows. This is to be expected, because the more items in either window state, the more opportunities for matching tuples. In contrast, execution times for aggregation and duplicate elimination are low and remain almost stable, irrespective of window extent. For these windowed operators, only a local arrangement (by adjusting a group or replacing a tuple in the current state) suffices to refresh results.
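The operator abstraction used in these experiments might look roughly like the interface below (an illustrative sketch rather than the actual C++ classes of the framework): each operator pulls tuples from an input queue, keeps its window state in a hash table for fast lookups, and pushes results to an output queue that can feed another operator.

    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <cstdint>

    struct Tuple { int64_t ts; std::string key; double value; };

    // Generic windowed operator with an iterator-style pull interface over queues.
    class WindowedOperator {
    public:
        virtual ~WindowedOperator() = default;

        // Consume all currently available input and produce any new results.
        void run(std::queue<Tuple>& in, std::queue<Tuple>& out) {
            while (!in.empty()) {
                process(in.front(), out);
                in.pop();
            }
        }

    protected:
        // Operator-specific logic; state_ stands in for the per-window hash table.
        virtual void process(const Tuple& t, std::queue<Tuple>& out) = 0;
        std::unordered_map<std::string, Tuple> state_;
    };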
[Fig. 2. Performance of sliding operators: execution time (sec) of join, aggregation, and duplicate elimination vs. window extent (K tuples).]
[Fig. 3. Sliding joins at diverse rates: execution time (sec) vs. window extent (K tuples) for arrival rates of 10K, 20K, 50K, and 100K tuples/sec.]
We simulated a wide range of arrival rates for input tuples, in order to validate operator robustness when coping with fluctuating streams. Increased arrival rates of streaming items have a great impact on the performance of costly operators like join. As shown in Fig. 3, at various input rates (up to 100,000 tuples/sec), joins have almost linearly rising execution time when greater extents are specified for their sliding windows, since they potentially generate more matches between states. This phenomenon is aggravated for landmark window joins (Fig. 4) that entail a growing state: indeed, as more items get gradually accumulated, more and more matching tuples may exist. Operator state maintenance incurs additional overhead. Landmark-windowed operators show a linearly increasing memory footprint as more tuples get included into their state. As depicted in Fig. 5, memory cost depends on operator type; join is very demanding, because it requires state maintenance for both its input windows. Of course, memory consumption naturally increases as landmark window size steadily grows, but it still remains at reasonable levels (less than 8 Mbytes for 1,000,000 items). The situation with aggregates is similar, but at almost half the cost, since only one hash table needs to be maintained. In contrast, duplicates get discarded immediately from the state, so their memory requirements
[Fig. 4. Evaluation of landmark operators: execution time (sec) of join, aggregation, and duplicate elimination vs. window extent (K tuples).]
[Fig. 5. Space cost of landmark operators: memory (Mbytes) of join, aggregation, and duplicate elimination vs. window extent (K tuples).]
are almost negligible. Not surprisingly, with window types other than landmark ones, we observed that memory nearly stabilizes at a limited amount that depends only on window extents, assuming a constant input rate (results are trivial and thus omitted). Overall, inspection of successive operator states empirically verified that window and operator semantics perform exactly as expected and always get updated in the prescribed manner under diverse stream rates and window sizes.
6 Related Work
The data management community has devoted extensive research effort to data streams over the last decade. Among the most prominent prototypes, AURORA [1] and its distributed version Borealis [2] are data-flow-oriented systems and utilize sliding and tumbling windows for aggregation, as well as for computing joins and sorted output results. STREAM [6] offers a Continuous Query Language (CQL) equipped with time-based and tuple-based sliding windows with resource sharing capabilities [5], as well as partitioned windows adhering to the SQL:1999 standard [3]. As for TelegraphCQ [7], only time-based sliding windows are available in its StreaQuel language, but support for a range of windowing constructs (landmark, tumbling) could also be achieved. No windowing constructs are explicitly specified in Gigascope [12], but their semantics are indirectly expressed as constraints involving monotonically increasing timestamps of input streams. Punctuations [19,20] were introduced as an unblocking technique instead of windowing constructs, by suitably embedding special signs in the stream that denote the end of a subset of data. Gigascope regularly generates punctuations (“heartbeats”) in order to unblock operators in query plans. Taking advantage of research foundations from academia, several industrial products have also begun to offer stream processing capabilities. The StreamBase platform [18] builds on the experience of the Aurora and Borealis prototypes and provides real-time event processing, offering a mature StreamSQL language for specifying continuous queries; their evaluation is performed on a tuple-driven basis, according to the arrival order of streaming items. In contrast, Oracle's Complex Event Processor [15] uses an extension of CQL that adheres to a time-driven scheme, where window states change according to the timestamp values of incoming tuples. Coral8 [8], in its Continuous Computation Language (CCL), employs stream manipulation according to both tuple- and time-driven approaches and utilizes windowing constructs that can be shared by multiple queries. Laying the foundations for a streaming SQL standard, a hybrid data model was recently proposed in [11], which attempts to bridge the gap between tuple-driven and time-driven semantics. The underlying concept is that evaluation emanates from the arrival of a batch of tuples, which can be either items of identical timestamp value or distinctly ordered tuples. Its novel stream-to-stream operator SPREAD can provide fine-grained control over ordering relationships among tuples, such that the conflicting demands of simultaneity and ordering can both be captured.
Regarding windowed stream processing, the interesting idea of negative tuples [9] was suggested for evaluating sliding window queries, as a means of cancelling results that are no longer valid, at the expense of drastically revised operator semantics. Several optimization techniques were applied for reducing the overhead of doubling the amount of tuples processed and for avoiding output delays. Besides, a detailed examination of stream aggregates was proposed in [14]; under this interpretation, windowed aggregation reduces to a simple relational one. A temporal stream algebra [13] covers sliding and fixed windows only, distinguishing logical and physical operator levels for query specification and evaluation. To the best of our knowledge, update patterns for sliding window operators have been analyzed thus far only in [10], aiming to exploit such properties in operator evaluation within physical execution plans. This important work is perhaps the closest in spirit to our own, also introducing a suitable classification of operators with respect to monotonicity. However, the objective in [10] is entirely focused on time-based sliding windows. In this paper, we generalize similar monotonic-related patterns to all frequently used window variants, also investigating their impact on typical query operators. Our approach is based on rigorous semantics of a rich set of window types [16], in an attempt to guide operator evaluation through sound algebraic constructs and not just heuristics mainly geared towards efficiency.
7 Concluding Remarks
In this paper we exhibit the significance of windows in continuous query evaluation, by analyzing several update patterns intrinsically tied to window semantics. Although most such constructs do not refresh their state in a monotonic fashion, we show that opportunities still exist for significant savings in their efficient maintenance, to the benefit of advanced stream processing. Further improvement is possible with respect to shared evaluation and query optimization in the presence of overlapping window states. In line with efforts towards the foundation of a stream algebra, we also plan to examine properties concerning multiple windows and query rewriting rules in composite execution plans.
References
1. Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a New Model and Architecture for Data Stream Management. VLDB Journal 12(2), 120–139 (2003)
2. Abadi, D.J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A.S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.: The Design of the Borealis Stream Processing Engine. In: CIDR (January 2005)
3. Arasu, A., Babu, S., Widom, J.: The CQL Continuous Query Language: Semantic Foundations and Query Execution. VLDB Journal 15(2), 121–142 (2006)
4. Arasu, A., Cherniack, M., Galvez, E., Maier, D., Maskey, A., Ryvkina, E., Stonebraker, M., Tibbetts, R.: Linear Road: A Stream Data Management Benchmark. In: VLDB, September 2004, pp. 480–491 (2004)
5. Arasu, A., Widom, J.: Resource Sharing in Continuous Sliding-Window Aggregates. In: VLDB, September 2004, pp. 336–347 (2004)
6. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and Issues in Data Stream Systems. In: ACM PODS, May 2002, pp. 1–16 (2002)
7. Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M.J., Hellerstein, J.M., Hong, W., Krishnamurthy, S., Madden, S.R., Raman, V., Reiss, F., Shah, M.A.: TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In: CIDR (January 2003)
8. Coral8 Inc. Continuous Computation Language (CCL) Reference. Documentation (2008), http://www.coral8.com/WebHelp/coral8_documentation.htm
9. Ghanem, T., Hammad, M., Mokbel, M., Aref, W., Elmagarmid, A.: Incremental Evaluation of Sliding-Window Queries over Data Streams. IEEE Transactions on Knowledge and Data Engineering 19(1), 57–72 (2007)
10. Golab, L., Tamer Özsu, M.: Update-Pattern-Aware Modeling and Processing of Continuous Queries. In: ACM SIGMOD, June 2005, pp. 658–669 (2005)
11. Jain, N., Mishra, S., Srinivasan, A., Gehrke, J., Widom, J., Balakrishnan, H., Çetintemel, U., Cherniack, M., Tibbetts, R., Zdonik, S.: Towards a Streaming SQL Standard. In: VLDB, August 2008, pp. 1379–1390 (2008)
12. Johnson, T., Muthukrishnan, S., Shkapenyuk, V., Spatscheck, O.: A Heartbeat Mechanism and its Application in Gigascope. In: VLDB, September 2005, pp. 1079–1088 (2005)
13. Krämer, J., Seeger, B.: A Temporal Foundation for Continuous Queries over Data Streams. In: COMAD, January 2005, pp. 70–82 (2005)
14. Li, J., Maier, D., Tufte, K., Papadimos, V., Tucker, P.: Semantics and Evaluation Techniques for Window Aggregates in Data Streams. In: ACM SIGMOD, June 2005, pp. 311–322 (2005)
15. Oracle Inc. Complex Event Processing in the Real World. White paper (September 2007), http://www.oracle.com/technologies/soa/docs/oracle-complex-eventprocessing.pdf
16. Patroumpas, K., Sellis, T.: Window Specification over Data Streams. In: Grust, T., Höpfner, H., Illarramendi, A., Jablonski, S., Mesiti, M., Müller, S., Patranjan, P.L., Sattler, K.-U., Spiliopoulou, M., Wijsen, J. (eds.) EDBT 2006. LNCS, vol. 4254, pp. 445–464. Springer, Heidelberg (2006)
17. Stonebraker, M., Çetintemel, U., Zdonik, S.: The 8 Requirements of Real-Time Stream Processing. ACM SIGMOD Record 34(4), 42–47 (2005)
18. StreamBase Systems. StreamSQL Guide. Documentation (2009), http://www.streambase.com/developers/docs/sb62/pdf/streamsql.pdf
19. Tucker, P., Maier, D., Sheard, T., Fegaras, L.: Exploiting Punctuation Semantics in Continuous Data Streams. IEEE Transactions on Knowledge and Data Engineering 15(3), 555–568 (2003)
20. Tucker, P., Maier, D., Sheard, T., Stephens, P.: Using Punctuation Schemes to Characterize Strategies for Querying over Data Streams. IEEE Transactions on Knowledge and Data Engineering 19(9), 1227–1240 (2007)
Systematic Exploration of Efficient Query Plans for Automated Database Restructuring
Maxim Kormilitsin¹, Rada Chirkova¹, Yahya Fathi², and Matthias Stallmann¹
¹ Computer Science Department, NC State University, Raleigh, NC 27695 USA, [email protected], [email protected], matt [email protected]
² Operations Research Program, NC State University, Raleigh, NC 27695 USA, [email protected]
Abstract. We consider the problem of selecting views and indexes that minimize the evaluation costs of the important queries under an upper bound on the disk space available for storing the views/indexes selected to be materialized. We propose a novel end-to-end approach that focuses on systematic exploration of plans for evaluating the queries. Specifically, we propose a framework (architecture) and algorithms that enable selection of views/indexes that contribute to the most efficient plans for the input queries, subject to the space bound. We present strong optimality guarantees on our architecture. Our algorithms search for sets of competitive plans for queries expressed in the language of conjunctive queries with arithmetic comparisons. This language captures the full expressive power of SQL select-project-join queries, which are common in practical database systems. Our experimental results demonstrate the competitiveness and scalability of our approach.
1 Introduction
Selecting and precomputing indexes and materialized views, with the goal of improving query-processing performance, is an important part of database-performance tuning. The significant complexity of this view- and index-selection problem may result in high total cost of ownership for database systems. In recognition of this challenge, software tools have been deployed in commercial DBMSs, including Microsoft SQL Server [1, 2, 3, 4] and DB2 [5, 6, 7], for suggesting to the database administrator views and indexes that would benefit the evaluation efficiency of representative workloads of frequent and important queries. In this paper we propose a novel end-to-end approach that addresses the above view- and index-selection problem. Our specific optimization problem, which we refer to as ADR (for Automated Database Restructuring [8]), is as follows: Given a set of frequent and important queries, generate a set of evaluation plans that provides the lowest evaluation costs for the input queries on the given database. Each plan requires the materialization of a set of views and/or indexes, and cannot be executed unless all of the required views and indexes are materialized. The total size of the materialized views and indexes must not exceed a given space (disk) bound. This version of the view- and index-selection problem is
NP-hard [9] and is difficult to solve optimally even when the set of indexes and views mentioned in the input query plans is small. In dealing with the important and extensively studied problem of view and index selection under a storage bound, the novelty of our approach is twofold. First, our problem statement concentrates on finding efficient (possibly view- and index-based) query-evaluation plans for the input queries, as opposed to finding individual views or indexes without regard to the efficiency of the plans that they could contribute to. As such, unlike previous approaches (cf., e.g., [2, 3]), our approach quantifies the “benefits” of views or indexes to be materialized precisely by the evaluation costs of the plans they can participate in. Second, in our approach we focus on systematic exploration of plans for evaluating the given frequent and important queries. Specifically, we propose a framework (architecture) and algorithms that enable selection of views and indexes that contribute to the most efficient plans for the input queries, subject to the space bound. Our generic architecture has two stages: (1) A search for sets of competitive plans for the input queries, and (2) Selection of one efficient plan for each input query. The output (in the view/index sense¹) is guaranteed to satisfy the input space bound. We present strong optimality guarantees on this architecture. Notably, the plans in the inputs to and outputs of the second stage are formulated as sets of IDs of views/indexes whose materialization would permit evaluation of the plans in the database. As such, stage one of our architecture encapsulates all the problem-specific details, such as the query language for the input queries or for the views/rewritings allowed for consideration in constructing a solution. The specific algorithms we propose in this paper search for sets of competitive plans for queries expressed in the language of conjunctive queries with arithmetic comparisons (CQACs); this language captures the full expressive power of SQL select-project-join queries, which are common in practical database systems. Our algorithms generate CQAC query-evaluation plans that use CQAC views. Our experimental results demonstrate that (a) our approach outperforms that of [3] when we use our algorithms of Sect. 4; and (b) a CPLEX [10] implementation of stage two of our architecture is scalable to very large problem inputs.
Related Work
It is known that in selecting views or indexes that would improve query-processing performance, it is computationally hard to produce solutions that would guarantee user-specified quality (in particular, globally optimum solutions) with respect to all potentially beneficial indexes and views. In general, reports on past approaches, including those for Microsoft SQL Server [1, 2, 3, 4] and DB2 [5, 6, 7], concentrate on experimental demonstrations of the quality of their solutions. A notable exception is the line of work in [11, 12, 13]. Unfortunately, in 1999 Karloff and colleagues [14] disproved the strong performance bounds of these algorithms, by showing that the underlying approach of [13] cannot provide the stated worst-case performance ratios unless P=NP.
¹ To improve readability, in the remainder of the paper we focus on view selection. Extension to index selection is straightforward; see Sect. 6 for a discussion.
Please see [15] for a detailed
discussion of past work that centers on OLAP solutions, including [11, 13]. In this paper we focus on the problem of view and index selection for query, view, and index classes that are typical in a wide range of practical (either OLTP or OLAP) database systems, rather than limiting ourselves to just OLAP systems. In 2000, [2] introduced an end-to-end framework for selection of views and indexes in relational database systems; the approach is based partly on the authors’ previous work on index selection [16]. We have shown [17] that it is possible to improve on the solution quality of the heuristic algorithm of [2]. In this paper we focus on experimental comparisons of the contributions of this current paper with the approach of [3], which builds on [2] while focusing on a different way of both defining and selecting indexes and views. Our methods can also be combined with the approaches of [4, 18], which consider the problem of evolving the current physical database design to meet new requirements. Papers [19, 20] by Roy and colleagues report on projects in multiquery optimization. [20] introduced heuristic algorithms for improving query-execution costs in this context, by coming up with query-evaluation plans that reuse certain common subexpressions. [19] developed an heuristic approach to finding plans for maintenance of a set of given materialized views, by exploiting common subexpressions between different view-maintenance plans. The focus of [19] is on efficient maintenance of an existing configuration of views, while we construct optimal configurations of views and indexes to ensure efficient execution of the given queries, by systematic exploration of view- and index-based plans. Bruno and colleagues [21] proposed an algorithm that continuously modifies the physical database design in reaction to query-workload changes. [22] introduced a language for specifying additional constraints on the database schema. The framework proposed in [22] allows a database administrator to incorporate the knowledge of the constraints into the tuning process. Other related work includes genetic algorithms – see the full version [23] for a detailed discussion.
2 Preliminaries
Our optimization problem ADR (for Automated Database Restructuring [8]) is as follows: Given a set of frequent and important queries on a relational database, generate a set of evaluation plans that provides the lowest evaluation costs for the queries on the database. Each plan requires the materialization of a set of views, and cannot be executed unless all of the required views are materialized. The total size of materialized views must not exceed a given space (disk) bound. Formally, an instance of the problem ADR is a tuple (Q, B, S, L1 , L2 ). Here, Q is a workload of n ∈ N queries, the natural number B represents the input storage limit in bytes, and S is the database statistics for the database, call it D, on which the queries in Q are to be executed. Further, L1 is the language of views that can be considered in solving the instance, and L2 is the language of rewritings represented by the plans in the solution for this instance. The problem output is a set P = {p1 , . . . , pn } of n evaluation plans, one plan pi for each query qi in Q, such that each plan pi (a) is associated with an equivalent rewriting of
qi in query language L2, and (b) can reference only stored relations of D and views defined on D in query language L1. Finally, (1) for the sum s of the sizes (in bytes) of the tables for all the views mentioned in the set of plans P, it holds that s ≤ B, and (2) for the costs c(pi) of evaluating the plans pi ∈ P, the sum Σ_{i=1}^{n} c(pi) is minimal among all sets of plans whose views satisfy condition (1). We now provide the details on the database statistics S in the problem input. Access to the database statistics is not handled directly by the algorithms in our architecture. Rather, the algorithms assume availability (and use the standard optimizer APIs) of a module for viewset simulation and evaluation-cost estimation for view-based query plans. That is, we assume the availability of a “what-if” optimizer similar to those used in the work (e.g., [2, 3]) on view/index selection for Microsoft SQL Server. Observe that the use of such a black-box module in our architecture guarantees that the plans in the ADR problem outputs are going to be considered by the actual (“target”) optimizer of the database system once the views mentioned in the plans are materialized. We assume that the target optimizer in question can perform query rewriting using views (see, e.g., [24, 25]) and that the what-if optimizer module used in our architecture uses the same algorithms as the target optimizer in the database system. For the algorithms that we introduce in Sect. 4, in the above problem inputs we restrict the language of input queries, as well as each of the languages L1 and L2, to express SQL queries that are single-block select-project-join expressions whose WHERE clause consists of a conjunction of simple predicates. (This language corresponds to conjunctive queries with arithmetic comparisons, CQACs; see, e.g., [26].) Further, we make the common assumption (see, e.g., [27]) of no cross products in query- or view-evaluation plans. Finally, in the language of input queries and in the language L1 we restrict all queries to be chain queries. Sect. 2 in the full version [23] of this paper contains a motivating example that clarifies the problem statement, as well as our use of CQAC queries, views, and rewritings. Please see Sect. 6 for more general classes of problem inputs that are covered by generalizations of the approach of Sect. 4. In this paper we use the term “automated database restructuring” to point out the possibility of “inventing” new views (see [8]) in specific algorithms in stage one of our proposed architecture. This way, the algorithms could ensure completeness of the exploration of the search space of view-based plans. For our query-language restrictions on the algorithms of Sect. 4, our proposed algorithms (presented in that section) are complete in that sense.
3 The Architecture
As discussed in Sect. 1, the problem of view selection is hard for a number of reasons. Thus, most view-selection approaches in the literature rely on heuristics with no guarantees of optimality or approximate optimality in the sense of [28]. Our problem statement ADR, see Sect. 2, emphasizes plans at the expense of views, and thus necessitates a different architecture from those proposed in the literature for the view-selection problem. Our architecture is natural for
[Fig. 1. Our two-stage architecture. Inputs: space bound and query workload. Stage one (Plan Selection): systematic analysis of applicable views/indexes and construction of plans, supported by simulation of views/indexes and estimation of plan costs; it outputs sets of good view- and index-based plans. Stage two: Enumeration of Potential Solutions, leading to the Final Recommendation.]
our problem statement, in that the architecture first forms a search space of plans, and then does selection of the best combination of plans in that space. An optimal combination of the given plans can be selected by a general Integer Linear Program (ILP) problem solver such as CPLEX [10], as discussed below. As shown in Fig. 1, our architecture has two stages: (1) A search for sets of competitive plans for the input queries (“plan selection”), and (2) Selection of one efficient plan for each input query (“enumeration of potential solutions”). The first stage begins with a query workload and, optionally, a space bound. It produces a set of plans and corresponding views so that there is at least one plan for each query. The output of stage one, along with the original query workload and space bound, becomes an input to stage two. All problem-specific details are encapsulated in stage one of our architecture. For example, we can restrict the nature of the queries allowed, the types of views, the operators, etc. The only output from stage one is a set of IDs of plans and of the views used in the plans. In fact, nothing about the details of the plans or views needs to be conveyed to stage two other than (a) the set of plans for each query; (b) the set of IDs of views required by each plan; (c) the cost of each plan; (d) the space required by each view; and (e) the space bound. Stage two is then free to solve a generalized knapsack problem: Find the lowest-cost plan for each query such that the total space used by the required views is no greater than the given bound. An ILP formulation of this problem is given in [17]. Our work illustrates that we can exploit the performance of CPLEX and other ILP solvers by transforming ADR instances into ILP instances. The biggest challenge is that, for a given query workload, the number of potential views, indexes, and (by extension) plans grows exponentially in the number of queries. This combinatorial explosion plagues all heuristics and algorithms for
view (or index) selection, either directly, or, in most cases, indirectly (e.g., when a heuristic is only able to explore a small portion of the solution space). ILP solvers continue to improve and are already capable of solving 300-query instances of stage two in seconds. An additional advantage of using a general-purpose ILP solver is the ability to incorporate arbitrary new constraints into the problem. Our architecture demonstrates that (a) the difficulties can be isolated (in stage one) and, for special cases of practical interest, overcome; and (b) even large instances of the ILP formulation in stage two can be solved efficiently. Given the fact that the input to stage two abstracts the details of the original ADR instance and reduces the problem to an ILP model, the following propositions hold with respect to any exact stage-two Algorithm A. Their correctness derives directly from the design of our architecture. Proposition 1. If the set of plans P has a subset P′ such that P′ includes at least one optimal plan for each query and the views and indexes required by P′ satisfy the space limit B, then any solution produced by Algorithm A using the subset P′ will be optimal with respect to the original query workload and B. Proposition 2. If the set of plans P has a subset P′ such that the total cost of the plans in P′ is within relative error ε of the optimum cost for the original query workload and the views and indexes required by P′ satisfy the space limit B, then any solution produced by Algorithm A will be within relative error ε of optimal with respect to the original query workload and space B. In other words, the quality of the output of stage one directly determines the quality of the solution produced by our architecture. Most of the remainder of the paper is devoted to implementations of stage one of our architecture for variants of the CQAC problem (Sect. 4), and to experiments confirming the competitiveness of our approach (Sect. 5). Finally (Sect. 6), we discuss extensions of the CQAC problem that can be handled by straightforward generalizations of our algorithms of Sect. 4.
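For concreteness, a stage-two model in the spirit of the ILP formulation referenced above can be sketched as follows (the exact model appears in [17]; the notation below is an assumption): let x_ij be a binary variable choosing the j-th candidate plan p_ij of query q_i, and y_v a binary variable deciding whether view v is materialized.

    \begin{align*}
    \min \quad & \sum_{i=1}^{n} \sum_{j} c(p_{ij})\, x_{ij} \\
    \text{s.t.} \quad & \sum_{j} x_{ij} = 1 \quad \text{for each query } q_i \quad \text{(exactly one plan per query)} \\
    & x_{ij} \le y_v \quad \text{for each view } v \text{ used by plan } p_{ij} \\
    & \sum_{v} \mathit{size}(v)\, y_v \le B \quad \text{(space bound)} \\
    & x_{ij},\, y_v \in \{0, 1\}.
    \end{align*}

A solver such as CPLEX then returns the lowest-cost feasible combination of plans, and the views with y_v = 1 form the materialization recommendation.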
4 Efficient Evaluation Plans for CQAC Queries
We present two variations of a specific algorithm implementing stage one of our architecture. The algorithm is applicable to conjunctive queries with arithmetic comparisons (CQAC queries), and considers views and rewritings in the language of CQACs. One variation yields optimal solutions in the sense of Proposition 1. The other generates fewer plans, trading optimality for efficiency. We introduce the algorithm by way of a simpler single-query algorithm, described more fully in [23]. The standard System-R-style optimizer [29] uses dynamic programming (DP) to find best plans for all subqueries of a query, in order of increasing sizes. For each subquery, it creates new plans that join plans for the component subqueries. After that, it chooses and saves the cheapest plan for each “interesting order” of tuples. We adapt this algorithm to ADR, with two important modifications:
1. In addition to the plans created by joins of subplans, we consider a simulated view that matches the subquery exactly (if not already in the database). This allows us to consider plans with a variety of combinations of simulated views.
2. We keep all relevant plans for each subquery. This is important for feasibility: the views required by the plan for each subquery may satisfy the space bound, but the overall set of views may fail to do so.
The requirement that all plans for a subquery be kept can be relaxed; hence the emphasis on relevance. Let cost(p) be the cost of executing plan p and weight(p) be the total size of all views used by p. Definition 1. Let p1 and p2 be two plans for the same subquery. If p1 and p2 return the same tuples in exactly the same order, such that cost(p2) ≤ cost(p1) and weight(p2) ≤ weight(p1) each hold, with at least one strict inequality, then plan p2 dominates plan p1. It is easy to see that a plan p can be eliminated from consideration whenever p is dominated by another plan p′. A proof that the single-query algorithm with this domination rule produces an optimal solution can be found in [23]. We now discuss how to adjust the single-query algorithm to work for multiple CQAC queries. Our proposed algorithm is applicable to chained queries, understood intuitively as multiple chain queries (see Sect. 2) where there exists a single super-chain such that the chain for each query is a sub-chain, see [23].² A naive approach to processing problem inputs with multiple chained queries would be to find plans for each input query separately using the single-query algorithm. This approach has several obvious problems:
1. One problem is with efficiency, especially when queries have common parts, as in this case we may end up doing some of the work repeatedly.
2. If we consider queries in isolation, we can miss some structures that are suboptimal for single queries, but are beneficial for groups of queries. (E.g., if one query has an arithmetic comparison (AC) 0 ≤ B ≤ 3, where B is an attribute name, and another has 1 ≤ B ≤ 4, then materializing a view with AC “0 ≤ B ≤ 4” may be a competitive option.)
3. Finally, the pruning rule for the single-query algorithm might remove plans that are needed for an optimal solution; see the example in [23].
In what follows we discuss the basic framework of a multi-query algorithm and two pruning rules for it, one that guarantees the presence of an optimal solution for each query, the other trading off optimality for efficiency.
Chained Queries without ACs. For chained queries without arithmetic comparisons (ACs), our algorithm uses the same DP structure as the single-query case [23], with an important difference: some subchains of the combined chain are not subqueries of any query and do not require creation of plans.
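A minimal sketch of the domination test of Definition 1 follows (illustrative only; the plan representation is an assumption): a plan for a subquery is kept only if no other plan with the same output order is at least as good in both cost and weight and strictly better in one of them.

    #include <vector>

    struct Plan {
        double cost;     // estimated evaluation cost
        double weight;   // total size of the views the plan requires
        int    order;    // identifier of the tuple order the plan produces
    };

    // true iff a dominates b: same order, no worse in both dimensions, better in one.
    static bool dominates(const Plan& a, const Plan& b) {
        return a.order == b.order &&
               a.cost <= b.cost && a.weight <= b.weight &&
               (a.cost < b.cost || a.weight < b.weight);
    }

    // Keep only non-dominated plans for one subquery (quadratic scan, for clarity).
    std::vector<Plan> prunePlans(const std::vector<Plan>& plans) {
        std::vector<Plan> kept;
        for (std::size_t i = 0; i < plans.size(); ++i) {
            bool dominated = false;
            for (std::size_t j = 0; j < plans.size(); ++j)
                if (j != i && dominates(plans[j], plans[i])) { dominated = true; break; }
            if (!dominated) kept.push_back(plans[i]);
        }
        return kept;
    }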
² The queries in the TPC-H benchmark [30] are chained queries on a chain of length 9, with possibly some additional queries that turn the chain into a cycle or follow a single branch away from the chain. The case of the branch can be incorporated into our approach (see Sect. 6), while the cycle is currently under investigation.
[Fig. 2. Example of DP lattice for multiple chained queries. (Dij, with i < j, denotes a chain of subgoals i, i+1, . . . , j−1, j.) The lattice shows three queries Q1, Q2, Q3 over a combined chain of subgoals 1–8.]
Consider an illustration. Suppose we have two queries Q1 :- ABCD and Q2 :- CDE. (This format of defining the queries enumerates just the predicate names of all the relational subgoals of the queries.) Then the combined chain is ABCDE. Subchain BCDE is not a part of either query, nor is ABCDE – we do not need plans for either of them. The DP lattice for the case of multiple queries resembles a collection of mountain peaks as illustrated in Fig. 2, which shows the lattice for three queries on a combined chain of length 8. In this diagram, circles represent subchains for which we need to construct plans. Examples of subchains for which we do not need to construct plans are D46, D25, and D36. Our pruning rule for the case of a single query may not be correct for the case of multiple queries. Suppose that in the multi-query case, we prune p1 because its cost is higher than that of p2, and p1 uses at least as much space as p2. In the multi-query situation, if we replace all occurrences of p1 with p2 then the overall cost of the solution will decrease, but the total weight may actually increase, as the views used by p1 might be useful for evaluating other queries. Thus, removal of p1 might actually eliminate feasible plans that take advantage of the space savings when views are used by plans for more than one query. To avoid this problem, we propose the following definitions. Definition 2. For a given set of queries Q, we say that a view v is exclusive if it can be used by exactly one query in Q. Definition 3. The exclusive weight of plan p, ew(p), is the total (i.e., summed) weight of the exclusive views of p. Definition 4. Let p1 and p2 be two plans for the same subquery. If p1 and p2 return the same tuples in exactly the same order, such that cost(p2) ≤ cost(p1) and weight(p2) ≤ ew(p1) each hold, with at least one strict inequality, then plan p2 globally dominates plan p1. In our algorithm MultiQueryPlanGen of Fig. 3, procedure PrunePlans removes all plans p such that p is globally dominated by another plan p′ for the same subquery. In order to prune plans that are globally dominated, we keep track of both the exclusive weight and the total weight of each plan. Then we can
Algorithm 1. MultiQueryPlanGen
Input: database statistics (see Sect. 2), set of CQAC queries Q that together form chain H, space bound B
Output: a set of plans for each query in Q, containing an optimal solution to ADR
for each sub-chain q of H in order of increasing size do
    for each split of q into two smaller sub-chains q1 and q2 do
        for each pair of plans p1 ∈ plans(q1) and p2 ∈ plans(q2) do
            if queries(p1) ∩ queries(p2) ≠ ∅ AND total weight of p1 and p2 is at most B then
                create plan p by joining p1 and p2;
                queries(p) = queries(p1) ∩ queries(p2);
                save p into plans(q);
    let Q′ be the set of queries for which q is a subset of their tables;
    for each k ⊆ Q′ do
        simulate view v, which is the result of the join of the tables in q with the disjunction of the sets of constraints of the queries in k applied to it;
        if the size of v is at most B then
            create plan p based on v;
            initialize queries(p) with the IDs of the queries in k;
            save plan p into plans(q);
    PrunePlans(q);
return the set of plans for each query in Q
Fig. 3. Constructing (view-based) evaluation plans for multiple CQAC queries
execute PrunePlans either by comparing each pair of plans or, more efficiently, by first sorting the plans by cost or by maintaining a search tree. Adding arithmetic comparisons. We now discuss what happens when we allow selection conditions in the WHERE clause of the input queries. For ease of exposition, we assume that all the selection conditions are range (i.e., inequality) arithmetic comparisons (ACs), although, as we explain later, most of the techniques that we discuss here apply to other types of selection conditions. In presence of ACs, our algorithm needs several adjustments. First, when the input has multiple queries with different ACs, two plans built on the same set of tables might differ w.r.t. their ACs and not be usable for the same set of queries. As a result, the same node might contain plans for different subqueries. Thus, for each plan we need to keep a list of queries that can use this plan. Second, we must take care of so-called merged views – views that are usable by more than one query. If we have two queries that use the same subset of tables but different sets of ACs, then it may benefit both to create a merged view whose set of ACs is the disjunction (i.e., OR) of the ACs of the queries. In theory, the number of candidate merged views is exponential in the number of queries, but in practice this number is much lower. Suppose we have n > 2 queries that overlap on the same larger chain (set of tables). Suppose query Q1
has ACs on attributes B1 and B2, query Q2 has ACs on B2 and B3, etc. Then the merged view for queries Qi and Qj, such that |i − j| > 1, is a simple join of the tables in the underlying chain, with no ACs on it. The same is true for any subset of the queries containing more than two queries. Therefore, in this case, we have only n − 1 possible merged views. (See [23] for another example.) Our approach to generating efficient query plans for the CQAC version of problem ADR is encoded in the algorithm MQ of Fig. 3. The algorithm uses two auxiliary structures: plans(q) is the list of plans for subquery q; queries(p) is the list of IDs of the queries that can use (partial) plan p. Theorem 1. Algorithm MQ returns a set of view-based plans P such that there exists S ⊆ P where S is an optimal set of plans. The proof can be found in [23]. Theorem 1 is a very important result. It means that Algorithm MQ performs a systematic investigation of the search space of view-based plans and returns a (reduced-size) list of plans that contains an optimal solution. Thus, the solution quality of the two-stage architecture that we presented in Sect. 3 depends only on the quality guarantees of the algorithm used in stage two of the architecture. Combined with Propositions 1 and 2, this result provides strong optimality guarantees for our overall architecture when applied to the CQAC class of problem inputs considered in this section.
More aggressive pruning. The strong point of Algorithm MQ is that it preserves optimality. Unfortunately, its runtime grows exponentially in the number of queries, as one might expect from an algorithm that solves an NP-hard problem. We now describe a few more aggressive pruning rules that remove many more plans at the expense of losing the optimality guarantee. These rules suggest a family of algorithms for processing CQAC queries at stage one of our architecture, which might represent points along the time/quality continuum. In an example in the full version [23] of this paper we demonstrate that the pruning rule that we used in our single-query algorithm does not guarantee optimality if used for the multiple-query case, as it does not account for the views that are shared by multiple queries. At the same time, our experiments (Sect. 5) suggest that the single-query rule, even if applied to the case of multiple queries, does not significantly reduce the quality of the solution. The idea of our second aggressive pruning rule is to limit the number of plans we keep for each subproblem: We keep only the k plans with the largest profit ∗ queries/size, where profit is the decrease in cost offered by the plan (over use of base tables), queries is the number of queries that can use the plan, and size is the total size of the views used by the plan. Please see [23] for the details.
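The second aggressive rule can be sketched as follows (illustrative code, not the authors' implementation; field names are assumptions): for every subproblem, candidate plans are ranked by profit * queries / size and only the k best are retained.

    #include <algorithm>
    #include <vector>

    struct CandidatePlan {
        double profit;   // cost decrease over evaluating from base tables
        int    queries;  // number of workload queries that can use the plan
        double size;     // total size of the views the plan requires (assumed > 0)
    };

    static double score(const CandidatePlan& p) {
        return p.profit * p.queries / p.size;
    }

    // Keep at most k plans with the largest profit * queries / size.
    std::vector<CandidatePlan> keepTopK(std::vector<CandidatePlan> plans, std::size_t k) {
        std::sort(plans.begin(), plans.end(),
                  [](const CandidatePlan& a, const CandidatePlan& b) {
                      return score(a) > score(b);
                  });
        if (plans.size() > k) plans.resize(k);
        return plans;
    }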
5
Experimental Results
The experiments reported in this section address two questions: (a) How does our two-stage approach compete with a relaxation-based approach (RBA), such as that of [3]? (b) Can we safely assume that stage two is not the bottleneck when using CPLEX [10]? Our extensive and thorough investigation gives a positive
response to each question. An extended report including a discussion of how we generated realistic random instances can be found in [23]. Comparisons with RBA. The RBA algorithm that we use for comparison purposes is the one described in [3]. It consists of two main stages: (a) choose a set of physical structures (i.e., views or indexes) that is guaranteed to result in an optimal configuration, but may take too much space; and (b) “shrink” it using transformations, such as view and index merging or index prefixing. In the experiments comparing our two-stage approach with the RBA of [3] we examined the quality achieved by each approach within a specified time period. Our own implementation of [3] was used since we did not have access to the code for it. Any comparison using time as a measure may therefore not be representative of the actual relative performance of the two approaches. Thus we use a combinatorial measure, as follows. In the physical database design literature it is a common practice to use the total number of query-optimization calls as a measure of time spent on optimization. We use a more fine-grained unit of time — size estimation. For each operator in a plan tree, the optimizer estimates the size of the results, the execution cost, and the order of tuples of the result. Thus, any query-optimization call consists of a series of size and statistics approximation calls. Although size estimations are not the only operations performed by the optimizer, they tend to dominate the runtime of query optimization. We ran the RBA algorithm on each instance until it has done as many size estimations as the MQ algorithm of Sect. 4, using the single-query pruning rule; we refer to this algorithm as AR, aggressive (pruning) rule. We then compared the solution quality of the two algorithms.3 Fig. 4(a) shows that in almost all cases our algorithm achieves higher quality solutions. In a few cases RBA failed to find a feasible solution. To make sure that the cutoff does not preclude solutions that are almost encountered by RBA, we also allowed RBA four times as many size estimations as our algorithm, see Fig. 4(b). Note that the RBA’s performance does not improve much vs. AR even when allowed four times as many size estimations. In fact in only three (out of 250) cases, RBA could improve its solution in comparison to AR. Both sets of results differ depending on the relationship between the space bound and the size of the query results. Figure 4 shows that our approach gets better relative to RBA as the problem gets harder (less space is available). The difference becomes dramatic with only a slight increase in difficulty. Results with More Aggressive Pruning. Recall that the runtime of AR is exponential in the worst case. Even in practice its runtime increases dramatically with increasing query size. To mitigate this, we experimented with a version of AR that sets a limit on the maximum number of plans kept for each subquery. The choice of plans is made heuristically using the k plans with largest profit ∗ queries/size, where profit is the decrease in cost offered by the plan (over use of 3
Stage two using CPLEX is also executed, taking a small fraction of the total time.
Fig. 4. Comparing solution quality of our AR (Sect. 4) versus the RBA of [3]: (a) RBA is allowed the same number of size estimations; (b) RBA is allowed four times as many size estimations. [Bar charts of the number of instances in which AR is better than, the same as, or worse than RBA, for space bounds of 0.9 down to 0.1 of the total size of query answers.]
base tables), queries is the number of queries that can use the plan, and size is the total size of the views in the plan. With k = 20 – call this AR20 (smaller values of k did not improve runtime significantly) – we were able to handle instances with up to 80 queries in about a minute.4 This is opposed to requiring two minutes for 65 queries and being unable to finish processing 70 within 10 minutes. (We still need to demonstrate that AR20 is competitive when it comes to solution quality.) Fig. 5 shows the performance of AR20 vs. RBA using the same setting as that of Fig. 4. The superiority of AR20 w.r.t. solution quality is not as dramatic as with AR, but it is still clear. Whereas AR wins practically all the time with the space bound less than 100% of query size, AR20 catches up gradually and attains superiority at 50%. After that, the relative results do not change significantly.
All runtimes are for an AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, with 2 MB of L2 cache, and 4 GB memory, running Red Hat Enterprise Linux 5.
Fig. 5. AR20 versus RBA for different space bounds. [Bar chart of the number of instances in which AR20 is better than, the same as, or worse than RBA, for space bounds of 0.9 down to 0.1 of the total size of query answers.]

Fig. 6. AR20 versus RBA as a function of number of queries: (a) space bound is 0.5 times total query size; (b) space bound is 0.1 times total query size. [Bar charts of the number of instances in which AR20 is better than, the same as, or worse than RBA, for 5 to 70 queries.]
Another way to evaluate AR20 is how its performance vis a vis RBA scales with the number of queries. Since AR20 makes significantly fewer size-estimate calls than AR, we end up allowing fewer size estimates for RBA. When the space bound is 50% of query size – Fig. 6 – the results are mixed; AR20’s advantage in “speed” is offset by poorer relative solution quality. However, when the space bound is 10% of query size, the instances are much harder for RBA, while AR20 is still able to come up with significantly better solutions. Scalability Results. One concern with our two-stage architecture is that the second stage, which solves a generic integer programming problem instead of using problem-specific techniques, might incur prohibitive runtime. Our experiments suggest otherwise. We generated random instances ranging from 25 to 300 queries using the techniques described in [23], being careful to set parameters so that the instances had characteristics similar to the stage one outputs of our smaller instances. Runtimes (using CPLEX 11.0) ranged from less than a second (for 100 queries or fewer) to 10 seconds (for 300 queries). We also experimented with larger instances: the largest instance that we were able to solve within 10 minutes using CPLEX had 800 queries, 1074 plans per query, and 6886 views. Summary. In the experiments reported here we showed that our stage one algorithm with an aggressive pruning rule yielded better solution quality than an RBA approach most of the time. This continued to hold even for a faster version of our algorithm that pruned more plans at every step. We also demonstrated that the computational effort in stage two is relatively insignificant and scales well, allowing very large instances to be solved. We are currently considering even more aggressive pruning rules, to bring down the runtime of stage one while still yielding high-quality solutions. In addition, we are investigating stage one algorithms for other database schema.
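To make the role of stage two concrete, the sketch below shows one plausible way to hand the plan-selection problem to an off-the-shelf integer-programming solver (here the open-source PuLP package instead of CPLEX); the data layout and variable names are our own illustration and not the encoding used in [23]. Each query must pick exactly one of its candidate plans, a chosen plan forces its views to be materialized, and the total size of the materialized views must respect the space bound.

```python
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum

def select_plans(plans, view_size, space_bound):
    """plans[q] is a list of (cost, views) pairs for query q; view_size maps each
    view ID to its estimated size. Returns the chosen plan index for every query."""
    prob = LpProblem("stage_two_plan_selection", LpMinimize)
    x = {(q, i): LpVariable(f"x_{q}_{i}", cat=LpBinary)
         for q in plans for i in range(len(plans[q]))}
    y = {v: LpVariable(f"y_{v}", cat=LpBinary) for v in view_size}

    # objective: minimize the total evaluation cost of the chosen plans
    prob += lpSum(plans[q][i][0] * x[q, i] for (q, i) in x)
    # exactly one plan per query
    for q in plans:
        prob += lpSum(x[q, i] for i in range(len(plans[q]))) == 1
    # a chosen plan requires all of its views to be materialized
    for (q, i), var in x.items():
        for v in plans[q][i][1]:
            prob += var <= y[v]
    # the materialized views must fit into the available space
    prob += lpSum(view_size[v] * y[v] for v in view_size) <= space_bound

    prob.solve()
    return {q: i for (q, i), var in x.items() if var.value() == 1}
```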
6
Discussion and Extensions
We have considered the problem of selecting views or indexes that minimize the evaluation costs of the frequent and important queries under a given upper bound on the available disk space. To solve the problem, we proposed a novel end-to-end approach that focuses on systematic exploration of (possibly task-based) plans for evaluating the input queries. Specifically, we proposed a framework (architecture, see Sect. 3) and algorithms (Sect. 4) that enable selection of those tasks that contribute to the most efficient plans for the input queries, subject to the space bound. We presented strong optimality guarantees on the proposed architecture. Our proposed algorithms search for sets of competitive view-based plans for queries expressed in the language of conjunctive queries with arithmetic comparisons (CQACs). This language captures the full expressive power of SQL select-project-join queries, which are common in practical database systems. Our experimental results on synthetic and benchmark instances (see Sect. 5) corroborate the competitiveness and scalability of our approach.
We now focus on some classes of problem inputs and of allowed views and rewritings/plans that subsume the CQAC class to which our algorithms of Sect. 4 are applicable. Recall that our architecture has two stages: (1) a search for sets of competitive plans for the input queries, and (2) selection of one efficient plan for each input query. The plans in the input to and output of the second stage are formulated as sets of IDs of those tasks whose materialization would permit evaluation of the plans in the database. Thus, all problem-specific details, including the query languages for the input queries and for the views/rewritings of interest for efficient evaluation of the queries, are encapsulated in stage one of our architecture. It follows that, to capture more general query, view, and rewriting languages, as well as the presence of indexes, in our overall approach, it is sufficient to develop algorithms for stage one only of our proposed architecture. Several extensions, based on our stage one algorithm(s), are discussed in more detail in [23]: (a) incorporating indexes5; (b) selections other than arithmetic comparisons; (c) queries, views, and rewritings with grouping and aggregation; and (d) queries in configurations other than chains.
References [1] Agrawal, S., Chaudhuri, S., Koll´ ar, L., Marathe, A.P., Narasayya, V.R., Syamala, M.: Database tuning advisor for Microsoft SQL Server 2005. In: VLDB (2004) [2] Agrawal, S., Chaudhuri, S., Narasayya, V.R.: Automated selection of materialized views and indexes in SQL databases. In: VLDB, pp. 496–505 (2000) [3] Bruno, N., Chaudhuri, S.: Automatic physical database tuning: A relaxation-based approach. In: SIGMOD, pp. 227–238 (2005) [4] Bruno, N., Chaudhuri, S.: Physical design refinement: The merge-reduce approach. ACM Transactions on Database Systems 32(4), 28–43 (2007) ¨ [5] Balmin, A., Ozcan, F., Beyer, K.S., Cochrane, R., Pirahesh, H.: A framework for using materialized XPath views in XML query processing. In: VLDB (2004) [6] Valentin, G., Zuliani, M., Zilio, D.C., Lohman, G.M., Skelley, A.: DB2 advisor: An optimizer smart enough to recommend its own indexes. In: ICDE (2000) [7] Zilio, D.C., Zuzarte, C., Lightstone, S., Ma, W., Lohman, G.M., Cochrane, R., Pirahesh, H., Colby, L.S., Gryz, J., Alton, E., Liang, D., Valentin, G.: Recommending views and indexes with IBM DB2 design advisor. In: ICAC (2004) [8] Chirkova, R.: Automated Database Restructuring. PhD thesis, Stanford U. (2002) [9] Chaudhuri, S., Datar, M., Narasayya, V.R.: Index selection for databases: A hardness study and principled heuristic solution. IEEE TKDE 16, 1313–1323 (2004) [10] ILOG: CPLEX Homepage (2004), http://www.ilog.com/products/cplex/ [11] Gupta, H., Harinarayan, V., Rajaraman, A., Ullman, J.D.: Index selection for OLAP. In: ICDE (1997) [12] Gupta, H., Mumick, I.S.: Selection of views to materialize under a maintenance cost constraint. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 453–470. Springer, Heidelberg (1999) [13] Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing data cubes efficiently. In: SIGMOD (1996) 5
In stage two an index is just another entity requiring additional space.
[14] Karloff, H.J., Mihail, M.: On the complexity of the view-selection problem. In: PODS (1999) [15] Asgharzadeh Talebi, Z., Chirkova, R., Fathi, Y., Stallmann, M.: Exact and inexact methods for selecting views and indexes for OLAP performance improvement. In: EDBT (2008) [16] Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL server. In: VLDB, pp. 146–155 (1997) [17] Kormilitsin, M., Chirkova, R., Fathi, Y., Stallmann, M.: View and index selection for query-performance improvement: Quality-centered algorithms and heuristics. In: CIKM (2008) [18] Gupta, A., Mumick, I.S., Rao, J., Ross, K.: Adapting materialized views after redefinitions: techniques and a performance study. Inf. Sys. 26(5), 323–362 (2001) [19] Mistry, H., Roy, P., Sudarshan, S., Ramamritham, K.: Materialized view selection and maintenance using multi-query optimization. In: SIGMOD, pp. 307–318 (2001) [20] Roy, P., Seshadri, S., Sudarshan, S., Bhobe, S.: Efficient and extensible algorithms for multi query optimization. In: SIGMOD, pp. 249–260 (2000) [21] Bruno, N., Chaudhuri, S.: Online approach to physical design tuning. In: ICDE 2007 (2007) [22] Bruno, N., Chaudhuri, S.: Constrained physical design tuning. PVLDB 1 (2008) [23] Kormilitsin, M., Chirkova, R., Fathi, Y., Stallmann, M.: Systematic exploration of efficient query plans for automated database restructuring. Technical Report TR2009-8, NCSU (2009), http://www.csc.ncsu.edu/research/tech/reports.php [24] Gou, G., Kormilitsin, M., Chirkova, R.: Query evaluation using overlapping views: Completeness and efficiency. In: SIGMOD, pp. 37–48 (2006) [25] Chaudhuri, S., Krishnamurthy, R., Potamianos, S., Shim, K.: Optimizing queries with materialized views. In: ICDE, pp. 190–200 (1995) [26] Klug, A.: On conjunctive queries containing inequalities. J. ACM 35, 146–160 (1988) [27] Ono, K., Lohman, G.: Measuring the complexity of join enumeration in query optimization. In: VLDB, pp. 314–325 (1990) [28] Ausiello, G., Crescenzi, P., Gambosi, G., Kann, V., Marchetti-Spaccamela, A., Protasi, M.: Complexity and Approximation. Springer, Heidelberg (1999) [29] Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD (1979) [30] TPC-H:: TPC Benchmark H, http://www.tpc.org/tpch/spec/tpch2.1.0.pdf
Using Structural Joins and Holistic Twig Joins for Native XML Query Optimization

Andreas M. Weiner and Theo Härder

Databases and Information Systems Group, Department of Computer Science, University of Kaiserslautern, 67653 Kaiserslautern, Germany
{weiner,haerder}@cs.uni-kl.de
Abstract. One of the most important factors for the success of native XML database systems is a powerful query optimizer. Surprisingly, little has been done to develop cost models to enable cost-based optimization in such systems. Since the entire optimization process is so complex, only a stepwise approach will lead to a satisfying (future) solution. In this work, we are paving the way for cost-based XML query optimization by developing cost formulae for two important join operators, which allow join reordering and join fusion to be performed in a cost-aware way and, therefore, make the joint application of Structural Joins and Holistic Twig Joins possible.
1
Introduction
In the last few years, XML became a de-facto standard for the exchange of structured and semi-structured data in business and in research. Amongst others, the quality of query optimizers plays an important role for the acceptance of database systems by a wide range of users. Even though we can look retrospectively at 30 years of research on cost-based query optimization in relational database systems – a fairly complex task –, cost-based optimization in native XML database management systems (XDBMSs) is different. It is much more than providing cardinality estimation for homogeneous row sets or simple valuebased join reordering and thus raises additional challenges: (1) In the world of XML, schema evolution and cardinality variations will occur much more often than in the relational world. This adds an additional degree of fuzziness to element-cardinality estimation. (2) In the relational world, value-based cardinality estimation worked fine, due to the simplicity of the data model with its homogeneous rows. As XML explicitly supports heterogeneity, two documents complying with a given schema do not necessarily share the same structure, e. g., an element e can occur in document d1 several times, but never in document d2 . (3) XML is based on a hierarchical data model where structural relationships,
Financial support by the Research Center (CM)2 of the University of Kaiserslautern is acknowledged (http://cmcm.uni-kl.de)
e. g., child and descendant, – in addition to classic value-based relationships – play an important role especially during query evaluation. Hence, an XML query optimizer has to deal with value-based joins as well as with structural joins and must arrange them in an optimal way. (4) Since XML relies on an ordered data model, physical operators must preserve this property. 1.1
Problem Statement
Nowadays, structural relationships on XML documents, such as child or descendant, are efficiently evaluated using Structural Joins (SJs) [2] and Holistic Twig Joins (HTJs) [4]. SJs decompose an XPath path expression into n binary structural relationships, evaluate each of those relationships separately, and finally “stitch” the results together. In contrast, HTJs can evaluate path expressions – or even more complex structures like so-called twig query patterns – as a whole. Both types of operators must provide their results in document order and sometimes need to perform duplicate elimination which adds additional complexity to them. Compared to classic value-based join operators like nested-loops join or sort-merge join, estimating the CPU costs of SJ or HTJ operators is hard and needs much more effort and empirical analysis. For example, HTJ operators like TwigOptimal [8] are only at the first glance simple n-way join operators, because they can perform jumps on their input lists and, therefore, do not need to process them completely. Even though index access operators are inevitable for fast XML query evaluation, we cannot assume that they are available by default in an XDBMS, because their maintenance can cause substantial overhead. For this reason, we need to make the most out of SJ and HTJ operators which will remain first-class citizens in native XDBMSs. As argued since the very beginning of HTJ operators, they are claimed to outperform SJ operators in low-selectivity scenarios. The interesting but yet unanswered question is – when? Answering this question is essential for the development of good cost formulae in particular and a proper cost model in general. Compared to cost formulae for relational join operators, which can be easily derived by just “looking” at the algorithms1 , join operator selection becomes a difficult task in the XML world. Due to the very complex SJ and HTJ operators, an empirical approach that tests SJ and HTJ operators under real system conditions against each other seems to be more beneficial than a simple one-to-one competition of algorithms. 1.2
Our Contribution
The contribution of this work can be summarized as follows: We first compare empirically a prominent SJ (StackTree) and HTJ (TwigOptimal ) operator regarding costs in different selectivity scenarios. By doing so, we derive break-even points where the SJ operator outperforms the HTJ operator and vice versa. 1
To convince yourself, compare the classic nested-loops join algorithm with an arbitrary HTJ algorithm.
Based on this information, we are ready to develop cost formulae describing the relative CPU costs of both operators. By utilizing these formulae, we provide an initial cost model for native XML query processing which empowers an XML query optimizer to base its decision on expected execution and IO costs rather than on simple heuristics. Even though these formulae are only a first step towards the development of cost-based XML query optimizers, we show that we can use it now to perform cost-aware join reordering and join fusion. Although HTJ operators are outperforming SJ operators in many situations, we identify the circumstances where a joint query evaluation strategy can lead to tremendous savings in query execution time. While our model is derived empirically, we show that it is stable enough to be competitive in more complex evaluation scenarios. Moreover, it supports the query optimizer in choosing a cheaper over an expensive plan, especially when the relative cost differences are very high. 1.3
Related Work
Since the research directions of cost-based XML query optimization and XML cost modelling are just emerging, there are only a few related publications. The classic work of McHugh and Widom [9] on the optimization of XML queries focuses only on optimizing path expressions using navigational access methods and lacks support for SJ and HTJ operators. Wu et al. [16] propose five novel dynamic programming algorithms for structural join reordering. Their approach is orthogonal to our work, i.e., it can be employed to choose the best join order in SJ-only scenarios. Compared to our work, they use only a very simple cost model for driving the join-reordering process and do not consider the combination of SJ and HTJ operators. Zhang et al. [17] introduce several statistical learning techniques for XML cost modelling. In contrast to our work, which follows a static cost modelling approach, they demonstrate how to model the costs of a navigational access operator. Unfortunately, they do not cover SJ and HTJ operators. Balmin et al. [3] sketch the development of a hybrid cost-based optimizer for SQL and XQuery that is part of DB2 XML. Compared to our approach, they evaluate every path expression using an HTJ operator and cannot decide on a fine-granular level whether to use SJ operators or not.
2
XML Query Processing in a Nutshell
Structural Joins and Holistic Twig Joins produce and work on streams of element nodes. For example, for evaluating the structural relationship book/author on a given document, an SJ operator takes two inputs, all book nodes and all author nodes, and returns all author nodes that satisfy the structural predicate /. A precondition for SJ and HTJ operators is a node labeling scheme that assigns each node in an XML document a unique identifier that (1) allows one to decide, without accessing the document, for two given nodes whether they are structurally
related to each other, and (2) does not require re-labeling even after modifications to the document. In this work, we rely on the basic storage infrastructure of the XML Transaction Coordinator (XTC) – our prototype of a native XDBMS [6]. In our system, we use so-called DeweyIDs [5] as node labels, which allow us to efficiently decide all structural relationships defined for the query languages XPath resp. XQuery. Figure 1 shows a sample XML document which is labeled using DeweyIDs.

Fig. 1. A sample XML document labeled with DeweyIDs

For providing access to element node streams, we restrict our discussion to two access methods as depicted in Fig. 2. By default, the so-called document index [6] serves as our primary access method. Figure 2(a) shows a sample document index for the XML tree illustrated in Fig. 1. The tree contains the DeweyIDs assigned to each node in the document as keys in ascending order. Each key refers to the data page that contains the corresponding record. For accessing all book element nodes, all data pages have to be accessed. It is obvious that such an access method can only be efficient in high-selectivity scenarios. Therefore, we can employ a secondary access structure called the element index [6]. It consists of a name directory where the element names serve as keys and point to node-reference indexes implemented as B*-trees. Each node-reference index contains the DeweyIDs of the corresponding element node instances in ascending document order. Using this structure, efficient access to element nodes can be assured. For example, for accessing all author element nodes, only the data pages of the corresponding node-reference index have to be scanned.
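As a small illustration of how such labels support structural predicates without touching the document, the following sketch (ours; it assumes DeweyIDs written as dot-separated division values such as 1.3.5 and relies only on the prefix property, ignoring XTC-specific assignment details) decides the ancestor/descendant and parent/child axes purely on the labels:

```python
def divisions(dewey_id: str):
    """Split a DeweyID such as '1.3.5' into its numeric division values."""
    return [int(d) for d in dewey_id.split(".")]

def is_ancestor(a: str, b: str) -> bool:
    """a is an ancestor of b iff a's divisions form a proper prefix of b's."""
    da, db = divisions(a), divisions(b)
    return len(da) < len(db) and db[:len(da)] == da

def is_parent(a: str, b: str) -> bool:
    """a is the parent of b iff it is an ancestor exactly one level above b."""
    return is_ancestor(a, b) and len(divisions(b)) == len(divisions(a)) + 1

def document_order_key(dewey_id: str):
    """Sorting by the division values yields ascending document order."""
    return divisions(dewey_id)
```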
(a) Document index
(b) Element index
Fig. 2. Primary and secondary access methods
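To connect these access methods with the join operators discussed in the next section, the sketch below shows a deliberately naive structural join over two element-index streams; it is only meant to make the semantics concrete and is neither StackTree nor TwigOptimal, which avoid such pairwise comparisons:

```python
def naive_structural_join(ancestors, descendants, axis="child"):
    """ancestors and descendants are lists of DeweyIDs in document order, e.g. the
    streams of all book and all author nodes delivered by the element index.
    Returns the descendant-side nodes that satisfy the structural predicate."""
    def divs(d):
        return [int(x) for x in d.split(".")]

    def matches(a, b):
        pa, pb = divs(a), divs(b)
        if len(pb) <= len(pa) or pb[:len(pa)] != pa:
            return False                       # a is not an ancestor of b
        return len(pb) == len(pa) + 1 if axis == "child" else True

    # the output inherits the document order of the descendant stream
    return [b for b in descendants if any(matches(a, b) for a in ancestors)]

# Example: naive_structural_join(book_ids, author_ids, axis="child") returns all
# author nodes that are children of some book node (the predicate '/').
```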
3
Towards Joint Query Evaluation Using Structural Joins and Holistic Twig Joins
Since the advent of the first relational query optimizers, one of their main objectives has been finding an optimal orchestration of join operators. This statement still holds in the context of native XDBMSs. Of course, we could save the time for selecting an optimal join order and evaluate all path expressions using an n-way HTJ operator. Obviously, this would result in a much smaller search space, but it would also take away all room for optimization. Consequently, the chances are very high that the optimizer misses the optimal (or at least a near-optimal) query plan. Therefore, we suggest Join Fusion as a novel operation for an XML query optimizer that allows it to fuse two adjacent SJ operators into an HTJ operator, if and only if child or descendant relationships are evaluated [15]. Using this strategy, a query optimizer can use SJ operators (still allowing classical join reordering) as well as HTJ operators to evaluate path expressions efficiently.
3.1
Preliminaries and Nomenclature
Due to space restrictions, we cannot compare all SJ and HTJ operators proposed so far. Therefore, we restrict our discussion to two prominent representatives of SJ resp. HTJ operators: StackTree [2] and TwigOptimal [8]. The goal of this analysis is identifying selectivity intervals in which StackTree outperforms TwigOptimal and vice versa. For the comparison of both join operators, we choose – without loss of generality – a snapshot of the well-known DBLP bibliography2 which has currently a size of around 400 MB. It provides a real-world XML document which allows to effectively compare the execution times of both operators in varying selectivity scenarios. All experiments were done on an Intel Pentium IV computer (two 3.20 GHz CPUs, 1 GB of main memory, 80 MB of external memory) running Linux with kernel version 2.6.13. Our native XDBMS server – implemented using Java version 1.6.0 06 – was configured with a page size of 4 KB and a buffer size of 250 4-KB frames. Instead of comparing absolute execution times, we choose a relative measure called the Relative Performance Gain (RPG). RPG is always defined w. r. t. StackTree or TwigOptimal whose RPG value is normalized to 1.0. For the execution timings of StackTree and TwigOptimal – ts and tt , respectively –, we define the RPG value w. r. t. StackTree as RPGs = tt /ts and the RPG value w. r. t. TwigOptimal as RPGt = ts /tt . For example, if RPGs > 1.0, then the StackTree operator outperforms the TwigOptimal operator. We restrict our discussion on arbitrary long path expressions consisting of location steps (e. g., a/b) and predicate steps (e. g., a[b]) including several path predicates. Our discussion focuses on the child and descendant axis, because almost all SJ and HTJ operators cannot evaluate other axes. 2
http://dblp.uni-trier.de/xml/
Table 1. Nomenclature

Name            Definition
RPG             Average relative performance gain (RPG)
Cardd(E)        Total number of occurrences of element E
Card^out_d(s)   Total number of elements satisfying the step s
σd(sl)          = Card^out_d(sl) / Cardd(Ej)   (step-wise selectivity)
σd(sp)          = Card^out_d(sp) / Cardd(Ei)   (predicate selectivity)
rd(s)           = Cardd(Ei) / Cardd(Ej)   (input ratio)
hd(sl)          = Card^out_d(sl) / Cardd(Ei)   (hit ratio)
For an XPath axis θi ∈ {/, //}, a location step sl = Ei θi Ej, a predicate step sp = Ei[θi Ej] – s for short –, and a document d, Table 1 shows the definitions used throughout the rest of this work.
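The quantities of Table 1 are simple ratios over (estimated) element cardinalities; a direct transcription (with hypothetical argument names of our choosing) looks as follows:

```python
def step_selectivity(card_out_sl, card_Ej):
    """sigma_d(sl) = Card^out_d(sl) / Card_d(Ej)   (step-wise selectivity)"""
    return card_out_sl / card_Ej

def predicate_selectivity(card_out_sp, card_Ei):
    """sigma_d(sp) = Card^out_d(sp) / Card_d(Ei)   (predicate selectivity)"""
    return card_out_sp / card_Ei

def input_ratio(card_Ei, card_Ej):
    """r_d(s) = Card_d(Ei) / Card_d(Ej)   (input ratio)"""
    return card_Ei / card_Ej

def hit_ratio(card_out_sl, card_Ei):
    """h_d(sl) = Card^out_d(sl) / Card_d(Ei)   (hit ratio)"""
    return card_out_sl / card_Ei
```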
3.2
Cost Analysis
For the cost analysis of the StackTree and the TwigOptimal operator, we explore the execution times of location steps and predicate steps. For a location step sl = Ei θi Ej and a predicate step sp = Ei[θi Ej] – s for short –, we assume a uniform distribution of values and consider two different cases: (1) the cardinality of Ei is less than the cardinality of Ej, which leads to rd < 1.0, and (2) the cardinality of Ei is greater than or equal to the cardinality of Ej, which results in rd ≥ 1.0.

Case 1: Cardd(Ei) < Cardd(Ej). If the cardinality of element Ei is lower than the cardinality of element Ej, we can distinguish between three different scenarios w.r.t. the hit ratio hd: (1) if hd = 1.0, then, on average, every Ei node finds an Ej node as join partner, e.g., for book/title, every book node has a related title node; (2) if hd > 1.0, then, on average, every Ei node has more than one Ej node as matching join partner, e.g., in article//author, every article node has on average more than one author node as descendant node; and finally (3) if hd < 1.0, then, on average, less than one Ei node finds an Ej node as join partner, e.g., for book[URL], not every book node has a URL child node.

Scenario 1: Cardd(Ei) = Card^out_d(sl). Figure 3 shows the experimental results for hd = 1.0 regarding the input ratio, the selectivity, and the RPG of the StackTree operator. Up to a selectivity of around 19 %, the StackTree operator has an RPG value of less than 1.0. This means that the TwigOptimal operator outperforms the StackTree operator as long as the selectivity is below this margin. In turn, if we have a selectivity larger than 19 %, the StackTree operator outdistances the TwigOptimal operator. For example, for σd = 0.37 and rd = 0.37, we get an RPG value of 1.03; and for σd = 0.60 and rd = 0.69, we find an RPG value of 1.105, which results in 10.5 % longer execution time for the TwigOptimal operator. Based on this analysis, we derive function fs1 describing
Fig. 3. Scenario 1: Cardd(Ei) = Card^out_d(sl). [Surface plot of the relative performance gain of StackTree as a function of the selectivity (in %) and the input ratio.]

Fig. 4. Scenario 2: Cardd(Ei) < Card^out_d(sl). [Surface plot of the relative performance gain of StackTree as a function of the selectivity (in %) and the hit ratio.]
the approximated RPG value for 0 ≤ σd ≤ 1.0 for the StackTree operator3:

fs1(σd) = 0.16 · σd + 0.967714

Scenario 2: Cardd(Ei) < Card^out_d(sl). Figure 4 depicts the experimental results for hd > 1.0. In this case, the TwigOptimal operator outperforms the StackTree operator up to a selectivity of approximately 17 %. If both operators face a selectivity larger than 17 %, then the RPG value of StackTree is greater than 1.0, i.e., it outflanks the TwigOptimal operator. Based on our experimental results, function fs2 provides the RPG value w.r.t. 0 ≤ σd ≤ 1.0 for the StackTree operator:

fs2(σd) = 0.06 · σd + 0.988538

Scenario 3: Cardd(Ei) > Card^out_d(sl). Compared to the former two scenarios, the analysis of case hd < 1.0 shows an ambiguous picture: for selectivities less
Please note, the RPG value for TwigOptimal can be calculated using fs1(σd)^(−1).
Fig. 5. Scenario 3: Cardd(Ei) > Card^out_d(sl). [Surface plot of the relative performance gain of TwigOptimal as a function of the selectivity (in %) and the hit ratio.]
than 0.0018 % and hd < 0.022, the StackTree operator performs badly, resulting in an RPG value for the TwigOptimal operator of 1.44, i.e., the SJ operator needed 44 % more time for executing the query. For a selectivity of less than 0.00018 %, we get an RPG value of 1.145, and for selectivities between 0.00018 % and 86.0 %, this value decreases to an RPG value of 1.03. In other words, on average, the supremacy of the TwigOptimal operator over the StackTree operator decreases with increasing selectivity. Even though Fig. 5 reveals that – between a selectivity of 37 % and 69 % – the StackTree operator marginally outperforms the TwigOptimal operator, we believe that a robust cost formula should not take this special situation into account. Ergo, we assume a linear decrease of the RPG value instead. Consequently, we get an approximation of the RPG value w.r.t. 0 ≤ σd ≤ 1.0 described by function fs3 for the TwigOptimal operator:

fs3(σd) = −0.18 · σd + 1.08373

Case 2: Cardd(Ei) ≥ Cardd(Ej)

Scenario 4: If the cardinality of element Ei is larger than or equal to the cardinality of element Ej, our experiments did not reveal any break-even points. In this situation, the TwigOptimal operator outperforms the SJ operator by up to 70 % (σd = 0.018). Even though the RPG value of the SJ operator decreases with increasing selectivity, it still remains very high; e.g., for a selectivity of 70.8 %, we still get an RPG value of 1.61, resulting in a 61 % longer execution time for the StackTree operator. Based on our experimental data, the RPG value of the TwigOptimal operator (for 0 ≤ σd ≤ 1.0) is 1.595. We believe that it is not beneficial to use this value as a constant in a cost formula, because it could overweight the costs of location steps or predicate steps matching this scenario – especially in long path expressions. Consequently, we only assume a slightly better RPG value for the TwigOptimal operator, resulting in function fs4:

fs4(σd) = 1.003
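Putting the four scenarios together, the empirically derived weighting functions fs1, . . . , fs4 can be transcribed directly; the scenario classification below follows the case analysis above (the packaging into a helper function is our own illustration):

```python
def scenario(card_Ei, card_Ej, hit_ratio):
    """Classify a location or predicate step into one of the four scenarios."""
    if card_Ei >= card_Ej:
        return 4                      # Case 2
    if hit_ratio == 1.0:
        return 1                      # Case 1, Card_d(Ei) = Card^out_d(sl)
    return 2 if hit_ratio > 1.0 else 3

def rpg_weight(scen, sigma):
    """f_s1 .. f_s4: approximated RPG value for selectivity 0 <= sigma <= 1.0.
    Scenarios 1 and 2 are defined w.r.t. StackTree, 3 and 4 w.r.t. TwigOptimal."""
    if scen == 1:
        return 0.16 * sigma + 0.967714
    if scen == 2:
        return 0.06 * sigma + 0.988538
    if scen == 3:
        return -0.18 * sigma + 1.08373
    return 1.003
```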
3.3
Conclusions of the Cost Analysis
Even though there are many other XML documents we did not consider for the cost analysis, we believe that the formulae for estimating the RPG value are universal. This is true because the inputs for both operators are provided by accessing an element index (Sect. 2), which clusters element nodes by their element names. Since this data structure abstracts from different document characteristics (e.g., degree of recursiveness and regularity), and because we explore four scenarios, which are relevant for all types of documents, functions fs1, . . . , fs4 reflect the estimated RPG value of both join operators on arbitrary XML documents. Therefore, cost formulae based on this analysis can serve as an initial cost model.
4
The Cost Model
Evaluating a path expression using an SJ or HTJ operator requires the XDBMS first to access all relevant streams of element nodes. In general, the costs for accessing all instances of element node E are calculated according to the following formula:

Costaccess(E) = IO cost + w ∗ CPU cost

The weighting factor w allows the database administrator to fine-tune the cost formulae to CPU-bound or IO-bound hardware settings. The IO costs are determined by the number of data pages that have to be loaded into a cold database buffer. On the other hand, we model the computational costs as a function of the number of input nodes to be processed. In Sect. 2, we briefly introduced the two different access methods used in our system. For accessing all element nodes with name E using a document index, we first have to reach the left-most data page by descending the tree structure; therefore, we have to access hd pages. Next, we have to scan every data page to find all element nodes E. Let Pd be the number of data pages allocated by document d; then we define the costs for accessing all element nodes E using the document index as4:

Costaccessdix(E) = hd + (Pd − 1) + w ∗ Cardd(E),

where hd + (Pd − 1) is the IO cost and Cardd(E) is the CPU cost. On the other hand, accessing all elements with name E using an element index requires us first to find the corresponding key in the name directory (hnd). Next, we have to reach the left-most data page holding the corresponding records by descending the node-reference index (hnr). Finally, we need to scan all data pages (Pe) containing E-node records. This results in the following cost formula:

Costaccesseix(E) = hnd + hnr + (Pe − 1) + w ∗ Cardd(E),

where hnd + hnr + (Pe − 1) is the IO cost and Cardd(E) is the CPU cost.
Please recall, the function Cardd (E) returns the total number of occurrences of element E in document d.
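For reference, the two access-cost formulae translate directly into code; the sketch below (with parameter names of our choosing) simply evaluates them for a given element:

```python
def cost_access_document_index(h_d, P_d, card_E, w):
    """Access all E nodes via the document index: descend h_d pages, scan the
    remaining (P_d - 1) data pages, and process Card_d(E) nodes (CPU part)."""
    io_cost = h_d + (P_d - 1)
    cpu_cost = card_E
    return io_cost + w * cpu_cost

def cost_access_element_index(h_nd, h_nr, P_e, card_E, w):
    """Access all E nodes via the element index: look up the name directory,
    descend the node-reference index, and scan its P_e data pages."""
    io_cost = h_nd + h_nr + (P_e - 1)
    cpu_cost = card_E
    return io_cost + w * cpu_cost
```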
Table 2. Cost formulae for StackTree and TwigOptimal

Scenario      Cost formula
k ∈ {1, 2}    CostST(Ei, Ej, σd) = [Costaccess(Ei) + Costaccess(Ej)] · fsk(σd)^(−1)
l ∈ {3, 4}    CostST(Ei, Ej, σd) = [Costaccess(Ei) + Costaccess(Ej)] · fsl(σd)
k ∈ {1, 2}    CostTO(Ei, Ej, σd) = [Costaccess(Ei) + Costaccess(Ej)] · fsk(σd)
l ∈ {3, 4}    CostTO(Ei, Ej, σd) = [Costaccess(Ei) + Costaccess(Ej)] · fsl(σd)^(−1)

In Sect. 3.2, we had a closer look at the relative execution time of StackTree and TwigOptimal. Based on this analysis, we can now derive the corresponding cost formulae. We assume for both operators a sort-merge join semantics, i.e., we estimate the costs for evaluating a structural predicate by the sum of access costs. To pay attention to additional computational overhead resulting from the maintenance of stacks (StackTree and TwigOptimal) and cursor movements (TwigOptimal), we multiply these sums by correcting weights. For both operators, functions fs1, . . . , fs4 provide the weights for the four scenarios. Table 2 shows the definitions of the relative CPU costs for StackTree and TwigOptimal. Using these formulae, we can now generalize our approach for estimating the costs of complete path expressions. Instead of using the total access costs (Costaccess(E)) for the calculation of access costs of inner-operator-tree SJ or HTJ operators, we use estimated cardinalities provided by an existing cardinality estimation framework, e.g., [10, 13, 1]. For every location step or predicate step consisting of element nodes E1, . . . , En (and corresponding selectivities σd1, . . . , σdn−1) of a path expression, we apply the appropriate cost formulae from Table 2. By adding up the numbers, we get the total CPU costs for StackTree (CostCPU,ST) and TwigOptimal (CostCPU,TO), respectively:

CostCPU,ST(E1, . . . , En, σd1, . . . , σdn−1) = Σ_{i=1}^{n−1} CostST(Ei, Ei+1, σdi)

CostCPU,TO(E1, . . . , En, σd1, . . . , σdn−1) = Σ_{i=1}^{n−1} CostTO(Ei, Ei+1, σdi)
5
Empirical Evaluation
To evaluate the quality of our empirically derived cost model and to show its independence of various document characteristics (e.g., degrees of recursion or regularity), we performed several experiments using the same hardware setup as described in Sect. 3.1. For all experiments, we employed functions fs1, . . . , fs4 for calculating the correcting weights for the cost formulae. The results presented are average values over five runs on a cold database buffer. For every query and both operators, we used element index accesses to provide the required element nodes for query evaluation. For all experiments, we used pre-calculated statistical information on cardinalities and selectivities. On average, all queries required between 30 and 120 milliseconds for optimization. In query optimization, cost models can be used for two different classes of optimization strategies: top-down and bottom-up strategies. Top-down strategies like Simulated Annealing [7] always compare two query execution plans (QEPs)
w.r.t. their global costs. Bottom-up strategies like the seminal dynamic programming algorithm of System R [12], in contrast, make a local optimality assumption and consider only alternative subtrees with the lowest cost as building blocks for consecutive levels in the QEP, i.e., locally expensive subtrees are pruned. For every query, we enumerated and evaluated all possible join orders5, including (partially-)fused subtrees with SJ and HTJ operators. The vertical lines in Fig. 6 and 7 illustrate the spectrum of possible runtimes bounded by the fastest (Best Plan) and the slowest plan (Worst Plan). For the top-down scenario, we randomly chose a QEP (Random Plan) and compared it with the cheapest plan estimated by the cost model (Cheapest Plan) presented in Sect. 3. In bottom-up scenarios, plans denoted as Reordering and Join Fusion are the cheapest plans according to the cost model when the query optimizer is enabled to perform cost-based SJ reordering and join fusion6. On the other hand, if the optimizer is only permitted to perform cost-based join fusion, the cheapest plans are referred to as Only Join Fusion. Finally, if the query optimizer could perform cost-based SJ reordering and join fusion and rolled a die at each local optimization step, the resulting plans are called Random-decision Plans.

First, we evaluated 36 different path expressions (Table 3) on a 100 MB XMark document [11] (scaling factor 1.0) with varying combinations of low (σd ≤ 0.33), medium (0.33 < σd ≤ 0.67), and high (0.67 < σd ≤ 1.0) selectivities as well as with different combinations of / and // axes.

Table 3. XMark workload

Name  Query
X0    //category/description/text
X1    //category/description//text
X2    //category//parlist/listitem
X3    //category//parlist//text
X4    //description/text/bold
X5    //europe/item[.//mail]
X6    //person[.//age]/creditcard
X7    //person[.//age][.//education]
X8    //listitem[parlist/listitem]
X9    //listitem[parlist//text]
X10   //category[.//listitem]/name
X11   //category[.//listitem][.//text]
X12   //namerica/item/mailbox
X13   //namerica/item//mail
X14   //annotation//description/parlist
X15   //annotation//description//listitem
X16   //person[homepage]/name
X17   //person[phone][.//province]
X18   //person[.//watch]/name
X19   //person[.//watch][.//province]
X20   //namerica/item[location]
X21   //person[profile][.//interest]
X22   //item[.//mail/to]
X23   //item[.//mail][.//text]
X24   //people/person/emailaddress
X25   //people/person//street
X26   //open_auction//bidder/date
X27   //site//open_auction//increase
X28   //mailbox/mail/text
X29   //closed_auction[annotation]//text
X30   //site//mail/text
X31   //closed_auction[.//description]//text
X32   //people/person/name
X33   //people/person[.//street]
X34   //site//closed_auction/annotation
X35   //parlist[.//text]//keyword

Figure 6(a) and 6(b) illustrate the corresponding experimental results. In top-down scenarios (Fig. 6(a)), the cost model never recommended plans that turned out to be the worst plan after execution.
5 Please note, for the nasa, treebank, and psd queries, we randomly chose only five QEPs. Consequently, Fig. 7(a) and 7(b) show only the best and worst plan in this interval, which may be local minima resp. maxima.
6 This strategy fuses as many adjacent structural join operators as possible into a complex n-way join operator, i.e., as long as using TwigOptimal is cheaper than StackTree, and it only uses StackTree for the remaining parts of the query.
Fig. 6. Empirical results for the XMark workload: (a) top-down scenarios; (b) bottom-up scenarios. [Average query evaluation time in ms (log scale) for queries X0–X35; (a) compares the [Best Plan, Worst Plan] interval with Left-deep SJ, Single HTJ, Cheapest Plan, and Random Plan, while (b) compares it with Reordering and Join Fusion, Only Join Fusion, and Random-decision Plan.]
On average, cheapest plans took 1.49 times longer than best plans but, in turn, worst plans lasted on average 4.6 times (6.3 times) longer than cheapest (best) plans. With a probability of 81 %, the cheapest plan is faster than the randomly selected plan. Furthermore, randomly selected plans require on average 2.22 times more time for execution than cheapest plans. Figure 6(a) also shows how SJ plans without optimization (Left-deep SJ) and HTJ-only plans (Single HTJ) fared. Left-deep SJ plans took on average 5.1 times longer than best plans and, therefore, turned out to be very expensive. On the other hand, single HTJ plans – as expected – performed very well (they took on average 2.11 times longer than best plans). Nevertheless, for queries X3 and X30, they were the worst choices. In summary, our cost model chooses, in top-down scenarios, in almost all cases an at least near-optimal QEP and fulfils its major goal – cutting off worse plans – very well. Figure 6(b) shows the experimental results for bottom-up scenarios. Plans found using cost-based SJ reordering and join fusion (Reordering and Join Fusion) require 2.59 times more time than best plans. If we omit join reordering (Only Join Fusion), we can improve this factor to 2.51. In contrast, worst plans take on average 3.65 times (3.84 times) longer than optimal plans according to the cost model. Even though the local optimality assumption does not always hold, which is also true for relational optimizers, an optimized plan is faster than a so-called Random-decision Plan with a probability of nearly 60 %; in a Random-decision Plan, at each stage of the optimization process a subtree is randomly chosen out of n alternatives, where n is in most cases larger than 2. Since bottom-up strategies – compared to top-down strategies – explore the complete search space, the cost model provides a fairly good means for pruning expensive sub-plans and, with a probability of 75 %, does not choose an extremely expensive QEP. In rare cases, the cost model recommended a bad QEP, e.g., for query X5. In this case, the local optimality assumption did not hold and constrained the optimizer to take the wrong turn – a phenomenon that is also common in relational query optimization. The second experiment encompasses more complex queries on the regular nasa document (≈ 26 MB), the highly recursive treebank document (≈ 82 MB), and the very large psd7003 document (≈ 680 MB)7. Table 4 shows the workload for the three documents, referred to as N0–N2, T0–T2, and P0–P2.
For further information on these documents, see: http://www.cs.washington.edu/research/xmldatasets/
Table 4. Nasa, treebank, and psd workload

Name  Query
N0    //history[ingest]//revision/creator
N1    //datasets[.//descriptions/details]//observatory
N2    //datasets/dataset/tableHead//field/definition/footnote
T0    /FILE/PP//NP
T1    //SBAR//S//VP/PP//NP
T2    //X//SINV[.//NP]/S/VP//NP//NN
P0    //reference//xrefs/xref/db
P1    //ProteinEntry[header][accession]/reference/refinfo
P2    //ProteinEntry[.//refinfo/authors/author][genetics/gene/db]/organism
Fig. 7. Empirical results for nasa, treebank, and psd7003 workloads: (a) top-down scenarios; (b) bottom-up scenarios. [Average query evaluation time in ms (log scale) for queries N0–N2, T0–T2, and P0–P2; (a) compares the [Best Plan, Worst Plan] interval with Cheapest Plan and Random Plan, while (b) compares it with Reordering and Join Fusion, Only Join Fusion, and Random-decision Plan.]
Figure 7 illustrates the corresponding experimental results. On average, the worst plan required 18.29 times more time for execution than the best plan in the sample. In top-down scenarios, the cheapest plan was in 8 out of 9 cases faster than a randomly selected plan. The cheapest plan took on average 1.18 times longer than the best plan; on the other hand, worst plans took on average 49.09 times longer than the corresponding cheapest plans. It is worth noting that the worst plan for query T0 needed 67 seconds for execution, compared to 0.19 seconds for its cheapest alternative. For the bottom-up scenarios under test, an optimized plan was in 7 out of 9 cases faster than a Random-decision Plan. On average, a query plan optimized using Reordering and Join Fusion resp. Only Join Fusion required 1.28 resp. 1.27 times longer than the best plan in the sample. Moreover, worst plans needed on average 47.11 resp. 51.31 times longer than optimized plans.
6
Conclusions and Future Work
Path expressions are an important concept in prominent XML query languages like XPath or XQuery for structurally qualifying (sub-)trees in XML documents
for further processing. In this work, we introduced a set of cost formulae for two important SJ and HTJ operators that support an XML query optimizer in finding an at-least near optimal operator fitting for path expressions. It is true that our cost model is empirically derived using a single XML document. Nevertheless, the experiments revealed that (1) our cost formulae describe document-independent characteristics of StackTree and TwigOptimal, and (2) the cost model is not tailored to the document used for the cost analysis. We showed that we can profitably use our cost model for optimizing queries on completely different documents (XMark, nasa, and psd7003 ) – even for the most exotic ones like the highly recursive treebank document. Since a similar analysis can be done for an arbitrary native XDBMS, our approach is system-independent and can work together with arbitrary cardinality-estimation frameworks. Even though we do not accomplish (and probably will never) a prediction rate of 100 %8 , our cost formulae can be used as a default setting for the cost model of an XML query optimizer. Moreover, we believe that it can be further refined using statistical learning techniques or feedback-loop approaches well-known from relational query optimizers. Our future work will focus on the comparison of further physical join operators and index access operators and the development of appropriate cost formulae for a complete cost-based XML query optimization framework [14].
References 1. Aguiar Moraes Filho, J., H¨ arder, T.: EXsum—An XML Summarization Framework. In: Proc. IDEAS Conference, pp. 139–148 (2008) 2. Al-Khalifa, S., Jagadish, H.V., Patel, J.M., Wu, Y., Koudas, N., Srivastava, D.: Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In: Proc. ICDE Conference, pp. 141–154 (2002) 3. Balmin, A., Eliaz, T., Hornibrook, J., Lim, L., Lohman, G.M., Simmen, D.E., Wang, M., Zhang, C.: Cost-based Optimization in DB2 XML. IBM Systems Journal 45(2), 299–320 (2006) 4. Bruno, N., Koudas, N., Srivastava, D.: Holistic Twig Joins: Optimal XML Pattern Matching. In: Proc. SIGMOD Conference, pp. 310–321 (2002) 5. H¨ arder, T., Haustein, M.P., Mathis, C., Wagner, M.: Node Labeling Schemes for Dynamic XML Documents Reconsidered. Data & Knowledge Engineering 60(1), 126–149 (2007) 6. Haustein, M., H¨ arder, T.: An Efficient Infrastructure for Native Transactional XML Processing. Data & Knowledge Engineering 61(3), 500–523 (2007) 7. Ioannidis, Y.E., Wong, E.: Query Optimization by Simulated Annealing. In: Proc. SIGMOD Conference, pp. 9–22 (1987) 8. Jiang, H., Wang, W., Lu, H., Yu, J.X.: Holistic Twig Joins on Indexed XML Documents. In: Proc. VLDB Conference, pp. 273–284 (2003) 9. McHugh, J., Widom, J.: Query Optimization for XML. In: Proc. VLDB Conference, pp. 315–326 (1999) 8
Surprisingly, no success figures are published so far for prediction rates, even for those of the much simpler relational cost models.
10. Polyzotis, N., Garofalakis, M.N.: Structure and Value Synopses for XML Data Graphs. In: Proc. VLDB Conference, pp. 466–477 (2002) 11. Schmidt, A., Waas, F., Kersten, M.L., Carey, M.J., Manolescu, I., Busse, R.: XMark: A Benchmark for XML Data Management. In: Proc. VLDB Conference, pp. 974–985 (2002) 12. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access Path Selection in a Relational Database Management System. In: Proc. SIGMOD Conference, pp. 23–34 (1979) 13. Wang, W., Jiang, H., Lu, H., Yu, J.X.: Bloom Histogram: Path Selectivity Estimation for XML Data with Updates. In: Proc. VLDB Conference, pp. 240–251 (2004) 14. Weiner, A.M.: Framework-Based Development and Evaluation of Cost-Based Native XML Query Optimization Techniques. Appears in: Proc. VLDB PhD Workshop (2009) 15. Weiner, A.M., Mathis, C., H¨ arder, T.: Rules for Query Rewrite in Native XML Databases. In: Proc. EDBT DataX Workshop, pp. 21–26 (2008) 16. Wu, Y., Patel, J., Jagadish, H.: Structural Join Order Selection for XML Query Optimization. In: Proc. ICDE Conference, pp. 443–454 (2003) 17. Zhang, N., Haas, P.J., Josifovski, V., Lohman, G.M., Zhang, C.: Statistical Learning Techniques for Costing XML Queries. In: Proc. VLDB Conference, pp. 289–300 (2005)
Approximate Rewriting of Queries Using Views

Foto Afrati1, Manik Chandrachud2, Rada Chirkova2, and Prasenjit Mitra3

1 National Technical University of Athens
[email protected]
2 Computer Science Department, NC State University, Raleigh, NC USA
[email protected], [email protected]
3 College of Information Sciences and Technology, Pennsylvania State University, University Park, PA USA
[email protected]

Abstract. We study approximate, that is contained and containing, rewritings of queries using views. We consider conjunctive queries with arithmetic comparisons (CQACs), which capture the full expressive power of SQL select-project-join queries. For contained rewritings, we present a sound and complete algorithm for constructing, for CQAC queries and views, a maximally-contained rewriting (MCR) all of whose CQAC disjuncts have up to a predetermined number of view literals. For containing rewritings, we present a sound and efficient algorithm pruned-MiCR, which computes a CQAC containing rewriting that does not contain any other CQAC containing rewriting (i.e., computes a minimally containing rewriting, MiCR) and that has the minimum possible number of relational subgoals. As a result, the MiCR rewriting produced by our algorithm may be very efficient to execute. Both algorithms have good scalability and perform well in many practical cases, due to their extensive pruning of the search space, see [1].
1
Introduction
Rewriting queries using views and then executing the rewritings to answer the queries is an important technique used in data warehousing, information integration, query optimization, and other applications, see [2, 3, 4, 5, 6] and references therein. A large amount of work has been done on obtaining equivalent rewritings of queries, that is, rewritings that can be used to derive exact query answers (see, e.g., [7, 8, 9]). When equivalent rewritings cannot be found, then in many applications it makes sense to work with contained rewritings, which return a subset of the set of the query answers. Of special interest in this context are maximally contained rewritings (MCRs), which can be used to obtain a maximal subset of the query answers that can be obtained using the given views (see, e.g., [10, 4, 11, 12, 13]). In addition, in applications such as querying the World-Wide Web, mass marketing, searching for clues related to terrorism suspects, or peer data-management systems (see, e.g., [14, 15]), users prefer to get a superset of the query answers, rather than getting no answers at all (when no equivalent or contained rewritings exist). In such scenarios, users might be interested in containing rewritings, which return a superset of the set of the query J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 164–178, 2009. c Springer-Verlag Berlin Heidelberg 2009
answers. Minimally containing rewritings (MiCRs) [16, 17, 18] are the containing rewritings that return the fewest false positives when answering the query. In this paper we study maximally contained and minimally containing rewritings of queries using views, which we refer to collectively as approximate rewritings. We focus on conjunctive queries with arithmetic comparisons (CQACs), that is on the language capturing the full expressive power of practically important SQL select-project-join (SPJ) queries. (The well-understood language of conjunctive queries [19] does not capture the in- or non-equalities that are characteristic of SQL SPJ queries.) Specifically, we assume CQAC queries and views, and consider CQAC rewritings, possibly with unions (UCQACs). The well-studied (for conjunctive queries and views) problems of finding equivalent rewritings and MCRs are recognized as being significantly more complex for CQACs, with many practically important cases still unexplored [4, 13]. The complexity of the problems in the presence of ACs is mainly due to the more complex containment test — the containment test is NP-complete in the case of CQs [19] but Π2P -complete [4, 20] in the case of CQACs. We illustrate the challenge by an example from [12]. Example 1. Consider CQAC query Q and CQAC view V , both defined using binary predicate p, as well as a CQAC query R defined in terms of the view V . Let Q() :- p(A, B), A ≤ B; V () :- p(X, Y ), p(Y, X); and R() :- V (). Here, R is a contained rewriting of Q; this containment can be verified using the containment tests of [4, 20] (see Sect. 2). Observe that the containment cannot be established using a single containment mapping [19] from Q to the expansion of R. Some of the authors of this paper presented in [9] a sound and complete algorithm that returns a UCQAC equivalent rewriting of the input CQAC query in terms of the input CQAC views. In this paper we focus on those problem settings where one is to find a rewriting of a given CQAC query in terms of given CQAC views, but an equivalent UCQAC rewriting does not exist, and thus the algorithm of [9] returns no answer. Further, Deutsch, Ludaescher, and Nash [16] provided approaches for solving the problem of rewriting queries using views with limited access patterns under integrity constraints, focusing on queries, views, and constraints over unions of conjunctive queries with negation. We comment on the contributions of [16] w.r.t. the algorithms that we propose in this paper when discussing our specific contributions. The specific contributions presented in this paper are as follows: 1. Contained rewritings: Pottinger and Halevy developed algorithm MiniCon IP [12], which efficiently finds UCQAC MCRs for special cases of CQAC queries, views, and rewritings, specifically for those cases where the “homomorphism property” [21, 22] holds between the expansions of the rewritings and the query.1 At the same time, MiniCon IP cannot find the rewriting R 1
for the problem input of Example 1. We present a sound and complete algorithm called Build-MaxCR for constructing a UCQAC size-limited MCR (that is, an MCR that has up to a predetermined number of view literals) of arbitrary CQAC queries using arbitrary CQAC views. (Specifically, Build-MaxCR can find the rewriting R of Example 1.) The size-limit restriction of Build-MaxCR is due to the fact that for CQAC queries and views, a view-based UCQAC MCR may have an unbounded number of CQAC disjuncts, see Example 2. To the best of our knowledge, the approaches of [16] do not provide for constructing size-limited contained rewritings of the input queries using views, which are addressed by our algorithm Build-MaxCR.

2. Containing rewritings: We focus on the problem of enabling a MiCR of a CQAC query using CQAC views to be executed as efficiently as possible. To that end, we look at minimizing the number of relational subgoals of a given MiCR. Our main contribution is a sound and efficient algorithm that we call pruned-MiCR. Given a CQAC MiCR for a given problem input (CQAC query and views), pruned-MiCR performs global minimization of the MiCR, and in many cases produces MiCR formulations whose evaluation costs are significantly lower than those of the (MiCR) input to the algorithm. To the best of our knowledge, other approaches for MiCRs [16, 17, 18] do not involve minimization of the number of relational subgoals of the MiCRs.

3. Reducing runtime of containment checking: Finally, we study the problem of reducing the runtime of containment checking between two CQAC queries, and propose a runtime-reduction technique that takes advantage of some attributes drawing values from disjoint domains. (Intuitively, it does not make sense to compare the values of, e.g., attributes "price" and "name".) This technique can be used in a variety of algorithms. Specifically, it is applicable to our proposed algorithms Build-MaxCR and pruned-MiCR. Due to the space limit, this result (as well as our NP-completeness result for the problem of determining whether a CQAC containing rewriting exists for a given CQAC problem input, see Table 1) is omitted from this paper but can be found in the full version [1] of our paper, available online.

Table 1 gives a summary of our results and contributions. Due to the space limit, we present here only a foundational exposition of our algorithms. The full version [1] of this paper provides all the details as well as our experimental results. While the running-time complexity of our proposed algorithms is high in the worst case (doubly exponential for algorithm Build-MaxCR, and singly exponential for algorithm pruned-MiCR), our experimental results indicate that both algorithms have good scalability and perform well in many practical cases, due to their extensive pruning of the search space.

Related Work
The problem of using views in query answering [7] is relevant in applications in information integration [4], data warehousing [10], web-site design [23], and query
Table 1. Our contributions, previous work, and applications

                 Contained Rewritings                  Containing Rewritings
  Decidability   UCQAC size-limited MCR for CQACs      UCQACs with negation [16]
  Complexity     CQ: NP [7]                            CQAC homomorphism property: NP-complete [1]
  Algorithms     Size-limited UCQAC MCRs for CQACs     Global minimization of MiCR for CQACs
  Previous Work  MCR [10, 13]                          MiCR [16, 17, 18]
  Applications   Data warehousing, security, privacy   Mass marketing, P2P, information retrieval
optimization [6, 7, 24]. Algorithms for finding rewritings of queries using views include the bucket algorithm [17, 25], the inverse-rule algorithm [26, 27, 28], the MiniCon algorithm [12], and the shared-variable-bucket algorithm [11]; see [10] for a survey. Almost all of the above work focuses on investigating MCRs or equivalent rewritings [4, 8], as it takes its motivation mostly from information integration and query optimization. Query-rewriting algorithms depend upon efficient algorithms for checking query containment. Existing work on query containment shows that adding arithmetic comparisons to queries and views makes these problems significantly more challenging [29, 21, 20].

Since we consider rewritings that may return false positives or false negatives, our work has similarities with approximate answering of queries using views, see [30, 31, 32, 33] and references therein, as well as a detailed discussion in [1]. Approximate query answering is useful when exact query answers cannot be found, and the user would rather have a good-quality approximate answer returned by the system. Our approaches provide such approximate answers in the form of maximally contained or containing rewritings. The problem of finding containing rewritings of queries using views has been studied in [17] and in [16, 18]. Please see the beginning of Sect. 1 for a detailed discussion of the work of [16].

Other related work includes the results of Rizvi et al. [34], where query-rewriting techniques are used for access control, and the work of Miklau et al. [35], which contains a formal probabilistic analysis of information disclosure in data exchange under the assumption of independence among the relations and data in a database. Related work in security and privacy includes [36]. Calvanese et al. [37] discussed query answering, rewriting, and losslessness with respect to two-way regular path queries. In our work, we concentrate only on query rewritings.
2 Preliminaries
In this section we review some standard concepts related to answering queries using views, and introduce some notation that we will use throughout the paper.
2.1 Queries, Containment, and Views
We consider conjunctive queries with arithmetic comparisons (CQACs), that is, SQL select-project-join queries with equality and arithmetic-comparison selection conditions. Each arithmetic comparison (AC) subgoal is of the form X θ Y or X θ c (we use uppercase letters to denote variables and lowercase letters for constants), where the comparison operator θ is one of <, ≤, >, ≥, and =. We assume that database instances are over densely totally ordered domains. A variable is called distinguished if it appears in the query head. In the rest of the paper, for a query Q we denote the conjunction of all relational subgoals in Q as Q0 and the conjunction of all ACs in Q as β. All the queries we consider are safe, that is, each distinguished variable or variable appearing in the β of the query also appears in at least one relational subgoal of the query.

Definition 1 (Query containment). A query Q1 is contained in a query Q2, denoted Q1 ⊑ Q2, if and only if, for all databases D, the answer Q1(D) to Q1 on D is a subset of the answer Q2(D) to Q2 on D, that is, Q1(D) ⊆ Q2(D).

Chandra and Merlin [19] have shown that a CQ Q1 is contained in another CQ Q2 of the same (head) arity if and only if there exists a containment mapping from Q2 to Q1. The containment mapping is a (body) homomorphism h from the variables of Q2 to the variables and constants of Q1 and from the constants of Q2 to themselves, that is, for each subgoal p(Z1, . . . , Zn) of Q2 it holds that p(h(Z1), . . . , h(Zn)) is a subgoal of Q1. In addition, for h to be a containment mapping from Q2 to Q1, it must be that the list (X1, . . . , Xk) of the variables and constants in the head of Q1 is (h(Y1), . . . , h(Yk)) (that is, Xi = h(Yi) for all i ∈ {1, . . . , k}), where Q2(Y1, . . . , Yk) is the head of Q2.

The containment test for CQACs is more involved. There are two ways to test the containment of CQAC Q1 in CQAC Q2 [29, 21]. We describe them briefly here; for more details see, e.g., [38]. The first test uses the notion of a canonical database: For each relational subgoal pi(X̄i) in Q, a canonical database for Q contains one tuple t in the base relation pi, such that t is the list of "frozen" variables and constants from X̄i (i.e., in forming t each variable in X̄i is "frozen" to a unique constant, except that equated variables are frozen to the same constant, and each constant in X̄i is kept as it is). We define one canonical database for each total ordering of the variables and constants in Q1 that satisfies the ACs in Q1. The test says that Q1 is contained in Q2 if and only if Q2 computes, on all the canonical databases of Q1, all the head tuples of Q1.

The second containment test, see Theorem 1, uses the notion of a normalized version of a CQAC query. An equivalent normalized version [21, 39] Q′ of a CQAC query Q does not have constants or repetitions of variable names in relational subgoals and has compensating built-in equality conditions.

Theorem 1. For CQAC queries Q1 and Q2, Q1 ⊑ Q2 iff implication φ holds:
φ : β1 ⇒ μ1(β2) ∨ . . . ∨ μk(β2)
where the μi's are all the containment mappings from Q2 to Q1 and βi is a conjunction of all the ACs in Qi, i ∈ {1, 2}. That is, the ACs in the normalized version Q1′ of Q1 logically imply (denoted "⇒") the disjunction of the images of the ACs of the normalized version Q2′ of Q2 under each mapping μi. If there exists a containment mapping μi such that the right-hand side of φ is reduced to only one μi(β2), then we say the homomorphism property holds between Q1 and Q2. Afrati et al. [38] showed that when the homomorphism property holds, the implication can be checked on the queries without normalizing them. Checking CQAC containment is less complex in that case, because we need to check for the existence of just one mapping that satisfies the implication.
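The Chandra-Merlin test for plain CQs can be implemented directly as a backtracking search for a containment mapping. The following sketch is our own illustration (not code from the paper); it models a query as a pair of a head-argument list and a body of relational subgoals, with variables written in uppercase and constants in lowercase, following the convention above.

def is_variable(term):
    return term[0].isupper()

def find_containment_mapping(q2, q1):
    """Search for a containment mapping from Q2 to Q1, i.e., a witness that Q1 is contained in Q2."""
    head2, body2 = q2
    head1, body1 = q1

    def extend(mapping, subgoals):
        if not subgoals:
            # the image of the head of Q2 must equal the head of Q1
            return all(mapping.get(y, y) == x for x, y in zip(head1, head2))
        (pred, args), rest = subgoals[0], subgoals[1:]
        for target_pred, target_args in body1:
            if target_pred != pred or len(target_args) != len(args):
                continue
            new_map = dict(mapping)
            consistent = True
            for a, t in zip(args, target_args):
                if is_variable(a):
                    if new_map.setdefault(a, t) != t:
                        consistent = False
                        break
                elif a != t:                      # constants must map to themselves
                    consistent = False
                    break
            if consistent and extend(new_map, rest):
                return True
        return False

    return extend({}, list(body2))

# Q1(X) :- p(X, X) is contained in Q2(A) :- p(A, B): the mapping A -> X, B -> X works
q1 = (["X"], [("p", ["X", "X"])])
q2 = (["A"], [("p", ["A", "B"])])
print(find_containment_mapping(q2, q1))          # True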
2.2 Rewriting Queries Using Views
We consider the problem of finding rewritings under the closed-world assumption (CWA) [8], where for a given database, each view instance stores exactly the tuples satisfying the view definition. In addition, we consider contained rewritings under the open-world assumption (OWA) [8, 25]. Here, the views are sound but not necessarily complete, that is, a view instance might store only some of the tuples satisfying the view definition. Suppose we are looking for an answer to query Q on database D, and our access to D is restricted to using a set of views V = {V1, . . . , Vm}. So instead of directly evaluating Q on D, we rewrite Q in terms of V and then evaluate the rewriting on D. We consider the following types of rewritings R of Q using V. Here, DV is the result of adding to database D the answers to views V on D.

Definition 2 (Rewritings).
1. a. (CWA) R is a contained rewriting of Q using V under the CWA iff R(DV) ⊆ Q(D) for all databases D.
   b. (OWA) R is a contained rewriting of Q using V under the OWA iff R(IV) ⊆ Q(D) for all databases D and view instances IV such that IV ⊆ DV.
2. (CWA) R is a containing rewriting of Q using V iff ∀ D : Q(D) ⊆ R(DV).
3. (CWA) R is an equivalent rewriting of Q using V iff ∀ D : Q(D) = R(DV).

Since the answer to a containing rewriting R on a database D must contain all tuples that occur in the answer to Q on D, containing rewritings make sense only when the views that are used in constructing the containing rewriting are complete. Hence, containing rewritings are considered only under the CWA and not under the OWA. The same is true for equivalent rewritings, since an equivalent rewriting of Q is a rewriting that is a contained as well as a containing rewriting of Q. At the same time, since the result of a contained rewriting is allowed to leave out some of the answers to Q, contained rewritings make sense under the CWA and under the OWA. Given a query Q and a set of views V, for deciding whether there exists a contained (or containing) rewriting of Q using V, we need to know the language in which we are allowed to construct rewritings. In the rest of the paper we
assume, unless otherwise stated, that the language of the rewritings for the existence problem is UCQACs. We define the expansion of a rewriting as follows:

Definition 3 (Expansion of rewriting). For a CQAC rewriting R that is expressed in terms of CQAC views V, an expansion Rexp of R is obtained by replacing each view subgoal in R by all the subgoals in the definition of that view. Each existentially quantified variable in the definition of a view in R is replaced by a unique variable in Rexp. For a UCQAC rewriting, the expansion is the union of the expansions of the CQACs that occur in that UCQAC.

The evaluation of contained rewritings cannot return false positives, the evaluation of containing rewritings cannot return false negatives, and the evaluation of equivalent rewritings cannot return either false positives or false negatives. We will use the term rewriting to mean a contained or a containing rewriting; we will specify the type whenever it is not obvious from the context. Theorem 2 is based on Definitions 2 and 3 and gives the tests for determining whether a CQAC rewriting R is a contained (or containing) rewriting of a CQAC query Q using CQAC views V.

Theorem 2. Let Q, V1, . . . , Vm be CQAC queries defined on database schema D, and let R be a CQAC rewriting of Q using {V1, . . . , Vm}. Then
1. R is a contained rewriting of Q if and only if Rexp ⊑ Q.
2. R is a containing rewriting of Q if and only if Q ⊑ Rexp.
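Definition 3 is mechanical enough to be captured in a few lines. The sketch below is our own modeling (names and data layout are ours, not the authors'): views and rewritings are triples of a head-argument list, a list of relational subgoals, and a list of comparison triples; head variables of a view are bound to the arguments used in the rewriting, and every other view variable receives a fresh name per view occurrence.

from itertools import count

_fresh = count()

def fresh_var():
    return f"_E{next(_fresh)}"                    # unique name for an existential variable

def expand(rewriting, views):
    head, body, comps = rewriting
    exp_body, exp_comps = [], list(comps)
    for view_name, args in body:
        v_head, v_body, v_comps = views[view_name]
        # head variables of the view are bound to the arguments used in R
        subst = dict(zip(v_head, args))
        def rename(term):
            if term in subst:
                return subst[term]
            if term[0].isupper():                 # unseen view variable: freshen it
                return subst.setdefault(term, fresh_var())
            return term                           # constants stay as they are
        exp_body += [(pred, [rename(t) for t in ts]) for pred, ts in v_body]
        exp_comps += [(rename(l), op, rename(r)) for l, op, r in v_comps]
    return head, exp_body, exp_comps

# Example 1 from Sect. 1: V() :- p(X, Y), p(Y, X) and R() :- V()
views = {"V": ([], [("p", ["X", "Y"]), ("p", ["Y", "X"])], [])}
R = ([], [("V", [])], [])
print(expand(R, views))   # body: p(_E0, _E1), p(_E1, _E0), i.e., the expansion of R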
3 Algorithm Build-MaxCR: Finding MCRs for CQACs
In this section we present a sound and complete algorithm, Build-MaxCR, for constructing a UCQAC size-limited maximally contained rewriting (i.e., an MCR with up to a predetermined number of view literals) of CQAC queries using CQAC views. We discuss the pseudocode and formulate the correctness results for the algorithm. These results resolve in the positive the problem of decidability of the existence of a UCQAC size-limited MCR for CQAC queries and views.

3.1 The Setting and Definitions
Suppose we are given a CQAC query Q and a set V of CQAC views, such that each of R1 and R2 is a CQAC contained rewriting of Q using V. It is easy to see that the union R1 ∪ R2 is also a contained rewriting of Q using V. This observation motivates us to consider the language of unions of CQAC queries for maximally contained rewritings of CQAC queries using CQAC views. Given a CQAC query Q and a set V of CQAC views, a UCQAC contained rewriting R of Q using V is a maximally contained rewriting (MCR) of Q using V in the language of UCQACs if for each UCQAC contained rewriting R′ of Q using V it holds that (R′)exp ⊑ Rexp. The first question we examine is whether such a UCQAC MCR is always bounded in size. Consider an example based on the ideas from [22].
Example 2. Let query Q and views V1 and V2 be defined as follows. Let Q() :- p(X, Y), p(Y, Z), s(Y), X ≥ 2, Z ≤ 7; let V1(L, M) :- p(L, M), L ≥ 2, M ≤ 7; and let V2(A, C) :- p(A, B), p(B, C), s(A), s(C). We can show that each of R3 and R4 is a CQAC contained rewriting of Q using V1 and V2. Here, R3() :- V1(L1, A1), V2(A1, C1), V1(C1, M2); and R4() :- V1(X, T1), V2(T1, T2), V2(T2, T3), V1(T3, Z). Further, one can use the template of R3 and R4 to build rewritings R5 (which has one extra V2 subgoal as compared to R4), R6 (two extra V2 subgoals), and so on. (See [1] for the details.) In the family of rewritings R = {R3, R4, R5, R6, . . .} that we build in this manner, each rewriting Ri (for i ≥ 3) has two properties:
– the expansion of Ri is contained in Q, and
– Ri (for i > 3) is not contained in Rj for any 3 ≤ j < i.
Therefore, a UCQAC maximally contained rewriting of Q in terms of {V1, V2} must include every Ri in the infinite-cardinality family R.

The point of Example 2 is that the number of CQAC disjuncts (such as the Ri's in the example) in the maximally contained UCQAC rewriting of a CQAC query using CQAC views may not be bounded, provided that the language of rewritings is UCQAC. Hence an algorithm for finding the UCQAC MCR may not terminate on some CQAC inputs. To address this problem, we introduce the concept of size-limited MCRs. Specifically, we define the problem of constructing a UCQAC size-limited MCR for a CQAC query using CQAC views. We use the following definition:

Definition 4 (A k-bounded (CQAC, UCQAC) query). Given a database schema V and a positive integer number k. (1) A CQAC query Q defined on V is a k-bounded (CQAC) query using V if, for the number n of relational subgoals of Q, we have n ≤ k. (2) Q = ∪i Qi is a k-bounded UCQAC query using V if each CQAC component Qi of Q is a k-bounded query using V.

Now, the problem of constructing a UCQAC size-limited (k-bounded) MCR for a CQAC query using CQAC views is specified as follows:
1. The problem input is a triple (Q, V, k), where Q is a CQAC query, V is a finite set of CQAC views, and k is a natural number.
2. The problem output is a UCQAC query P = ∪j Pj in terms of V, such that: (a) P exp is contained in Q, P exp ⊑ Q; (b) P is a k-bounded (UCQAC) query in terms of V; and (c) for each k-bounded UCQAC query R in terms of V such that Rexp ⊑ Q, we have that Rexp ⊑ P exp.

Our proposed algorithm Build-MaxCR solves the above problem for arbitrary inputs (Q, V, k) as defined in the problem formalization. Our soundness and completeness results for Build-MaxCR (Sect. 3.3) establish that for each such input (Q, V, k), Build-MaxCR returns a maximally contained rewriting of Q in the language of k-bounded UCQAC queries over V, if such a rewriting exists.
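Returning to Example 2 for a moment: the chain pattern behind the unbounded family R can be made concrete with a small generator. The sketch below is our own rendering of the pattern described in the example (the variable names are ours); it produces the body of each rewriting Ri as one V1 literal, then i − 2 chained V2 literals, then a closing V1 literal.

def R_body(i):
    """Body of the contained rewriting R_i (i >= 3) over views V1 and V2 of Example 2."""
    assert i >= 3
    vs = [f"T{j}" for j in range(i + 1)]                         # chained variables T0, ..., Ti
    body = [("V1", [vs[0], vs[1]])]
    body += [("V2", [vs[j], vs[j + 1]]) for j in range(1, i - 1)]
    body += [("V1", [vs[i - 1], vs[i]])]
    return body

print(R_body(3))   # [('V1', ['T0', 'T1']), ('V2', ['T1', 'T2']), ('V1', ['T2', 'T3'])]
print(R_body(4))   # same shape as R4 above, with i - 2 = 2 copies of V2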
3.2 Our Algorithm Build-MaxCR
We now discuss briefly our algorithm Build-MaxCR; please see [1] for the pseudocode and examples. The general idea of the algorithm is to do a complete enumeration of the CQ parts, call them P̄j, of k-bounded CQAC queries defined on schema V. (For a CQAC query R, we use the term "CQ part of R" to refer to the join of all relational subgoals of R, taken together with all the equality ACs implied by R.) For each such P̄j, the algorithm associates with P̄j a minimum set Sj of inequality/non-equality ACs on the variables and constants of P̄j, such that Sj ensures containment of P̄jexp & Sj in Q. The output of Build-MaxCR is the union P of all the CQAC queries P̄j & Sj for which the containment holds. (By [21], P̄jexp & Sj ⊑ Q for each j ensures P exp ⊑ Q, where P = ∪j (P̄j & Sj).)

The algorithm uses the notion of a "CQAC-rewriting template" for a problem input (Q, V, k). For an input of this form, Build-MaxCR enumerates all cross products, call them Pi, of up to k relational subgoals in terms of V. We call each Pi, with s ≤ k subgoals, a CQAC-rewriting template (for Q) of size s. Another notion used by the algorithm is that of a "MaxCR canonical database." Given query Q and its CQAC-rewriting template P (of some size s), the set DPQ of MaxCR canonical databases for Q and P is constructed in the same way as the set D(P exp) of canonical databases of the expansion P exp of P (see Sect. 2). The only difference is that the set W of constants and variables of P exp (W is used in the construction of D(P exp)) is extended, for the construction of DPQ, to include also all the numerical constants of the query Q.
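As a concrete and deliberately simplified rendering of the enumeration step only — the choice of equalities in the CQ part and of the AC sets Sj is handled separately by the algorithm — the following sketch (our own code, with hypothetical names) lists, up to reordering of literals, every multiset of up to k view literals over V, each literal taken with its own fresh argument variables.

from itertools import combinations_with_replacement

def rewriting_templates(view_arities, k):
    """Yield bodies of CQAC-rewriting templates of size 1..k over the view schema."""
    names = sorted(view_arities)
    for size in range(1, k + 1):
        for combo in combinations_with_replacement(names, size):
            body, next_var = [], 0
            for v in combo:
                args = [f"X{next_var + i}" for i in range(view_arities[v])]
                next_var += view_arities[v]
                body.append((v, args))
            yield body

# Views V1(L, M) and V2(A, C) of Example 2, all templates of size up to 2:
for template in rewriting_templates({"V1": 2, "V2": 2}, 2):
    print(template)
# [('V1', ['X0', 'X1'])], [('V2', ['X0', 'X1'])],
# [('V1', ['X0', 'X1']), ('V1', ['X2', 'X3'])], ... five templates in total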
3.3 Correctness of Algorithm Build-MaxCR
We now formulate theorems that establish soundness and completeness of Build-MaxCR, as well as the decidability results for two decision versions of the problem of constructing UCQAC k-bounded MCRs for CQAC queries using CQAC views. The proofs of these correctness results for Build-MaxCR, as well as our experimental results that corroborate the efficiency and scalability of the algorithm, can be found in the full version [1] of the paper.

Theorem 3 (Soundness of Build-MaxCR). For a Build-MaxCR problem input (Q, V, k), let P be a CQAC-rewriting template (of some size s ≤ k). Then for any CQAC query P′ :- P & S that is output by Build-MaxCR, (P′)exp is contained in Q. (Here, S is a conjunction of ACs.) (For the notion of a "CQAC-rewriting template" for a Build-MaxCR problem input (Q, V, k), please see Sect. 3.2.)

Theorem 4 (Completeness of Build-MaxCR). For a Build-MaxCR problem input (Q, V, k), let R be a UCQAC query defined in terms of V, such that (i) in each CQAC component Ri of R, the number of relational subgoals of Ri does not exceed k, and (ii) Rexp ⊑ Q. Then (1) the output of Build-MaxCR is not empty, and (2) denoting by P the UCQAC output of Build-MaxCR, we have that Rexp ⊑ Pexp.

By Theorems 3 and 4 we obtain immediately the following two results:
Theorem 5 (Decidability). Given a CQAC query Q, a set V of CQAC views, and a natural number k. (1) It is decidable to determine whether Q has a UCQAC k-bounded contained rewriting in terms of V. (2) Further, given in addition a UCQAC k-bounded query R defined in terms of V, the problem of determining whether R is a UCQAC k-bounded MCR for Q using V is decidable.
4 Finding Minimized Minimally Containing Rewritings
We now turn to the problem of finding minimally containing rewritings [16, 17, 18], which we abbreviate as MiCRs, of a CQAC query using CQAC views. The word "minimal" in "MiCR" refers to a containing rewriting that returns the fewest false positives (in the given rewriting language) w.r.t. the query answer. We focus on the problem of enabling a MiCR of a CQAC query using CQAC views to be executed efficiently. To that end, we look at minimizing the number of relational subgoals of a given MiCR, and thus the number of joins in the evaluation plans for the MiCR.

In Sect. 4.1, we introduce the notion of a minimized MiCR. The main contribution of this section is an algorithm that we call pruned-MiCR; see Sect. 4.2. Given a CQAC MiCR for a given problem input (i.e., for a CQAC query and a set of CQAC views), pruned-MiCR globally minimizes the MiCR in an efficient and scalable way. (See Sect. 4.3 for the correctness and complexity results.) Our experimental results [1] suggest that for many problem inputs (for the MiCRs for queries and views of certain types), pruned-MiCR outputs minimized MiCRs whose evaluation costs are significantly lower than those of the (MiCR) input to the algorithm.

Note that the idea of minimizing the number of subgoals in a rewriting is quite general and thus applicable beyond containing rewritings. Specifically, a straightforward modification of pruned-MiCR could be used to reduce the number of relational subgoals of (and thus to provide more efficient execution options for) the outputs of our Build-MaxCR algorithm of Sect. 3. See [1] for the details.

4.1 The Definitions
First, we provide a general definition of a MiCR and then we define (CQAC) minimized MiCRs.

Definition 5 (Minimally containing rewriting). A query Q′ defined in query language L1 is a minimally containing rewriting (MiCR) of a query Q defined in language L2 using a set of views V defined in language L3 if: (1) Q′ is a containing rewriting of Q in terms of V, and (2) there exists no containing rewriting (in language L1) Q′′ of Q using V such that the expansion of Q′′ is properly contained in the expansion of Q′.

For the results in this section, each of L1 through L3 is the language of CQAC queries. We now define the notion of minimized MiCR.
Definition 6 (Minimized MiCR). Given a CQAC query Q and a set of CQAC views V, a CQAC MiCR R of Q using V is a minimized (CQAC) MiCR of Q using V if removing any relational subgoal of R results in a query R′ such that R and R′ are not equivalent as expansions, that is, Rexp ≢ (R′)exp.

By definition, if we delete even a single relational subgoal from a minimized MiCR, it no longer remains a MiCR. Finding minimized MiCRs is especially important where the MiCR is computed once and then executed repeatedly. In such cases, it is important that the MiCR execute efficiently. Since a minimized MiCR may have fewer relational subgoals than the original MiCR (see, e.g., Example 3), and thus fewer joins, such a performance improvement would have a significant payoff.

We now introduce the notion of a "globally minimal" minimized MiCR. A globally minimal minimized CQAC MiCR for a CQAC query Q and set V of CQAC views has the minimum number of relational subgoals among all CQAC queries defined using V that are equivalent (as expansions) to a (unique) CQAC MiCR for Q and V. A globally minimized MiCR may not be unique for a given (Q, V); please see the example in the full version [1] of this paper. While we can show that two distinct minimized MiCRs for a given CQAC MiCR can have a different number of relational (view) subgoals (see [1]), the minimized MiCR output by our algorithm pruned-MiCR is guaranteed to be a globally minimized MiCR; see Sect. 4.3 for the details.

4.2 Algorithm for Finding Minimized MiCRs
In this subsection, we present and discuss the pseudocode for our algorithm pruned-MiCR (Algorithm 2). The pseudocode of Algorithm 2 has two parts:
(A) Lines 1 through 11 of the pseudocode present a "full-MiCR" algorithm that outputs a CQAC containing rewriting R of a given CQAC query Q using a given set V of CQAC views. The full version [1] of this paper contains the soundness and completeness result for this specific full-MiCR algorithm when applied to problem inputs such that the homomorphism property (see Sect. 2) holds between the expansion of the MiCR and the input query.
(B) Lines 12 through 28 of the pseudocode present the pruned-MiCR algorithm that is the subject of the discussion in this section of the paper.
Please note that the full-MiCR part (lines 1-11) of Algorithm 2 is not a contribution of this paper. It is given here just to provide the reader with a complete picture, specifically to indicate which MiCR-generating algorithm was used in our experimental results; see the full version [1] of this paper.

We now outline the flow of our proposed algorithm pruned-MiCR (lines 12-28 of Algorithm 2). The algorithm accepts as its inputs a CQAC query Q, a set V of CQAC views, and a CQAC MiCR R of Q using V. First (lines 12-22 of the pseudocode), pruned-MiCR constructs buckets, one bucket to represent each (view subgoal, query subgoal) pair, where the view subgoals are drawn from the MiCR R, and the query is the input query Q. Suppose two view subgoals g1 and
Algorithm 2. Algorithm Pruned-MiCR
Input: CQAC query Q, set of CQAC views V
Output: Minimized MiCR of Q using views V
begin
  { Construct the full MiCR (see [1] for a discussion): }
  1.  R ← null
  2.  for each view v in V do
  3.    for each containment mapping µi from the core subgoals in the body of v to Q do
  4.      Construct h(v) by replacing each distinguished variable X in v with µi(X)
  5.      ac ← null
  6.      ac_view ← AC(h(v))
  7.      for each aci ∈ AC(Q) do
  8.        if all variables in aci appear in h(v) then
  9.          ac ← ac ∧ aci
  10.     if AC(Q) ⇒ µi(ac_view) ∧ ac then
  11.       Add h(v), ac to the rewriting R
  { pruned-MiCR begins here and ends on line 28: }
  { Construct buckets: }
  12. for each core subgoal gr in R do
  13.   for each query subgoal gq that gr maps to do
  14.     Let B be the bucket representing gr, gq
  15.     ignore_subgoal ← false
  16.     for each gv in B do
  17.       if gr strictly contains gv then
  18.         ignore_subgoal ← true
  19.       else if gv strictly contains gr then
  20.         Delete gv from bucket B
  21.     if ignore_subgoal is false then
  22.       Add gr to B
  { Now we have a set of buckets and a set of view subgoals covering each bucket. }
  23. Run a minimum set cover algorithm to select a set of view subgoals such that all the buckets are covered.
  24. Construct a rewriting by taking a conjunction of the selected views and their associated arithmetic predicates.
  25. if the candidate rewriting is contained in the full MiCR then
  26.   Output the candidate rewriting
  27. else
  28.   Output R  { Output the MiCR that was the input of line 12. }
end
g2 both cover the same query subgoal and are candidates for the same bucket. Then, in case one of the view subgoals properly contains the other, we keep in the bucket the head homomorphism for the contained view only; otherwise, both view heads are inserted into the bucket. Second (see line 23 of the pseudocode), a minimum set cover algorithm is run to select a subset of the view heads such that each bucket is covered. This set of view heads is used to form a candidate rewriting (line 24 of the pseudocode). Finally (lines 25-28 of the pseudocode), the algorithm checks whether the candidate rewriting is equivalent to the full (input) MiCR R, and outputs the candidate rewriting if the check succeeds. (In case of non-equivalence, pruned-MiCR outputs the full MiCR R.)

Consider an illustration of the flow of the algorithm. We will use the query, views, and CQAC MiCR R of the following example.

Example 3. Let query Q and six views, V1 through V6, be defined as follows. Let Q(X, Z) :- p(X, Y, Y, X, X), s(Z, Z), Z < 3; let V1() :- p(X1, A1, B1, X1, C1); let V2(X2) :- p(X2, A2, B2, X2, C2); let V3(X3) :- p(X3, A3, B3, X3, X3); let V4(X4) :- p(C4, A4, B4, X4, X4); let V5(B5) :- p(A5, Y5, Y5, B5, C5); and let V6(Z6) :- s(Z6, T6). It is possible to show that the CQAC query R(X, Z) :- V1(), V2(X), V3(X), V4(X), V5(X), V6(Z), Z < 3 is a CQAC MiCR of Q using {V1, . . . , V6}. Our algorithm pruned-MiCR generates the following globally minimal minimized MiCR: R′(X, Z) :- V3(X), V5(X), V6(Z), Z < 3.

In order to minimize the MiCR R to obtain R′, algorithm pruned-MiCR retains the views that cover the query subgoals most tightly in the MiCR, and deletes views that cover no query subgoal tightly. Specifically, views V3, V5, and V6 should be retained in the MiCR but not the other views. Views V3 and V5 cover the subgoal p(X, Y, Y, X, X) and do not contain each other. At the same time, consider view V2. V2 contains view V3 and thus covers the query less tightly than view V3. Hence, V2 should not be present in the minimized MiCR.
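The pseudocode leaves the set-cover routine of line 23 unspecified. One possible instantiation — not taken from the paper, but consistent with the global-minimality claim of Theorem 6 below — is an exact search over candidate subsets of increasing size, as in the following sketch; the bucket and subgoal names are hypothetical.

from itertools import combinations

def minimum_cover(buckets, coverage):
    """Return a smallest set of candidates whose covered buckets include all of 'buckets'."""
    candidates = list(coverage)
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            covered = set().union(*(coverage[c] for c in combo))
            if buckets <= covered:
                return list(combo)
    return None                                   # no cover exists

# Hypothetical buckets b1..b3 covered by view subgoals g1..g4:
print(minimum_cover({"b1", "b2", "b3"},
                    {"g1": {"b1", "b2"}, "g2": {"b2"}, "g3": {"b3"}, "g4": {"b1", "b3"}}))
# -> ['g1', 'g3'], a cover of minimum size 2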
4.3 Correctness and Complexity of Pruned-MiCR
In this subsection we formulate the correctness and complexity results for our algorithm pruned-MiCR. The proofs and details can be found in [1].

Theorem 6 (Soundness of pruned-MiCR). Given a CQAC problem input (Q, V, R), where R is a CQAC MiCR for Q using V, let R′ be the (CQAC) output of algorithm pruned-MiCR. Then R′ is a globally minimized MiCR for Q and V whenever R and R′ are not isomorphic.

As suggested by our experimental results (see the full version [1] of this paper), for many problem inputs pruned-MiCR outputs minimized MiCRs R′ (that are not isomorphic to the pruned-MiCR input R, see Theorem 6) whose evaluation costs are significantly lower than those of the input R to the algorithm.

Theorem 7 (Completeness of pruned-MiCR (in the MiCR sense)). Given a CQAC problem input (Q, V, R), where R is a CQAC MiCR for Q using
V, let R′ be the (CQAC) output of algorithm pruned-MiCR. Then R′ is a CQAC MiCR for Q and V.

While being complete in the sense of Theorem 7, algorithm pruned-MiCR is not complete in the sense that it does not always produce a (globally) minimized MiCR for its problem inputs. The reason is that pruned-MiCR does not consider shared variables across query subgoals (i.e., variables that occur in two or more subgoals of the query) while minimizing the MiCR; see [1] for the details. Finally, the complexity of pruned-MiCR is singly exponential in the size of its problem inputs. See [1] for the details.
References
[1] Afrati, F., Chandrachud, M., Chirkova, R., Mitra, P.: Approximate rewriting of queries using views. Technical Report TR-2009-7, NCSU (2009), http://www.csc.ncsu.edu/research/tech/reports.php
[2] Bayardo, R., Bohrer, W., Brice, R., Cichocki, A., Fowler, J., Helal, A., Kashyap, V., Ksiezyk, T., Martin, G., Nodine, M., Rashid, M., Rusinkiewicz, M., Shea, R., Unnikrishnan, C., Unruh, A., Woelk, D.: InfoSleuth: Semantic integration of information in open and dynamic environments. In: SIGMOD, pp. 195–206 (1997)
[3] Halevy, A.: Data integration: A status report. In: BTW, pp. 24–29 (2003)
[4] Ullman, J.: Information integration using logical views. Theoretical Computer Science 239(2), 189–210 (2000)
[5] Theodoratos, D., Sellis, T.: Data warehouse configuration. In: VLDB (1997)
[6] Chaudhuri, S., Krishnamurthy, R., Potamianos, S., Shim, K.: Optimizing queries with materialized views. In: ICDE, pp. 190–200 (1995)
[7] Levy, A., Mendelzon, A., Sagiv, Y., Srivastava, D.: Answering queries using views. In: PODS, pp. 95–104 (1995)
[8] Abiteboul, S., Duschka, O.: Complexity of answering queries using materialized views. In: PODS, pp. 254–263 (1998)
[9] Afrati, F., Chirkova, R., Gergatsoulis, M., Pavlaki, V.: Finding equivalent rewritings in the presence of arithmetic comparisons. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 942–960. Springer, Heidelberg (2006)
[10] Halevy, A.: Answering queries using views: A survey. VLDB Journal 10(3), 270–294 (2001)
[11] Mitra, P.: An algorithm for answering queries efficiently using views. In: Proceedings of the Australasian Database Conference (2001)
[12] Pottinger, R., Halevy, A.: MiniCon: A scalable algorithm for answering queries using views. VLDB Journal (2001)
[13] Afrati, F., Li, C., Mitra, P.: Answering queries using views with arithmetic comparisons. In: PODS (2002)
[14] Tatarinov, I., Halevy, A.: Efficient query reformulation in peer data management systems. In: SIGMOD, pp. 539–550 (2004)
[15] Halevy, A., Ives, Z., Madhavan, J., Mork, P., Suciu, D., Tatarinov, I.: The Piazza peer data management system. IEEE Transactions on Knowledge and Data Engineering 16(7), 787–798 (2004)
[16] Deutsch, A., Ludäscher, B., Nash, A.: Rewriting queries using views with access patterns under integrity constraints. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 352–367. Springer, Heidelberg (2004)
[17] Grahne, G., Mendelzon, A.: Tableau techniques for querying information sources through global schemas. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 332–347. Springer, Heidelberg (1998)
[18] Calì, A., Calvanese, D., Martinenghi, D.: Optimization of query plans in the presence of access limitations. In: EROW (2007)
[19] Chandra, A., Merlin, P.: Optimal implementation of conjunctive queries in relational data bases. In: STOC, pp. 77–90 (1977)
[20] van der Meyden, R.: The complexity of querying indefinite data about linearly ordered domains. In: PODS, pp. 331–345 (1992)
[21] Klug, A.: On conjunctive queries containing inequalities. J. ACM 35(1), 146–160 (1988)
[22] Afrati, F., Li, C., Mitra, P.: Rewriting queries using views in the presence of arithmetic comparisons. Theoretical Computer Science 368(1-2), 88–123 (2006)
[23] Florescu, D., Levy, A., Suciu, D., Yagoub, K.: Optimization of run-time management of data intensive web-sites. In: VLDB, pp. 627–638 (1999)
[24] Zaharioudakis, M., Cochrane, R., Lapis, G., Pirahesh, H., Urata, M.: Answering complex SQL queries using automatic summary tables. In: SIGMOD (2000)
[25] Levy, A., Rajaraman, A., Ordille, J.: Querying heterogeneous information sources using source descriptions. In: VLDB, pp. 251–262 (1996)
[26] Afrati, F., Gergatsoulis, M., Kavalieros, T.: Answering queries using materialized views with disjunctions. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 435–452. Springer, Heidelberg (1999)
[27] Duschka, O., Genesereth, M.: Answering recursive queries using views. In: PODS, pp. 109–116 (1997)
[28] Qian, X.: Query folding. In: ICDE, pp. 48–55 (1996)
[29] Gupta, A., Sagiv, Y., Ullman, J., Widom, J.: Constraint checking with partial information. In: PODS, pp. 45–55 (1994)
[30] Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: SIGMOD, pp. 574–576 (1999)
[31] Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: SIGMOD, pp. 539–550 (2003)
[32] Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. In: VLDB, pp. 111–122 (2000)
[33] Poosala, V., Ganti, V., Ioannidis, Y.: Approximate query answering using histograms. IEEE Data Engineering Bulletin 22(4), 5–14 (1999)
[34] Rizvi, S., Mendelzon, A., Sudarshan, S., Roy, P.: Extending query rewriting techniques for fine-grained access control. In: SIGMOD, pp. 551–562 (2004)
[35] Miklau, G., Suciu, D.: A formal analysis of information disclosure in data exchange. In: SIGMOD, pp. 575–586 (2004)
[36] Miklau, G.: Confidentiality and Integrity in Data Exchange. PhD thesis, University of Washington (2005)
[37] Calvanese, D., Giacomo, G., Lenzerini, M., Vardi, M.: View-based query processing: On the relationship between rewriting, answering and losslessness. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 321–336. Springer, Heidelberg (2005)
[38] Afrati, F., Li, C., Mitra, P.: On containment of conjunctive queries with arithmetic comparisons. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 459–476. Springer, Heidelberg (2004)
[39] Afrati, F., Li, C., Mitra, P.: On containment of conjunctive queries with arithmetic comparisons (extended version). UCI ICS Technical Report (June 2003)
178
F. Afrati et al.
[17] Grahne, G., Mendelzon, A.: Tableau techniques for querying information sources through global schemas. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 332–347. Springer, Heidelberg (1998) [18] Cal`ı, A., Calvanese, D., Martinenghi, D.: Optimization of query plans in the presence of access limitations. In: EROW (2007) [19] Chandra, A., Merlin, P.: Optimal implementation of conjunctive queries in relational data bases. ACM STOC, 77–90 (1977) [20] van der Meyden, R.: The complexity of querying indefinite data about linearly ordered domains. In: PODS, pp. 331–345 (1992) [21] Klug, A.: On conjunctive queries containing inequalities. J. ACM 35(1), 146–160 (1988) [22] Afrati, F., Li, C., Mitra, P.: Rewriting queries using views in the presence of arithmetic comparisons. Theoretical Computer Science 368(1-2), 88–123 (2006) [23] Florescu, D., Levy, A., Suciu, D., Yagoub, K.: Optimization of run-time management of data intensive web-sites. In: VLDB, pp. 627–638 (1999) [24] Zaharioudakis, M., Cochrane, R., Lapis, G., Pirahesh, H., Urata, M.: Answering complex SQL queries using automatic summary tables. In: SIGMOD (2000) [25] Levy, A., Rajaraman, A., Ordille, J.: Querying heterogeneous information sources using source descriptions. In: VLDB, pp. 251–262 (1996) [26] Afrati, F., Gergatsoulis, M., Kavalieros, T.: Answering queries using materialized views with disjunctions. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 435–452. Springer, Heidelberg (1999) [27] Duschka, O., Genesereth, M.: Answering recursive queries using views. In: PODS, pp. 109–116 (1997) [28] Qian, X.: Query folding. In: ICDE, pp. 48–55 (1996) [29] Gupta, A., Sagiv, Y., Ullman, J., Widom, J.: Constraint checking with partial information. In: PODS, pp. 45–55 (1994) [30] Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: SIGMOD, pp. 574–576 (1999) [31] Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: SIGMOD, pp. 539–550 (2003) [32] Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. In: VLDB, pp. 111–122 (2000) [33] Poosala, V., Ganti, V., Ioannidis, Y.: Approximate query answering using histograms. IEEE Data Engineering Bulletin 22(4), 5–14 (1999) [34] Rizvi, S., Mendelzon, A., Sudarshan, S., Roy, P.: Extending query rewriting techniques for fine-grained access control. In: SIGMOD, pp. 551–562 (2004) [35] Miklau, G., Suciu, D.: A formal analysis of information disclosure in data exchange. In: SIGMOD, pp. 575–586 (2004) [36] Miklau, G.: Confidentiality and Integrity in Data Exchange. PhD thesis, University of Washington (2005) [37] Calvanese, D., Giacomo, G., Lenzerini, M., Vardi, M.: View-based query processing: On the relationship between rewriting, answering and losslessness. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 321–336. Springer, Heidelberg (2005) [38] Afrati, F., Li, C., Mitra, P.: On containment of conjunctive queries with arithmetic comparisons. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., B¨ ohm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 459–476. Springer, Heidelberg (2004) [39] Afrati, F., Li, C., Mitra, P.: On containment of conjunctive queries with arithmetic comparisons (extended version). UCI ICS Technical Report (June 2003)
SQL Triggers Reacting on Time Events: An Extension Proposal

Andreas Behrend, Christian Dorau, and Rainer Manthey

University of Bonn, Institute of Computer Science III, D-53117 Bonn
{behrend,dorau,manthey}@cs.uni-bonn.de
Abstract. Being able to activate triggers at timepoints reached or after time intervals elapsed has been acknowledged by many authors as a valuable functionality of a DBMS. Recently, the interest in time-based triggers has been renewed in the context of data stream monitoring. However, up till now SQL triggers react to data changes only, even though research proposals and prototypes have been supporting several other event types, in particular time-based ones, for a long time. We therefore propose a seamless extension of the SQL trigger concept by time-based triggers, focussing on semantic issues arising from such an extension.
1 Introduction
In their invited talk at VLDB 2000 [4], the recipients of the 10-Year Paper Award, Stefano Ceri, Roberta Cochrane and Jennifer Widom, identified time-based triggers as a "lingering issue" and stated that "although there are numerous interfaces for specifying and scheduling activities based on time, including database activities, to date we have not seen time-based events incorporated directly into a commercial DBMS trigger system". Speaking about triggers in commercial DBMS means speaking about triggers in SQL. Fortunately, there has been a standardization of various SQL trigger dialects in SQL:1999 [1,17], but time-based triggers are still missing. Ceri/Cochrane/Widom base their observation on a substantial range of applications which would benefit from such triggers, as did Simon/Kotz-Dittrich 5 years earlier [19]. The advent of monitoring applications and data stream management has recently renewed the interest of many researchers in time-dependent triggering of data-intensive tasks [3]. Our own interest in the subject matter has been "triggered" by concrete experiments in (financial) stream monitoring, too [2]. Even though there is a lot of acclaim for the need of time-based triggering, one quite often encounters doubts whether moving time into database triggers is really the way to go. Such reservations are, however, based much more on general "gut feelings" that active rules are difficult to understand and/or causing inefficiency rather than on concrete evidence based on a fair and serious comparison between extended trigger solutions and workarounds combining, e.g., timers in application programs and embedded SQL commands.
The lack of time events for SQL triggers is even more surprising as there has been a wealth of research on the matter since the early 1980s. HIPAC [9], Ode [13], Ariel [14], SAMOS [10], Snoop [6] are well-known examples of academic projects where trigger languages with time events have been proposed and implemented. Even though the range of variations in syntax and semantics of these languages is as broad as one might expect in academia, there is a nearly unanimous agreement on the principal outline of an extension of "classical" (i.e., modification-based) trigger languages by time issues, which can be summarized quite concisely as follows:
(1) Rules can be triggered by the occurrence of time events, too.
(2) Periodically repeated triggering of the same reaction is possible, where the period is specified by an expression returning a time duration.
(3) Delaying reaction execution to some later point in time relative to the triggering event of a rule is possible.
Whereas delaying makes sense relative to some non-time event only, periodic repetition may or may not depend on a previous modification event. There is debate about syntax for expressing such features, and about one or the other subtlety of interaction between modification-based and time-based triggering, but the general setting has been consolidated and the principal meaning of each construct is close to obvious.

Nevertheless, up till now there seems to be just one proposal for integrating time into SQL triggers around, a rather recent one (published in 2002), and a rather limited one, not even aiming at a proper orthogonal extension of SQL:1999, but mainly intended to challenge the DB community as far as efficient implementation is concerned [16]. Hanson and Noronha propose in essence to extend conventional modification triggers with a CHECK EVERY construct which introduces the possibility to periodically react to modifications of the database at the end of a specified interval of time rather than immediately after the update. That is, the modification event triggering a rule can be regarded as setting a timer which repeats the reaction each time it expires. Thus, periodic repetition of reactions can be achieved, but no more.

In our approach presented here, we extend the SQL:1999 trigger sublanguage by a reasonably small, but powerful set of new features which can be orthogonally combined with each other as well as with those components of SQL triggers introduced so far. The functionality expressible by these extensions covers the three categories mentioned above to more or less the same degree as done by previous non-SQL proposals. Moreover, we introduce functionality not proposed in the literature yet, such as, e.g., being able to choose whether to base delay or periodic repetition on valid time or transaction time events, as proposed in temporal extensions of SQL for a long time [20]. We discuss the rather straightforward semantic basis of this extension, and propose solutions to the intricacies and pitfalls arising when looking at semantics issues of such triggers more closely. We refrain, however, from defining syntax and semantics rigorously, as doing so would blow up the paper and prevent us from clearly making the point we want to make, namely that there is a rather easy, but powerful and yet not obvious solution to the challenge of introducing time into SQL triggers. Thus, this is indeed more of an "ideas and issues" paper rather
than a "battlefield report" full of success stories (yet). However, we believe that a first step is overdue towards filling the unfortunate gap between the state of the art in academic research on time and triggers and the growing commercial needs for such features. Without a proper transfer to SQL, this need will never be satisfied! We have a prototype implementation of a DBMS extension, though, at present supporting an early version of our language only, based on MS Access extended by a Java add-on. This is an academic solution and not really meaningful as far as serious measurements are concerned, but at least demonstrating feasibility of the approach [22].
2 Events in SQL: A Reminder
In active DB research, events have been generally considered occurrences of a happening of interest at a certain point in time, i.e., events are considered atomic. Only few authors (e.g., [12]) have been discussing events with duration. In an active DBMS, an event monitor is responsible for observing events and recording event-related information (event parameters), in particular the time point at which the occurrence took place. Similar events are considered instances of the same event class, call events and time events, resp., being examples of such event classes. When users specify active rules, they usually state a particular event class, possibly further restricting their specification by expressing conditions on certain event parameters. Thus, every event which is an instance of the specified (sub)class triggers the rule. Within this framework, by now widely accepted for active databases, call events (as supported in SQL:1999 in the form of data change statements) and time events (which we propose to include in SQL) differ considerably, so that it is worthwhile to clearly identify these differences in general before turning to a specific proposal.

Call events are characterized by the existence of an event-related operation (DBMS routine or external procedure) which is called. It is expected that the actual execution of this event-related operation will begin shortly after the call, but with some recognizable delay controlled by the active DBMS. It is important to understand that the event triggering the rule (and thus causing the specified reaction) is the call, not the (beginning or end of the) execution of the event-related operation. Therefore, reactions in SQL resulting from active rule triggering may already take place before execution of the event-related operation has actually started (BEFORE-triggers). In such cases, it is the role of the active DBMS to delay operation execution till the end of the BEFORE-phase. In contrast, reactions of SQL AFTER-triggers take place only after the event-related operation has stopped executing. Event parameters of a call event are the operation called and the parameters of the call itself, e.g., the table modified, the type of modification and the attribute values modified.

Time events, on the other hand, do not have an associated event-related operation. Strictly speaking, there is not even a proper "happening of interest" taking place, apart from the fact that the very point in time actually has been reached. Consequently, a time event does not have any event parameters
other than the time point of its occurrence. Again, there will be a certain delay between observation of the time event and the start of the first reaction caused by a triggered rule, but there cannot be any BEFORE- or AFTER-phase of reaction, simply because there is no operation execution before or after which the reaction could be scheduled.

As SQL supports only call events up till now, there is no special need for an independent event monitor. Call events in SQL are simply data change statements which are under the control of the transaction manager and could be directly processed within it. Time-based events, however, are not noticed during normal operation but have to be explicitly generated and signalled by a background process which uses, e.g., timers or countdowns. This motivates an independent component for monitoring and managing both time-based as well as call events in one place, becoming particularly apparent when considering complex event specifications. Furthermore, time events - like any other type of event signalled from external media like sensors or clocks - cause the problem of monitoring events occurring while the active DBMS is offline. For time events, one could imagine sophisticated routines for "recovering" time events missed by the event monitor during offline phases. However, we will tacitly assume in this paper that the DBMS is continuously online and able to react in order to limit the scope of discussion.
3 Time-Based Triggers for SQL
We propose to extend SQL triggers along the line outlined in the introduction, i.e., by adding the three following features: time events for triggering rules at some fixed time, delayed execution of a trigger reaction, and repeated executions of the reaction of a trigger. Up till now, rules may be triggered by call events only, and reactions of SQL triggers are always executed just once. We claim that all variants of using time in connection with rule triggering introduced in the literature so far can be expressed by means of these three concepts. We propose a very small set of simple, but powerful additions to SQL's trigger syntax (which we assume as being known for brevity's sake) able to express each of these temporal features in a natural manner, without interfering with any of the non-temporal features of rule triggering, i.e., conservatively extending the standard. Each of these keywords can be considered an operator taking as parameter an expression over one of the temporal domains already supported in SQL today:
(1) The keyword AT combined with an expression returning a timestamp will be used to introduce fixed points in time, e.g. AT 2007-09-29 12:00.
(2) A system function EXIT returns the timestamp of the end of execution of a triggering data change statement, i.e., the moment in time when a regular AFTER trigger would start to process its reaction. This is needed as a reference point for delay.
(3) A keyword FROM followed by a timestamp expression for defining when to start some periodic repetition.
(4) The keyword EVERY combined with an expression returning a period for defining repetition intervals.
None of the three keywords (AT, EVERY, FROM) is new or its use surprising, but we believe that the way they can be combined with each other and in which
they can be used to extend "traditional" SQL triggers without affecting other trigger parts is indeed new and remarkable. EXIT is new and has never been proposed before, as far as we know.

3.1 Time Events
Rules triggered on certain absolute time events are the most common feature of time triggers (TTs). Expressing something like this in SQL is very easy using the keyword AT, e.g. AT 2009-03-18 23:59. For long-term planning of activities, triggers like this may be valuable, particularly if triggering depends on a condition being satisfied over the database state reached when triggering time has come (but not predictable before), e.g.:

CREATE TRIGGER t1
AT 2009-12-31 23:59
WHEN (SELECT COUNT(*) FROM students)<100
BEGIN ATOMIC ... END

Another motivation might be that the reaction is an update depending on data present in the (future) DB state reached at triggering time, e.g.:

CREATE TRIGGER t2
AT 2009-12-31 23:59
BEGIN ATOMIC
  INSERT INTO annual_sales
    (SELECT 2007, SUM(sold) FROM transactions
     WHERE date BETWEEN 2007-01-01 AND 2007-12-31);
END

Note that AT may be combined with any expression returning a timestamp. This in particular allows time event specifications of the form 'AT 2007-12-31 23:59 + 1 YEAR' or 'CURRENT TIMESTAMP + 2 DAYS', e.g. as proposed in Oracle [21].
3.2 Delayed Reaction Execution
Reaction execution may be delayed by combining a call event with a temporal offset. This offset is either added to the point in time when the triggering command terminates (returned by EXIT) or to a time-valued attribute of (one of) the modified rows, thus generating "relative events". These two variants of delayed reaction can be found in another pair of examples. For instance, if a failed exam has been recorded in the exams table, it might be reasonable to monitor compliance with the exam regulations, e.g. to check three months after the failed exam whether the student has signed in for the second attempt or not. A rule implementing exactly this kind of checking policy has the following general form in our proposed SQL extension:
CREATE TRIGGER t3
AFTER INSERT ON results
REFERENCING NEW ROW AS x
AT EXIT + 3 MONTHS
WHEN x.result = 'failed'
BEGIN ... END

The end of the 3-month period, however, might not be 3 months after the result has been registered in the database, but 3 months after the day when the failed exam has actually occurred in reality. Assuming this day to be one of the attribute values of the inserted row, it could be referenced via the NEW transition variable, leading to the data-dependent offset specification AT x.date + 3 MONTHS instead. Note that these two forms of offsets reflect the difference between so-called transaction and valid time, which is made in temporal database research [20]. It is also possible to specify an offset which is explicit as well as data-dependent, e.g. AT EXIT + x.duration DAYS, thus combining transaction time and valid time aspects.
3.3 Repeated Reaction Execution
Repeating execution of a particular DB-related action regularly after a fixed period of time has elapsed is one of the most frequently proposed temporal features of active rule languages. There have been numerous proposals for more or less sophisticated ways of expressing periodic repetition (not only related to active databases). The use of EVERY as the keyword introducing any such periodical repetition clause stems from SAMOS [10] and is both obvious and convenient. EVERY is to be combined with an expression of type PERIOD, e.g. EVERY MONTH, or EVERY 3 HOURS. Even more sophisticated specifications introducing a period only implicitly, such as EVERY MONDAY, EVERY 1st SUNDAY PER MONTH, or EVERY DAY EXCEPT SUNDAY, could be imagined. We do not elaborate on a concrete and complete syntax for period specifications of this kind here - for making the point, the examples below will suffice. A time event trigger weekly transferring credit points of students to their records, e.g., may look as follows:

CREATE TRIGGER t4
FROM 2009-05-23 12:15 EVERY 7 DAYS
WHEN (SELECT count(*) FROM conf_credits)>0
BEGIN ATOMIC
  UPDATE studs_rec
  SET time=systimestamp,
      credit=credit+(SELECT conf_credits.credit
                     FROM conf_credits
                     WHERE studs_rec.name= conf_credits.name);
END

Periodical repetition will always have to start from a fixed point in time, to be introduced by means of a timestamp expression preceded by the keyword FROM (replacing the keyword AT used for non-repetitive execution). In case of
delayed reactions it is self-evident to allow data-dependency in connection with period specifications as well, e.g. yielding a specification of the following form: REFERENCING NEW ROW AS x FROM x.date EVERY x.period DAYS.
4 Semantics
The following semantics discussion uses well-established notions of active rule processing as proposed, e.g., in [18], and aims at full orthogonality between the new features and those already present in SQL.

4.1 Time Triggers
As for top-level SQL-data statements, one would expect the execution of an activated TT to be transaction-initiating. This is because the action part of a TT may include data change statements leading to the activation of further triggers. The resulting set of activated modification-based triggers should be executed in one and the same transaction in order to guarantee that either all of the triggered changes are performed, or none of them. In SQL, the effects of a data change statement may be rolled back due to an error caused by the event-related operation itself or by a subsequently triggered action. In this case, all actions executed by activated triggers for this statement are rolled back, too. In contrast, a time event activating a set of triggers possesses no event-related operation which could possibly cause a rollback. Therefore, triggers activated simultaneously don't share any common context and can be executed within separate transactions. Indeed, this ought to be the model of execution, as integrating the execution of TTs into running transactions or using one joint transaction for processing all triggers activated by the same time event would establish unfounded causal dependencies. If, however, such dependencies are intended, a serial execution should be enforced by combining the corresponding actions into one trigger or by using different activation times.

In the context of concurrent DB access, the interaction between TTs and other DB-related activities has to be considered. Obviously, already running transactions may prevent TTs from being executed on time because necessary resources are blocked. Since it can be assumed that the activation of a TT has been chosen deliberately, substantial execution delays may often be unacceptable. To avoid this, one could provide TTs with the ability to force other DB-related activity to release blocked resources, which may even result in a partial rollback of transactions. Note that this should not mean to interrupt other TTs, as it seems desirable to ensure a chronological execution for triggers with different activation times.

Coupling Modes. Coupling modes determine when a triggered action is to be executed in relation to the occurrence of the triggering event and the currently running transactions [18]. The modification-based triggers in SQL use delayed event-condition (EC) and immediate condition-action (CA) coupling, the latter
Fig. 1. Execution phases of call triggers in SQL
specifying that the C and A part of an activated trigger are evaluated/executed over the same DB state. From the discussion above it follows that at least one coupling mode for TTs must be delayed in order to avoid problems for the scheduler. In order to follow design decisions already made for triggers in SQL, we propose to use delayed EC coupling as well as immediate CA coupling for TTs, too.

Activation Time vs. Execution Time. The point in time specified by a TT's AT (FROM) clause will be referred to as its activation time. However, its actual execution time may be different due to the reasons discussed above. Because of the proposed immediate CA coupling, condition evaluation of a TT takes place at its execution time. Scheduling of TTs is quite straightforward because their activation times are immediately determined at schema definition time. In particular, functions within the AT (FROM) clause of a TT are applied as soon as the CREATE TRIGGER command is issued.

Periodic Event Specifications. A periodic TT is associated with a set of activation times determined at schema definition time. This leads to a set of execution times at which the condition and action of the trigger are evaluated/executed, respectively. Obviously, placing these executions into a common transaction is not practicable as this would lead to a potentially infinite transaction. Therefore, the execution of each condition/action part is to be placed into a separate transaction. If there are logical dependencies between these transactions, they can be reflected in the trigger's condition clause. Thus, we propose the more general approach where each transaction is considered independent.
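If a causal dependency between actions triggered at the same point in time is intended, the discussion above suggests combining them into a single trigger. A minimal sketch, reusing the studs_rec table of the earlier examples (the concrete updates are arbitrary and chosen only for illustration):

-- two independent time triggers with the same activation time run in
-- separate transactions and must not rely on each other's effects
CREATE TRIGGER t_a AT 2009-06-01 00:00
BEGIN ATOMIC
  UPDATE studs_rec SET credit = credit + 1;
END

CREATE TRIGGER t_b AT 2009-06-01 00:00
BEGIN ATOMIC
  UPDATE studs_rec SET time = systimestamp;
END

-- if the second action is meant to observe the effect of the first, both
-- actions are combined into one trigger and thus into one transaction
CREATE TRIGGER t_ab AT 2009-06-01 00:00
BEGIN ATOMIC
  UPDATE studs_rec SET credit = credit + 1;
  UPDATE studs_rec SET time = systimestamp;
END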
4.2 Delayed and Repeated Reaction
Activation vs. Reaction Time. Due to the potential data dependency of time event specifications, trigger activation cannot be determined at schema definition time but only once the triggering modification event has occurred. Allowing the application of time-related functions within AT/FROM clauses (e.g., AT CURRENT TIMESTAMP + nr.deadline) raises the question when to evaluate such data-dependent time event specifications. The possible choices are illustrated in Fig. 1 where the execution phases of ordinary SQL triggers carried out after observation of a call event E at time T1 are recalled. A natural choice would be T3 , i.e., immediately after the trigger execution contexts (TECs) have been created [17]. This coincides with the activation time of ordinary SQL triggers. As the
determined transition tables are potentially re-modified by BEFORE-triggers, however, their content should not be used for instantiating parameterized time event specifications. Therefore, we propose T12 as a reliable reference time at which a given TEC is fixed and integrity maintenance by referential actions as well as immediate integrity checking has been successfully completed. This point coincides with the actual execution time of ordinary AFTER-level triggers and is the point in time returned by EXIT. Delayed triggers can be seen as a generalization of ordinary AFTER-level triggers, but because of the chosen semantics there remains a subtle difference. As an example, consider the following triggers t5 and t6 CREATE TRIGGER t5 AFTER INSERT ON studs AT EXIT
CREATE TRIGGER t6 AFTER INSERT ON studs ...
which behave differently despite the chosen non-positive delay within the event specification of t5. The reason is that t5 initiates its own transaction whose potential failure no longer has any consequence for the successful execution of the triggering call event. Thus, our proposed concept of delayed triggers is an extension rather than a generalization.

Coupling Modes. For orthogonality reasons, delayed EC coupling and immediate CA coupling ought to be chosen for triggers with delay, too. With respect to EC coupling, however, the question arises whether the condition is evaluated within the same transaction (delayed mode) or within a different one (detached mode). Delayed coupling mode as used for modification-based triggers introduces two forms of dependencies. On the one hand, all triggers are executed only if the corresponding modification has been successfully applied to the database. On the other hand, if trigger execution fails, the corresponding modification is rolled back, too. The first condition ought to be satisfied for the execution of delayed triggers, too. That is, they should solely react to modification events whose related operations have not been rolled back. To require satisfaction of the second condition, however, seems to be inappropriate for delayed triggers. SQL considers a modification and its associated triggers to be an atomic unit due to integrity reasons. Having a substantial time delay, however, considerably weakens the connection between this kind of triggers and their activating modification. In fact, it is impracticable to employ them for integrity-related tasks with respect to their activating modification, which has occurred a long time ago. Thus, we suggest a detached coupling mode, while the execution of the transaction related to the trigger remains causally dependent [9] on the commit of the transaction in which the activating modification took place. Since the effects of a committed modification can be erased by a subsequent transaction, the condition part of a delayed trigger has to be used for checking whether the effects of the activating modification are still present in the database by the time of its execution.

Periodic Triggering. The condition/action part of a trigger with periodic event specification is iteratively evaluated/executed as long as the trigger remains an active schema object. As mentioned above, the execution of each CA
block ought to be placed into a separate transaction, leading to a set of transactions where each is considered independent. Having chosen a causally dependent coupling mode, however, introduces a dependency which ought to be considered by the newly generated transactions, too. This is achieved by making these transactions causally dependent on the activating modification event but leaving them mutually independent. Thus, each transaction is solely initiated if the activating modification is successfully processed, but remains independent of the failure of others induced by the same periodic event specification.

Execution Granularity. As delayed triggers can be viewed as a special form of modification-based AFTER-triggers, they have an execution granularity specification, too. In SQL, the trigger execution granularity specifies the execution mode of a trigger as either FOR EACH ROW or FOR EACH STATEMENT and determines how many times an activated trigger is executed within its own transaction. While a ROW LEVEL-trigger is executed for each row affected by its activating modification, a STATEMENT LEVEL-trigger is executed exactly once. "Normal" AFTER-triggers in SQL have their own TEC which does not change anymore and is pushed onto a stack when a new execution context is activated due to the invocation of an update statement from the trigger's action part. After processing all (re)actions resulting from the statement, the suspended TEC is popped off the stack and reactivated. This so-called recursive cycle policy [18] is usually considered in systems supporting immediate rule processing. However, since we have proposed detached EC coupling for delayed/repeated triggers, the cycle policy has to be changed appropriately. Such triggers behave like AFTER-triggers until they are selected for execution somewhere between T12 and T13. Because of a possible time offset given either implicitly or explicitly by the AT/FROM specification, their execution in a separate transaction may be postponed until this offset expires. The earlier determined TEC is materialized, as it provides the transition data needed during the actual trigger execution in a separate transaction. Thus, within the activating transaction such triggers behave like AFTER-triggers with an empty action part, such that their recursive evaluation is not interrupted. Within the executing transaction, however, the action part of the trigger is recursively evaluated, while the saved and re-activated TEC would always be the first element on the stack attached to this transaction. The materialization of transition tables over a potentially long period of time is a performance-related issue which a user must take into consideration when using delay or repetition.
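A sketch of a delayed trigger in the spirit of the above discussion is given below. The interval-valued column deadline on studs is assumed for the example; the WHEN clause re-checks, at execution time, that the inserted row is still present, since the detached coupling no longer ties the trigger to the fate of later transactions:

CREATE TRIGGER t_followup
AFTER INSERT ON studs
REFERENCING NEW ROW AS n
AT CURRENT TIMESTAMP + n.deadline
-- detached EC coupling: condition and action run in their own transaction,
-- causally dependent only on the commit of the activating modification
WHEN ((SELECT count(*) FROM studs s WHERE s.name = n.name) > 0)
BEGIN ATOMIC
  UPDATE studs_rec SET time = systimestamp WHERE studs_rec.name = n.name;
END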
4.3 Trigger Interactions
In Section 2 we already discussed the differences between time and modification events, which lead to a different behavior of time-based and modification-based triggers. Time events are external events and may trigger several rules which share no common context. Therefore, these rules ought to be executed within separate transactions. A modification event, however, may also be generated by an external process, but this would lead to the initiation of one transaction only
in which all triggered rules are processed. This is due to the fact that all triggered reactions share the same context given by the set of modified tuples which results from the called operation. For an event specification that refers to both event types, as is the case for delayed or repeated triggers, the context reference is retained by making the corresponding transactions causally dependent.
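The contrast can be summarised by a small sketch; the trigger names are arbitrary, and the bodies of the modification-based triggers are elided as in the t6 example above:

-- same time event: no shared context, one transaction per trigger
CREATE TRIGGER t_report AT 2009-09-01 00:00
BEGIN ATOMIC
  UPDATE studs_rec SET time = systimestamp;
END

CREATE TRIGGER t_purge AT 2009-09-01 00:00
BEGIN ATOMIC
  DELETE FROM conf_credits;
END

-- same modification event: the triggers share the transition tables of the
-- INSERT and are processed within the single triggering transaction
CREATE TRIGGER t_log  AFTER INSERT ON studs ...
CREATE TRIGGER t_init AFTER INSERT ON studs ...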
5 Further Extensions
In this section, trigger extensions are proposed which help to further control the execution of both trigger types. To this end, the CREATE TRIGGER statement is extended by corresponding optional components and a new SET TRIGGERS statement is introduced for modifying certain characteristics of trigger execution.

Isolation Levels. As the execution of a time-based trigger leads to the initiation of a respective transaction, one obvious extension would be to provide optional control parameters within the trigger for specifying the isolation level of this transaction. Isolation levels allow for trading off performance against correctness as they determine what data the transaction can see when other transactions are running concurrently. The following trigger definition illustrates how to set the isolation level to REPEATABLE READ in order to avoid dirty and fuzzy reads [1]:

CREATE TRIGGER t7
FROM 2009-01-31 12:15 EVERY 2 MONTHS
WITH ISOLATION LEVEL REPEATABLE READ
BEGIN
  CALL send_email(studs, studs_rec);
END

Using the isolation level SERIALIZABLE may postpone the execution of an activated time-based trigger until all running transactions have finished. This is also the case for systems which do not allow for concurrent execution of transactions, such that all transactions have to be executed sequentially.

Deadlines. When delaying the execution of an activated trigger it is useful to specify a time period during which it is still meaningful to process it. After the expiration of this deadline, however, the execution of the activated time-based trigger ought to be cancelled. The concept of trigger execution deadlines is already known from so-called real-time databases [11]. It allows for specifying to what extent the time-point of the real execution may differ from the intended one, and could be used for a postponed execution of triggers activated within an offline phase of the system. Another problem addressed by deadlines is the avoidance of excessive growth of the set of activated triggers which are still to be executed. This problem can occur when the time offset between the activation and execution time of a trigger with a periodic event specification is frequently greater than its fixed
period. But also the interaction of several time-based triggers with quite similar intended activation times and long execution times may cause the system to become unstable unless reasonable deadlines are applied. As an example of the application of a deadline, consider the following periodic event specification

... FROM 2009-01-31 12:15 EVERY 2 MONTHS DEADLINE 2 DAYS ...

with a deadline of 2 days in case one of the intended execution time-points coincides with a weekend where the system might be switched off. In general, for triggers with a periodic event specification a deadline value should be chosen which forces each trigger execution to take place before the occurrence of the next intended activation time-point. Note that in contrast to transaction control parameters, the concept of deadlines is also useful for modification-based triggers in order to limit the number of recursively activated triggers.

Activation States and Validity Periods. SQL triggers are static schema objects which are not supposed to be modified after their creation. Dynamic aspects with respect to triggers by means of priorities and activation states, however, suggest further control parameters which ought to be modifiable during the lifetime of a trigger as well. The activation state of a trigger is either active or inactive and allows for temporary deactivation. Inactive triggers remain persistent schema objects but are not selected during the trigger activation phase. Activation states can be chosen to be INITIALLY ACTIVE (default) or INITIALLY INACTIVE within the CREATE TRIGGER statement and ought to be alterable by a new SET TRIGGERS statement. It is sometimes useful to change the activation state of a trigger automatically by specifying a respective validity period. For instance, it is advantageous to restrict the application of the trigger t8 to the current semester period, which can be achieved by a corresponding validity period

... FROM 2009-01-23 12:15 EVERY 7 DAYS UNTIL 2009-05-23 12:00 ...

The trigger t8 is activated on January 23rd, 2009 and automatically deactivated on May 23rd, 2009 such that redundant applications are avoided. Validity periods allow for an automatic control of trigger activation states and are useful for all trigger types except those with an absolute time event specification.

Trigger Priorities. Another dynamic aspect of triggers is priorities, which ought to be changeable in order to correct an erroneous schema design or to avoid the redefinition of existing triggers if new triggers with a lower priority are introduced during schema evolution. There exist various possibilities for defining
a dynamic priority concept based on numerical, relative or timestamp priorities. We refrain from discussing dynamic priorities in detail but solely stress two important points with respect to the specific SQL context. First, trigger priorities ought to be chosen independently of the event type. Second, the resulting priorities must represent a well-ordered set such that all triggers have pairwise different priorities. For changing priorities which have been assigned at trigger creation time, we again propose to employ a SET TRIGGERS statement.
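The paper leaves the concrete syntax of the SET TRIGGERS statement open; one conceivable form covering activation states and priorities, given purely as an assumption, is:

-- temporarily deactivate a trigger and later re-activate it
SET TRIGGERS t4 INACTIVE;
SET TRIGGERS t4 ACTIVE;

-- change a priority that was assigned at trigger creation time
SET TRIGGERS t7 PRIORITY 10;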
6 Related Work
As mentioned in the introduction, time events have been considered in various active database systems before (e.g. [6,9,10,13]). All these prototypical implementations provide absolute, periodic and relative time event specifications. As in our approach, the considered temporal domain is discrete with a granularity that usually ranges from year to second. In all systems, time events are considered external events and, therefore, transaction-initiating. Additionally, the occurrence and observation time of clock ticks is not distinguished assuming no positive recognition delay by the system. Event parameters are usually not provided by the event specification but can be accessed in form of timestamps in the condition or action part of an active rule in those systems. In contrast to our approach, parameterized event specifications such as occ point(E1) + E1.attr2, meaning the absolute time event which can be computed from the occurrence time of the event E1 and the time offset given in the attribute E1.attr2, are generally not supported. An interesting feature especially for periodic event specifications is the validity interval concept for events in SAMOS which allows for specifying a time interval during which occurrences of events with a specific type may be monitored. This concept is closely related to an activation state specification of SQL triggers. One main difference between our proposal and the above languages is that clock events can be freely combined with other events using so-called event constructors. To this end, a rich event algebra is provided which allows to define complex triggering conditions and a mechanism for detecting the occurrence of events with a complex specification, e.g. colored petri nets in SAMOS [10]. Consequently, further time-related events can be defined using such complex event specifications, e.g., aperiodic events in HIPAC or SNOOP or special periodic events in SNOOP. We refrain from discussing such complex event specifications as they are beyond the scope of this paper. A last significant difference can be found with respect to the semantics of relative event specifications. Usually, the time offset of a relative event specification in those systems is added to the occurrence time of the referenced modification event providing a transparent view on the semantics of such specifications for the user. However, we propose a different (implicit) reference time point (start of the AFTER-phase) for the time offset for orthogonality reasons (cf. Section 4.2). All systems discussed above are based on highly expressive object-oriented data model. In [16], however, even an alternative extension proposal for SQL triggers in a relational context based on time events has been discussed which is
more suited for a direct comparison. The authors propose a new trigger language for SQL that allows for defining so-called timer-driven triggers. They are always associated with a data source which can be either a table, a view or a select statement in SQL and have a timer declaration which controls their execution time. When the timer of a trigger expires, the trigger is executed and can access the data changes of its data source which occurred since its last execution. The timer declaration directly corresponds to an absolute or periodic time event specification whereas relative time event specifications are not supported. The main difference to our approach can be seen in the philosophy of how time-related triggers are to be interpreted. The timer-driven triggers of [16] are immediately activated after their creation and postpone their executions until their timers expire. Thus, each trigger could be interpreted as a kind of active background process which waits for the right time to execute its commands. In our approach, however, time-based triggers are passive schema objects which are activated by an event monitor after a matching external clock event has been observed. Another approach for realizing temporal functionality in SQL-based systems is given by timer controlled programs or through PL/SQL routines executed at predefined times as provided in Oracle. While the former approach bears the usual disadvantages that arise when realizing database functionality externally, preferring time-based triggers over so-called DB jobs is less obvious. In fact, simple forms of time-controlled execution using absolutely or periodically defined points in time resemble cron jobs in operating systems and found their way into commercial systems in form of DB jobs, e.g. formulated as PL/SQL routines. However, for more complex forms of reactive behavior using both time and modification events, the application of DB jobs no longer seems appropriate. As DB jobs are not capable of directly reacting to modification events, this would ultimately require the duplication of functionality already provided by the trigger manager.
7 Conclusion
In this paper, we present a particular proposal for the syntax and semantics of time-based triggers in SQL. To this end, we propose a small set of new features which can be orthogonally combined with each other as well as with those components employed for modification-based triggers in SQL. The functionality expressible by these extensions covers the three standard types of temporal categories by means of absolute, periodic and relative event specifications, and allows delay or periodic repetition to be based on valid time or transaction time events, respectively. Prototypical implementations of time-related triggers have been completed in our research group as the results of master's theses [22].
References
1. American National Standard ANSI/ISO/IEC 9075-1/2:1999. Information Systems–Database Languages–SQL–Part 1/2: Framework/Foundation. American National Standards Institute, Inc. (1999)
2. Behrend, A., Dorau, C., Manthey, R., Schüller, G.: Incremental View-Based Analysis of Stock Market Data Streams. In: IDEAS 2008, pp. 269–275 (2008)
3. Carney, D., et al.: Monitoring Streams - A New Class of Data Management Applications. In: VLDB 2002, pp. 215–226 (2002)
4. Ceri, S., Cochrane, R., Widom, J.: Practical Applications of Triggers and Constraints: Success and Lingering Issues. In: VLDB 2000, pp. 254–262 (2000)
5. Chakravarthy, S., et al.: Composite Events for Active Databases: Semantics, Contexts and Detection. In: VLDB 1994, pp. 606–617 (1994)
6. Chakravarthy, S., Mishra, D.: Snoop: An Expressive Event Specification Language for Active Databases. DKE 14(1), 1–26 (1994)
7. Cochrane, R., Pirahesh, H., Mattos, N.M.: Integrating Triggers and Declarative Constraints in SQL Database Systems. In: VLDB 1996, pp. 567–578 (1996)
8. Ceri, S., Widom, J.: Active Database Systems: Triggers and Rules for Advanced Data Processing. Morgan Kaufmann, San Mateo (1996)
9. Dayal, U., et al.: The HiPAC Project: Combining Active Databases and Timing Constraints. SIGMOD Record 17(1), 51–70 (1988)
10. Dittrich, K.R., et al.: SAMOS in hindsight: experiences in building an active object-oriented DBMS. Information Systems 28(5), 369–392 (2003)
11. Eriksson, J.: Real-Time and Active Databases: A Survey. In: Andler, S.F., Hansson, J. (eds.) ARTDB 1997. LNCS, vol. 1553, pp. 1–23. Springer, Heidelberg (1998)
12. Galton, A., Augusto, J.C.: Two Approaches to Event Definition. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds.) DEXA 2002. LNCS, vol. 2453, pp. 547–556. Springer, Heidelberg (2002)
13. Gehani, N.H., Jagadish, H.V.: Ode as an Active Database: Constraints and Triggers. In: VLDB 1991, pp. 327–336 (1991)
14. Hanson, E.N.: The Design and Implementation of the Ariel Active Database Rule System. DKE 8(1), 157–172 (1996)
15. Hanson, E.N., et al.: Scalable Trigger Processing. In: ICDE 1999 (1999)
16. Hanson, E.N., Noronha, L.: Timer-Driven Database Triggers and Alerters: Semantics and a Challenge. SIGMOD Record 28(4), 11–16 (1999)
17. Melton, J., Simon, A.R.: SQL:1999 - Understanding Relational Language Components, 2nd edn. Morgan Kaufmann, USA (2001)
18. Paton, N.W., Díaz, O.: Active Database Systems. ACM Computing Surveys 31(1), 63–103 (1999)
19. Simon, E., Kotz Dittrich, A.: Promises and Realities of Active Database Systems. In: VLDB 1995, pp. 642–653 (1995)
20. Snodgrass, R.T.: The TSQL2 Temporal Query Language, 2nd edn. Kluwer Academic Publishers, Dordrecht (2007)
21. Urman, S., Hardman, R., McLaughlin, M.: Oracle Database 10g PL/SQL Programming. McGraw-Hill, Emeryville/CA (2004)
22. Wernecke, J.: Eine Erweiterung des aktiven Datenbanksystems ARTA um relative Zeitereignisspezifikationen. Master Thesis, University of Bonn (2005)
Pushing Predicates into Recursive SQL Common Table Expressions

Marta Burzańska, Krzysztof Stencel, and Piotr Wiśniewski

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
{quintria,stencel,pikonrad}@mat.umk.pl
Abstract. A recursive SQL-1999 query consists of a recursive CTE (Common Table Expression) and a query which uses it. If such a recursive query is used in the context of a selection predicate, this predicate can possibly be pushed into the CTE, thus limiting the breadth and/or depth of the recursive search. This can happen, e.g., after the definition of a view containing a recursive query has been expanded in place. In this paper we propose a method of pushing predicates and other query operators into a CTE. This allows executing the query with smaller temporary data structures, since query operators external w.r.t. the CTE can be computed on the fly together with the CTE. Our method is inspired by the deforestation (a.k.a. program fusion) successfully applied in functional programming languages.
1 Introduction
Query execution and optimisation is a well-elaborated topic. However, the optimisation of the recursive queries introduced by SQL-1999 is not advanced yet. A number of techniques are known in the general setting (e.g. the magic sets [1]), but they are not applied to SQL-1999. Since recursive query processing is very time-consuming, new execution and optimisation methods for such queries are needed. It seems promising to push selection predicates from the context in which a recursive query is used into the query itself (in fact into its CTE). The method of predicate-move-around [2] is very interesting. It allows pushing and pulling predicates to places where their execution promises the biggest gain in terms of the query running time. However, this method applies to non-recursive queries only. Recursive queries are much more complex, since predicates external to them must be applied to all nodes reached during the execution, but not necessarily to all visited nodes. It could be useful to push such predicates into the initial step or into the recursive step. However, we cannot do it straightforwardly, since a predicate holding for the result does not need to hold for all nodes visited on the path to the result. In this paper we propose a method of pushing predicates into CTEs subtle enough not to change the semantics of the query. Together with pushing predicates, our method also tries to push other operators into the recursive CTE so that as much of the computation as possible is performed
on the fly together with the recursive processing. This reduces the space needed for temporary data structures and the time needed to store and retrieve data from them. This part of our optimisation method is inspired by the deforestation developed for functional languages [3]. This method is also known as program fusion, because the basic idea behind it is to fuse together two functions of which one consumes an intermediate structure generated by the other. This algorithm has been successfully implemented in Glasgow Haskell Compiler (GHC [4]) and proved to be very effective. But it has to be mentioned, that GHC is not equipped with the original deforestation technique. The algorithm of [3], although showing a great potential, was still too complicated and did not cover all of the possible intermediate structures. This is why many papers on deforestation’s enhancements have been prepared. The most universal, and the simplest at the same time is known as the short-cut fusion, cheap deforestation or foldr-build rule [5,6]. Unfortunately it is not suitable for dealing with recursive functions. The problem of deforesting recursive function has been addressed in [7]. There has been work done on how to translate operators of an object query language into its foldr equivalent. Although most of them have dealt only with OQL operators, they are successful in showing that OQL can be efficiently optimised with short-cut deforestation ([8]). But still the issue of optimising recursive queries is open. One of the works in this field is [9].It presents three optimization techniques, i.e. deleting duplicates, early evaluation of row selection condition and defining an enhanced index. This paper is organized as follows. In Section 2 we show an example which pictures the possible gains of our method. In Section 3 we explain some small utility optimisation steps used by our method. Section 4 explains and justifies the main optimisation step of pushing selection predicates into CTE. Section 5 shows the measured gain of our optimisation method together with the original query execution plan and the plan after optimisation. We show plans and measures for IBM DB2. Section 6 concludes.
2 Motivating Example
Let us consider a database table Emp that consists of the attributes (EID ⊂ Z, ENAME ⊂ String, MGR ⊂ Z, SALARY ⊂ R). The column eid is the primary key, while mgr is a foreign key which references eid. The column mgr stores data on the managers of individual employees. Top managers have NULL in this column. We also define a recursive view which shows the subordinate-manager transitive relationship, i.e. it prints pairs of eids such that the first component of the pair is a subordinate while the second is his/her manager. Since 1999 one can formulate this query in standard SQL:

CREATE VIEW subordinates (seid, meid) AS
WITH subs(seid, meid) AS (
  SELECT e.eid AS seid, e.eid AS meid
  FROM Emp e
  UNION ALL
  SELECT e3.eid AS seid, s.meid AS meid
  FROM Emp e3, subs s
  WHERE e3.mgr = s.seid )
SELECT * FROM subs;

This view can then be used to find the total salary of all subordinate employees of, say, Smith:

SELECT SUM(e2.salary)
FROM subordinates s2 JOIN Emp e2 ON (e2.eid = s2.seid)
                     JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

A naïve execution of such a query consists in materializing the whole transitive subordinate relationship. However, we need only a small fraction of this relationship, which concerns Smith and her subordinates. In order to avoid materializing the whole view, we start from a standard technique of query modification. We expand the view definition inline:

WITH subs(seid, meid) AS (
  SELECT e.eid AS seid, e.eid AS meid
  FROM Emp e
  UNION ALL
  SELECT e3.eid AS seid, s.meid AS meid
  FROM Emp e3, subs s
  WHERE e3.mgr = s.seid )
SELECT SUM(e2.salary)
FROM subs s2 JOIN Emp e2 ON (e2.eid = s2.seid)
             JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

The execution of this query can be significantly improved if we manage to push the predicate e1.ename = 'Smith' into the first part of the CTE. In this paper we show a general method of identifying and optimising queries which allow such a push. After this first improvement it is possible to get rid of the join with e1 and to push the join with e2, as well as the retrieval of the salary, into the CTE. After all these changes we get the following form of our query:

WITH subs(seid, meid, salary) AS (
  SELECT e.eid AS seid, e.eid AS meid, e.salary
  FROM Emp e
  WHERE e.ename = 'Smith'
  UNION ALL
  SELECT e3.eid AS seid, s.meid AS meid, e3.salary
  FROM Emp e3, subs s
  WHERE e3.mgr = s.seid )
SELECT SUM(s2.salary) FROM subs s2;
The result of the predicate push and the query fusion is satisfactory. Now we traverse only Smith's hierarchy. Further optimisation is not possible by rewriting the SQL query into another SQL query (SQL-1999 severely limits the form of recursive CTEs). However, we need to accumulate neither eids nor salaries. We just need one temporary structure, i.e. a number register to sum the salaries on the fly as we traverse the hierarchy. This is the most robust plan (traverse the hierarchy and accumulate salaries). This is a simple application of deforestation and can be done by a DBMS on the level of query execution plans even if it is not expressible in SQL-1999.
3 Utility Optimisations
The first step that should be done after expanding the view definition is purely syntactic. We add alias names for tables lacking them, and we change aliases that are assigned more than once, so that all tables have different aliases. This is done by a simple replacement of alias names (α-conversion). The second technique is the elimination of vain joins. This technique is usually applied after some other query transformation. When, in one of the parts of the CTE or in the main part of the query, a table is joined by its primary key to a foreign key of another table but is not used apart from the joining condition, it may be deleted. This is done by removing this table from the FROM clause and at the same time removing the join condition in which it is used. There is one subtle issue. The foreign key used to join with the removed table cannot have the value NULL, since such rows cannot be matched. The join with the removed table thus plays the role of the selection predicate IS NOT NULL. Therefore, if the foreign key is not constrained to be NOT NULL, a selection predicate requiring that foreign key to be IS NOT NULL must be added. If the schema determines the foreign key to be NOT NULL, this condition is useless and is not added. Another simple conversion is self-join elimination when the join is one-to-one (primary key to primary key). When encountering such a self-join we choose one of the two aliases used in this join, and then substitute every occurrence of one of them (besides its definition and joining condition) by the other. When this is done we can delete the self-joining condition and the redundant occurrence of the doubled table from the FROM clause. This technique is illustrated by the following example. Starting from the query:

WITH subs(seid, meid, salary) AS (
  SELECT e.eid AS seid, e.eid AS meid, e2.salary AS salary
  FROM Emp e, Emp e2
  WHERE e.eid = e2.eid
  UNION ALL
  SELECT e3.eid AS seid, s.meid AS meid, e2.salary AS salary
  FROM Emp e3, subs s, Emp e2
  WHERE (e3.mgr = s.seid) AND e3.eid = e2.eid )
SELECT SUM(s2.salary)
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

Using self-join elimination we obtain the query:

WITH subs(seid, meid, salary) AS (
  SELECT e.eid AS seid, e.eid AS meid, e.salary AS salary
  FROM Emp e
  UNION ALL
  SELECT e3.eid AS seid, s.meid AS meid, e2.salary AS salary
  FROM Emp e3, subs s, Emp e2
  WHERE (e3.mgr = s.seid) AND e3.eid = e2.eid )
SELECT SUM(s2.salary)
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

Self-join elimination can be applied to both parts of the CTE definition and to the main part of the query. In the mentioned example it was applied to the first part of the CTE.
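The vain-join elimination described at the beginning of this section can be illustrated on a small query that is not part of the running example. Here the second occurrence of Emp serves only to require that a manager exists; since Emp.mgr is nullable (top managers), the removed join is compensated by an IS NOT NULL predicate:

-- before: m is used only in the join condition
SELECT e.eid, e.salary
FROM Emp e JOIN Emp m ON (e.mgr = m.eid);

-- after: the vain join is removed and replaced by a NOT NULL test
SELECT e.eid, e.salary
FROM Emp e
WHERE e.mgr IS NOT NULL;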
4 Predicate Push into CTE
In this section we describe the main part of our technique, i.e. how to find predicates which can be pushed into a CTE and how to rewrite the query to push the selected predicates into the CTE. In subsequent steps we analyse each table joined to the result of a CTE. Such a table may simply be used in the query surrounding the CTE, or may appear to be joined with the CTE after, e.g., expansion of the definition of a view (as in the example from Section 2). In the following paragraphs we will call such a table "marked for analysis". Let us assume that we have marked for analysis a table that does not appear in any predicate besides the join conditions. If this table is joined to the CTE using its primary key, we can mark it for pushing into the CTE. This table's alias may appear in three parts of the query surrounding the CTE: in the SELECT clause pointing to specific columns, in the condition joining it with the CTE, or in the condition joining it with some other table. Let us analyse those cases. The first case is the simplest: we just need to push the columns into both SELECT statements inside the CTE. To do it, we need to follow a short procedure: after copying the table declaration into both inner FROM clauses, we copy the column calls into both inner SELECT clauses, assigning those columns new alias names. We now have to expand the CTE's header using the new column aliases. Finally, in the outer SELECT clause we replace the marked table's alias with the outer alias of the CTE. The second case is when the marked table's alias is in the condition joining the marked table with the CTE. The first step is to copy the joining condition into the
first part of the CTE. While doing this we need to translate the CTE's column used for joining into its equivalent within the first part. Let us assume that the joining column from the CTE was named cte_alias.Col1. In the first SELECT clause of the CTE we have: alias1.some_column AS Col1. Having this information we substitute the column name cte_alias.Col1 with alias1.some_column. We proceed analogously when copying the join condition into the recursive part of the CTE. The third case, when the marked alias occurs within a join clause that does not involve the CTE's alias, is very similar to the case of copying column names from the SELECT clause. Firstly we need to push the columns connected with the marked table into the CTE (according to the procedure described above). Secondly we replace those columns' names by the corresponding CTE columns. All three cases are illustrated by the following example. Consider the query:

WITH subs(seid, meid) AS (
  SELECT e.eid AS seid, e.eid AS meid
  FROM Emp e
  UNION ALL
  SELECT e3.eid AS seid, s.meid AS meid
  FROM Emp e3, subs s
  WHERE e3.mgr = s.seid )
SELECT e2.salary, d1.name
FROM subs s2 JOIN Emp e2 ON (e2.eid = s2.seid)
             JOIN Emp e1 ON (e1.eid = s2.meid)
             JOIN Dept d1 ON (e1.dept = d1.did)
WHERE e1.ename = 'Smith';

The table to be analysed is Emp e2. This table is used in two join conditions (with the CTE, and with the Dept table) and once in the SELECT clause. Therefore we copy the table name into both FROM clauses of the CTE definition; we also copy the condition joining it with the CTE and the column call twice. Then we replace the aliases as described above. Finally we remove the marked table with its references from the outer selection query. The resulting query is of the form:
This form may undergo further optimisations like elimination of self-join. One thing has to be mentioned: if the marked table is not joined with CTE, is should be skipped and returned to later, after other modifications to CTE. Now let us analyse the situation when a table from the outer query is referenced within a predicate. It should be marked for pushing into CTE, undergo moving into CTE like described above, but without deletion from its original place. We have to check if moving the predicate into CTE is possible. There are many predicates, for which pushing them into CTE would put too big restrictions on the CTE resulting in loss of data. During the research on recursive queries we found that the predicate can be pushed into the CTE only if we can isolate a subtree of the result tree that contains only the elements fulfilling the predicate and no other node outside this subtree fulfils this predicate. This may be only verified by checking for the existence of the tree invariant. So a general method for pushing a predicate into CTE is based on checking CTE for the existence of tree invariant and if found, checking if the predicate can be attached to CTE through this invariant. To perform this check we use induction rules. We start by analysing tuple schema generated in the initial step of CTE materialisation. We need to fetch the metadata information on the tables used in FROM clauses. First we create the schema of the initial tuples, so we simply use the SELECT clause and fill the columns with the values found in this clause. Next we analyse the FROM clause and join predicates in the recursion step and from the metadata information we create a general tuple schema that would be created out of a standard tuple. Analysing SELECT clause we perform proper projection onto the newly generated tuple schema thus creating a new schema of a tuple that would be a result of the recursive step. By comparing input and output tuples we may pinpoint the tuple’s element which is the loop invariant. If there is no loop invariant we cannot push the predicates. If there is an invariant, then in order to push the predicate we have to check if it is a restriction on a table joined to the invariant (one of the invariants when many). An easy observation shows that it is sufficient to push the predicate only to the initial step, because, based on the induction, it will be recursively satisfied in all of the following steps. Let us now observe how this method is performed on an example. Let us analyse a following query (with already pushed the join condition): WITH subs(seid, meid, salary) AS ( SELECT e.eid AS seid, e.eid AS meid, e.salary as salary FROM Emp e, Emp e1 WHERE e1.eid = e.eid UNION ALL SELECT e3.eid AS seid, s.meid AS meid,e3.salary as salary FROM Emp e3, subs s, Emp e1 WHERE e3.mgr = s.seid AND e1.eid = s.meid ) SELECT SUM(s2.salary) FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid) WHERE e1.ename = ’Smith’;
The table Emp e1 occurs in the predicate e1.ename = ’Smith’. In the CTE definition we reference the table Emp four times and once the CTE itself. From the metadata we know that the Emp table consists of the attributes: (EID, ENAME, MGR, SALARY) and that the EID attribute is a primary key. This means that every tuple belonging to the relation Emp has the form: (e, ne , me , se ). All of the tuple’s elements are functionally dependent on the first element. By analysing SELECT clauses of the CTE we know that its attributes are: (SEID ⊂ Z, MEID ⊂ Z, SALARY ⊂ R). The initial step generates tuples of the form: (e, e, se ) Let us assume that tuple (a, b, c) ∈ CTE. During the recursion step from this tuple the following tuples are generated: ((a, b, c), (e1, fe1 , le1 , a, se1 ), (b, fb , lb , mb , sb )) Next by projection on the elements 4-th,2-nd,8-th we get a tuple: (e1, b, se1 ) Comparing this tuple with the initial tuple template we see, that the second parameter is a tree invariant, so we may attach to this parameter table with predicate limiting the size of the result collection. Because the predicate e1.ename = ’Smith’ references a table that is joined to the element b, so it can be pushed into the initial step of CTE. Because all of the information from the outer selection query connected with Emp e1 have been included in the CTE definition, they may be removed from the outer query. Using the transformations described in 3 to simplify the recursive step, we get as a result: WITH subs(seid, meid, salary) AS ( SELECT e.eid AS seid, e.eid as meid, e.salary as salary FROM Emp e WHERE e.ename = ’Smith’ UNION ALL SELECT e3.eid AS seid, s.meid as meid, e3.salary as salary FROM Emp e3, subs s WHERE e3.mgr = s.seid ) SELECT SUM(s2.salary) FROM subs s2; This way we have obtained a query which traverses only a fraction of the whole hierarchy. It is the final query of our motivating example (see Section 2). The predicate e1.ename = ’Smith’ has been successfully pushed into the CTE. The general procedure of optimising recursive SQL query is to firstly push in all the predicates and columns possible and then to use simplification techniques described in 3.
5 Measured Improvement
In this section we present the results of tests performed on the motivating example of this paper. The tests were performed on two machines: the first one is equipped with an Intel Core 2 Duo U2500 processor and 2 GB of RAM (let us call it machine A), the other one has a Phenom X4 9350e processor and 2 GB of RAM (let us call it machine B). Each of them has the IBM DB2 DBMS v. 9.5 installed on the MS Vista operating system. The test data is stored within a table Emp(eid, ename, mgr, salary) and consists of 1000 records. This means that the size of the whole materialised hierarchy can be counted in hundreds of thousands of rows (its upper bound is half the square of the size of the Emp table). The hierarchy itself was created in such a way as to eliminate cycles (which is typical of a company hierarchy). The tests were performed in two series. The first one tested the case when the Emp table had an index placed only on the primary key. In the second series indices were placed on both the primary key and the ename column.
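The exact DDL is not given in the paper; the schema and the two index configurations can be summarised by the following sketch (names are chosen to match the text; in DB2 the primary key declaration already implies a unique index on eid):

CREATE TABLE Emp (
  eid    INTEGER NOT NULL PRIMARY KEY,
  ename  VARCHAR(50),
  mgr    INTEGER REFERENCES Emp (eid),
  salary DECIMAL(10,2)
);

-- second test series only: an additional index on the filter column
CREATE INDEX emp_ename_idx ON Emp (ename);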
Fig. 1. Basic query’s plan using index on the Emp table’s primary key. Includes five full table scans, one additional index scan and 2 hash joins that also take some time to be performed
Fig. 2. Optimized query’s plan with index on the Emp’s primary key. This plan has no need for hash joins, also one full table scan and index scan have been eliminated
Let us start by analysing the case when the set of tests was performed on the Emp table that had an index placed only on its primary key. The original query was estimated to be performed within 1728.34 timeron units and evaluated in 2.5 s on machine A. The query obtained using the method described in this paper (it will further be called the optimised query) was estimated by the DBMS to be performed in 1654.71 timeron units. The evaluation plan for the original query (Fig. 1) indicates the use of many full table scans in the process of materializing the
Fig. 3. Basic query’s plan using indices on the Emp table’s primary key and ename column. In comparison to Fig. 1 one of the full table scans has been replaced by less costly index scan. Still two hash joins and four other full table scans remain.
Fig. 4. Optimized query’s plan using indices on the Emp table’s primary key and ename column. In comparison to Fig. 3 one full table scan, one index scan and two hash joins have been eliminated. Also this plan has the least amount of full table scans and join operations, therefore it is the least time consuming.
hierarchy and also two full table scans in the outer select subquery. This indicates that, firstly, the DBMS does not possess any means to optimise the query using already implemented algorithms. Secondly, the bigger the Emp table, the more dramatically the runtime and resource consumption increase. The only benefit of having an index placed on the primary key was in the initial step of materializing the CTE. In the global aspect, this is a small profit, because the initial step in the original query still consists of 1000 records, and the greatest resource consumption takes place during the recursive steps. In comparison, the evaluation plan for the optimised query (Fig. 2), although also containing full table scans, benefited from the elimination of two hash joins (HSJOIN) that needed full table scans in order to attach the Emp table to the materialized CTE. On machine A this query was evaluated in under 1 s. The time was so small because the initial step of the CTE was materialized not for all of the 1000 records, but for only a few. The second set of tests was performed with indices placed on both the primary key and the ename column. The original query was evaluated in 2 s and the cost in timeron units was estimated at 1681.38. As for the optimised query, the corresponding results were 1615.31 timeron units and an evaluation time under 1 s. As in the previous case, the index placed on the primary key was used only in the initial step of materializing the CTE. As for the index placed on the ename column, it
Table 1. Results of the described tests in timeron units and real-time measurements

                             original    optimised   opt/orig
one index      real time     2.5 s       <1 s        >40%
               timeron       1728.34     1654.71     95.7%
two indices    real time     2 s         <1 s        >50%
               timeron       1681.38     1615.31     96%
was used to reduce the number of records attached to the materialized hierarchy. This way the hash join took less time to be evaluated. Nevertheless the evaluation plan still contains many full scans that deal with a huge amount of data. As for the optimised query, the index placed on the primary key is not used, but the index placed on the ename column sped up the materialization of the initial step. The results of the tests have been placed for comparison in Table 1. It is worth noting that the timeron cost of the original query, despite indexing, is greater than in the case of the optimised query. Also, based on this estimation, the profit of our method varies between 4 and 5 percent. It may not seem much, but when thinking of bigger initial tables, this is quite a good result. What is more, because this is a method of rewriting SQL into SQL, further optimisation (like the placement of indices) may be performed.
6 Conclusion
In this paper we have shown an optimisation method for recursive SQL queries. The method consists of selecting the predicates which can be pushed into the CTE and moving them. The condition that needs to be satisfied is the existence of a tree invariant. The benefit of using our method depends on the selectivity of the predicates being pushed and the recursion depth. A highly selective filter condition which may indirectly reduce the number of recursion steps will improve the evaluation time in a significant way. Even experiments with small tables proved the high potential of the method, since even for such a small number of rows the reduction of the execution time is substantial.
References
1. Bancilhon, F., Maier, D., Sagiv, Y., Ullman, J.D.: Magic sets and other strange ways to implement logic programs. In: PODS, pp. 1–15. ACM, New York (1986)
2. Levy, A.Y., Mumick, I.S., Sagiv, Y.: Query optimization by predicate move-around. In: Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) VLDB, pp. 96–107. Morgan Kaufmann, San Francisco (1994)
3. Wadler, P.: Deforestation: Transforming programs to eliminate trees. Theor. Comput. Sci. 73(2), 231–248 (1990)
4. Jones, S.P., Tolmach, A., Hoare, T.: Playing by the rules: rewriting as a practical optimisation technique in GHC. In: Haskell Workshop, ACM SIGPLAN, pp. 203–233 (2001)
5. Gill, A.J., Launchbury, J., Jones, S.L.P.: A short cut to deforestation. In: FPCA, pp. 223–232 (1993)
6. Johann, P.: Short cut fusion: Proved and improved. In: Taha, W. (ed.) SAIG 2001. LNCS, vol. 2196, pp. 47–71. Springer, Heidelberg (2001)
7. Ohori, A., Sasano, I.: Lightweight fusion by fixed point promotion. In: Hofmann, M., Felleisen, M. (eds.) POPL, pp. 143–154. ACM, New York (2007)
8. Grust, T., Scholl, M.H.: Query deforestation. Technical report, Faculty of Mathematics and Computer Science, Database Research Group, University of Konstanz (1998)
9. Ordonez, C.: Optimizing recursive queries in SQL. In: SIGMOD 2005: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 834–839. ACM, New York (2005)
On Containment of Conjunctive Queries with Negation

Victor Felea

"A.I. Cuza" University of Iasi, Computer Science Department, 16 General Berthelot Street, Iasi, Romania
[email protected]
http://www.infoiasi.ro
Abstract. We consider the problem of query containment for conjunctive queries with the safe negation property. Some necessary conditions for this problem are given. A part of the necessary conditions uses maximal cliques from the graphs associated with the first query. These necessary conditions improve the algorithms for the containment problem. For a class of queries a necessary and sufficient condition for the containment problem is specified. Some aspects of the time complexity of the conditions are discussed.

Keywords: query containment, negation, maximal sets, cliques in graphs.
1 Introduction
Query containment is a very important problem in many data management applications, including query optimization, checking of integrity constraints, analysis of data sources in data integration, verification of knowledge bases, finding queries independent of updates, and rewriting queries using views. The problem of query containment has already attracted many researchers [5, 10, 12, 13, 14, 15, 16, 20, 23]. In [23] J.D. Ullman presents an algorithm based on canonical databases, using an exponential number of such databases. F. Wei and G. Lausen propose an algorithm that uses containment mappings defined for the two queries in [24]. This algorithm increases the number of positive atoms from the first query in the containment problem. Many authors study the problem of query containment under constraints. Thus, in [10] C. Farre et al. present the constructive query containment method to check query containment with constraints. In [14] N. Huyn et al. consider the problem of incrementally checking global integrity constraints. Some authors approach the containment problem for applications in Web services in [8,9,18]. The query containment problem is used for rewriting queries using views by F. Afrati in [1,2]. Checking containment of conjunctive queries without negation (called positive) is an NP-complete problem [5]. It can be solved by testing the existence of a containment mapping corresponding to the two queries. For queries with negation, the query containment problem becomes Π2P-complete. M. Leclere and M.L. Mugnier investigate the
containment problem of conjunctive queries using graph homomorphisms, giving sufficient conditions for query containment in [16]. S. Cohen et al. reduce the containment problem to equivalence for queries with expandable aggregate functions in [7]. In a recent paper the author introduces and studies a notion of strong containment that implies the classical containment of two queries in conjunctive form with negation [11]. In this paper we specify several necessary conditions for the query containment problem. For a special class of queries a necessary and sufficient condition is given. The time complexity of the proposed algorithm depends on the number of containment mappings and the number of the sets of equality relations involved. F. Wei and G. Lausen show that in the worst case the algorithm proposed by them in [24] has the same performance as the one proposed by J. Ullman in [23]. Considering the number of databases used by Ullman's algorithm and the time complexity (specified in Section 5) of the necessary condition from Proposition 9, we remark that the latter is better than the former, as expected. The following example points out the utility of our research.

Example 1. Suppose we have some information about companies, products, supply operations and import restrictions for certain products. Let us consider the following schema that consists of the relations COM, PROD, SUPPLY, RESTR, where:

COM(ComId, Country) contains a set of companies having import or export operations of alimentary products as their activity object. The attribute ComId is an identifier for a company and Country represents the country in which the company identified by ComId is registered.

PROD(ProdId, ProdName) contains data about products. The attribute ProdId is an identifier for a product and ProdName is the product name for the product identified by ProdId.

SUPPLY(ComId1, ComId2, ProdId) contains supply operations: the attribute ProdId represents the product supplied by the company ComId1 to the company ComId2.

RESTR(ProdId, Country1, Country2) contains import-export restrictions, namely that the country Country2 is not allowed to import the product ProdId from the country Country1.

Let us consider two queries denoted Q1 and Q2 about these relations:

Q1: Find all companies ComId such that there exist two products p1 and p2 and two companies ComId1 and ComId2 that satisfy the following property: ComId supplies the product p1 to the company ComId1 and the country corresponding to the company ComId1 has no import restrictions for the product p1 for any import operations from the country corresponding to ComId, and ComId2 supplies the product p2 to the company ComId and the country corresponding to the company ComId has no import restrictions for the product p2 for any import operations from the country corresponding to ComId2.
Q2: Find all companies ComId such that there exist a product p and a company ComId1 that satisfy the following property: ComId supplies the product p to the company ComId1 and the country corresponding to the company ComId1 has no import restrictions for the product p for any import operations from the country corresponding to ComId.

If we denote by H the head of the two queries and use variables as arguments of the literals, we obtain:

Q1: H(x) :− COM(x, y1), PROD(p1, y2), PROD(p2, y3), COM(y4, y5), COM(y6, y7), SUPPLY(x, y4, p1), SUPPLY(y6, x, p2), ¬RESTR(p1, y1, y5), ¬RESTR(p2, y7, y1)
Q2: H(x) :− COM(x, z1), PROD(p, z2), COM(z3, z4), SUPPLY(x, z3, p), ¬RESTR(p, z1, z4)

We are interested in determining whether Q1 ⊆ Q2 and whether Q2 ⊆ Q1. In Example 4 we establish that the first statement is true and the second is false, which implies that the two queries are not equivalent.

The paper is organized as follows: in Section 2 we define the answer of a query for a database, the problem of query containment, and the notion of a satisfiable query. In Section 3 we give several necessary conditions for two queries to be in the containment relation and we point out a necessary and sufficient condition for the containment problem in the case when the second query satisfies a certain restriction. In Section 4 we specify a method to calculate the sets of equality relations required by the condition formulated in Section 3. In Section 5 we give the time complexity of some necessary conditions specified in Section 3. Finally, the conclusion is presented.
2 Preliminaries
Consider two queries Q1 and Q2 having the following forms: Q1: H(x) :− f1(x, y) and Q2: H(x) :− f2(x, z), where

f1(x, y) = R1(w1), ..., Rh(wh), ¬Rh+1(wh+1), ..., ¬Rh+p(wh+p)
f2(x, z) = S1(w1), ..., Sk(wk), ¬Sk+1(wk+1), ..., ¬Sk+n(wk+n)    (1)

The vector x is a variable vector consisting of all free variables from Q1 and Q2; y and z are vectors that consist of all existentially quantified variables from Q1 and Q2, respectively. The symbols Ri and Sj are relational symbols; the wi occurring in f1 are vectors of variables from x or from y, and the wj occurring in f2 are variable vectors with components from x or from z. The character “,” between literals represents logical conjunction. For the sake of simplicity we consider queries without constants, but the results also hold if constants are present; the difference consists in the unifier definition. The following assumptions are made on the variables of the queries from (1): the variables occurring in the head also occur in the body, and all variables occurring
in the negated subgoals also occur in the positive ones. The last constraint is called the safe negation property. A database is a set of atoms defined on a value domain Dom of constants or variables.

Definition 1. For a query Q1 having the form as in (1) and a database D on Dom, we define the answer of Q1 for D, denoted Q1(D), as the set of all H(τx), where τ is a substitution for the variables from x such that there is a substitution τ1, an extension of τ to all variables from y, for which D satisfies the right part of Q1 under τ1. Formally,

Q1(D) = {H(τx) | ∃τ1 an extension of τ such that D |= τ1 f1(x, y)}    (2)
The notation D |= τ1 f1(x, y) means: τ1 Rj(wj) ∈ D for each j, 1 ≤ j ≤ h, and τ1 Rh+i(wh+i) ∉ D for each i, 1 ≤ i ≤ p.

Definition 2. We say that the query Q1 is contained in Q2, denoted Q1 ⊆ Q2, if for each domain Dom and each database D on Dom the answer of Q1 for D is contained in the answer of Q2 for D, that is, Q1(D) ⊆ Q2(D).

Definition 3. A query Q1 having the form as in (1) is satisfiable if there is a database D such that Q1(D) ≠ ∅; otherwise it is unsatisfiable.

Proposition 1. [5] A query Q1 as in (1) is unsatisfiable iff there are Rj(wj), 1 ≤ j ≤ h, and Rh+i(wh+i), 1 ≤ i ≤ p, such that these atoms are identical, that is, Rj = Rh+i and their arguments are equal: wj = wh+i.

In case f1(x, y) satisfies the unsatisfiability condition of Proposition 1, we denote this by f1(x, y) = ⊥. Since in the case f1(x, y) = ⊥ we have Q1(D) = ∅, it is sufficient to consider the case f1(x, y) ≠ ⊥. We need to consider the equality relations defined on the set Y = {x1, ..., xq, y1, ..., ym}, where the xj, 1 ≤ j ≤ q, are all variables from x and the yi, 1 ≤ i ≤ m, are all variables from y. Let us denote by M a set of equality relations on Y; we express M as M = {(tα1, tβ1), ..., (tαs, tβs)}, tαi, tβi ∈ Y. Let M* be the reflexive, symmetric and transitive closure of M. Thus, M* produces a set of equivalence classes; we denote by ȳ the class that contains y. We consider a total order on Y, namely x1 < ... < xq < y1 < ... < ym. For a conjunction of literals like f1(x, y), we define the conjunction ψM f1(x, y) by replacing in f1(x, y) every variable tj with a, where a is the minimum element of the class t̄j with respect to the order “<” defined on Y. By pos(fi) we mean the set of all atoms of the positive part of fi, and neg(fi) denotes the set of all atoms of the negated part of fi. By Rel(pos(fi)) we mean the set of all relational symbols from pos(fi), Rel(neg(fi)) denotes all relational symbols from neg(fi), and Rel(fi) denotes the set of all relational symbols from fi.
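As an illustration of these definitions, the following Python sketch (ours, not part of the paper's formal development; all identifiers are illustrative) represents a query body as lists of positive and negated atoms, computes ψM by mapping every variable to the minimum element of its M*-class, and applies the unsatisfiability test of Proposition 1.

```python
# A query body is a pair (positive atoms, negated atoms); an atom is
# (relation name, tuple of variable names).

def closure_classes(variables, M):
    """Reflexive-symmetric-transitive closure M* as a partition of the variables."""
    parent = {v: v for v in variables}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in M:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    classes = {}
    for v in variables:
        classes.setdefault(find(v), set()).add(v)
    return list(classes.values())

def psi(M, order):
    """psi_M: map every variable to the minimum element of its class w.r.t. '<'."""
    rank = {v: i for i, v in enumerate(order)}
    subst = {}
    for cls in closure_classes(order, M):
        rep = min(cls, key=lambda v: rank[v])
        for v in cls:
            subst[v] = rep
    return subst

def apply_subst(subst, atoms):
    return [(rel, tuple(subst.get(t, t) for t in args)) for rel, args in atoms]

def is_unsatisfiable(pos, neg):
    """Proposition 1: psi_M f1 = ⊥ iff a positive atom coincides with a negated one."""
    return any(p == n for p in pos for n in neg)

# f1(y) = a(y1,y2), a(y2,y3), a(y3,y4), ¬a(y1,y4)  -- the query of Example 5
pos = [("a", ("y1", "y2")), ("a", ("y2", "y3")), ("a", ("y3", "y4"))]
neg = [("a", ("y1", "y4"))]
Y = ["y1", "y2", "y3", "y4"]            # total order x1 < ... < xq < y1 < ... < ym

for M in ([("y3", "y4")], [("y2", "y4")]):
    s = psi(M, Y)
    print(M, is_unsatisfiable(apply_subst(s, pos), apply_subst(s, neg)))
# {(y3,y4)} keeps psi_M f1 satisfiable (it belongs to S2 in Example 5),
# whereas {(y2,y4)} collapses a positive atom onto the negated one.
```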
3 Some Necessary Conditions for Containment Problem
Proposition 2. We have Q1 ⊆ Q2 iff for each value domain Dom, each database D defined on Dom and each substitution τ1 : x ∪ y → Dom, so that D |= τ1 f1 ,
there exists a substitution μ1: x ∪ z → Dom such that D |= μ1 f2 and μ1(xi) = τ1(xi), 1 ≤ i ≤ q.

The following propositions establish some necessary conditions for two queries Q1 and Q2 to be in the containment relation.

Proposition 3. Let Q1 and Q2 be two queries having the forms from (1). If Q1 ⊆ Q2, then the following statement holds: for each database D0 defined on Y and each substitution σ: Y → Y such that D0 |= σf1(x, y), there exists a substitution θ from x ∪ z into Y such that D0 |= θf2(x, z) and θ(xj) = σ(xj), 1 ≤ j ≤ q.

Proof. Use Proposition 2 with Dom = Y.

In the following, the notation τ1(x) = τ2(x) means τ1(xj) = τ2(xj) for each j, 1 ≤ j ≤ q, where τ1 and τ2 are substitutions. In the next propositions we consider certain databases D0 defined on Y.

Proposition 4. Let Q1 and Q2 be two queries having the forms from (1). For each substitution σ from Y into Y, let Dσ be the database defined by Dσ = {R(w) | R ∈ Rel(f1), w is a vector of variables from Y, and R(w) ∉ σ neg(f1)}. If Q1 ⊆ Q2, then the following statement holds: for each σ and Dσ, there exists a substitution θ: x ∪ z → Y with θ(x) = σ(x) such that Dσ |= θf2(x, z).

Proof. Concerning σ and Dσ we have Dσ |= σf1(x, y). The conclusion follows from Proposition 3.

Proposition 5. Let Q1 and Q2 be two queries as in (1). Let Ms be a maximal set defined on Y × Y with the property ψMs f1 ≠ ⊥. If Q1 ⊆ Q2, then there exists a containment mapping θ from neg(f2) to ψMs neg(f1) with θ(x) = ψMs(x) such that θSj(wj) ∈ Dσ0, 1 ≤ j ≤ k, and θSk+l(wk+l) ∉ Dσ0, 1 ≤ l ≤ n, where σ0 = ψMs.

Proof. We take σ = ψMs in Proposition 4 and express the relation Dσ |= θf2(x, z).

Proposition 6. Let Q1 and Q2 be two queries as in (1). Let Ms be a maximal set defined on Y × Y with the property ψMs f1 ≠ ⊥. A necessary condition for Q1 ⊆ Q2 is the following: there exists a containment mapping θ from neg(f2) to ψMs neg(f1) with θ(x) = ψMs(x) such that

θSk+l(wk+l) ∈ ψMs neg(f1), for all l, 1 ≤ l ≤ n,    (3)
θSj(wj) ∈ Dσ0, for all j, 1 ≤ j ≤ k, where σ0 = ψMs.    (4)
Example 2. Let us consider the queries Q1: H() :− f1(y) and Q2: H() :− f2(z), where f1(y) = a(y2, y1), a(y3, y2), ¬a(y3, y1) and f2(z) = a(A, B), a(C, D), ¬a(A, C).
We have Y = {y1, y2, y3}, x = ∅, z = (A, B, C, D). There exists one maximal set, M1 = {(y1, y3)}, computed as in Example 5, and ψM1 neg(f1) = a(y1, y1). A containment mapping θ from Proposition 6 has the form θ(A, B, C, D) = (y1, _, y1, _), where “_” stands for an arbitrary element of Y. Dσ0 contains all atoms a(tα, tβ) with tα, tβ ∈ Y and (tα, tβ) ≠ (y1, y1). If we consider θ1 defined by θ1(A, B, C, D) = (y1, y2, y1, y2), then θ1 satisfies relations (3) and (4) from Proposition 6. Therefore the necessary condition specified in Proposition 6 for Q1 ⊆ Q2 holds.

Example 3. Let us consider two queries Q1 and Q2, where f1(y) is as in Example 2 and f2(z) = a(A, B), a(C, D), ¬a(A, C), ¬a(D, B). As in Example 2, Y = {y1, y2, y3}, z = (A, B, C, D), Ms = M1 = {(y1, y3)}, and ψM1 neg(f1) = a(y1, y1). The unique containment mapping θ from Proposition 6 has the form θ(A, B, C, D) = (y1, y1, y1, y1), but this θ does not satisfy relation (4). That means Q1 ⊈ Q2.

For a set of equality relations M we define a set FM of mappings from pos(f2) into ψM pos(f1) as follows: FM = {τM | τM Sj(wj) ∈ ψM pos(f1) for each j, 1 ≤ j ≤ k, and τM(x) = ψM(x)}.

Remark 1. For M = ∅, FM consists of all containment mappings θ from pos(f2) into pos(f1) such that θSj(wj) ∈ pos(f1) for each j, 1 ≤ j ≤ k, and θ(x) = x.

Definition 4. We consider two partial order relations on sets of equality relations, denoted “<” and “≤”. They are defined as follows: M1 < M2 if M1* ⊂ M2*, that is, the set M1* is strictly included in M2*; M1 ≤ M2 if M1 < M2 or M1* = M2*.

The following lemma establishes some properties of the “<” relation.

Lemma 1. Let M0 < M1 < ... < Ms be a chain of equality relations and let FMi be the set of substitutions corresponding to Mi. We have:
1. ψMi FMj ⊆ FMi, for each i, 0 ≤ i ≤ s, and each j, 0 ≤ j ≤ i.
2. ψMs FMi ⊆ FMs, for each i, 0 ≤ i ≤ s.

The following theorem points out a sufficient condition for the containment relation between two queries Q1 and Q2.

Theorem 1. Let Q1 and Q2 be two queries as in (1). Let us consider the following statements:
(i) For each set of equality relations M ⊆ Y × Y such that ψM f1(x, y) ≠ ⊥, there exists a mapping τM from FM such that τM Sk+l(wk+l) ∈ ψM neg(f1) for each l, 1 ≤ l ≤ n.
(ii) Q1 ⊆ Q2.
Statement (i) implies statement (ii).
Proof. Let Dom be a domain of interpretation, D a database defined on Dom, and λ1 a mapping from Y into Dom such that D |= λ1 f1(x, y). Let Y = {t1, ..., tq+m} and λ1(tj) = vj, 1 ≤ j ≤ q + m. We define a set of equality relations M by M = {(tα, tβ) | tα < tβ and λ1(tα) = λ1(tβ)}. Let w1, ..., wh be all distinct elements of {v1, ..., vq+m}, and let ij = min{l | 1 ≤ l ≤ q + m, wj = vl}, 1 ≤ j ≤ h. Consider the mapping λ1′ defined by λ1′(tij) = vij, 1 ≤ j ≤ h. We have λ1 = λ1′ ψM. Using statement (i), we get D |= λ1′ τM f2(x, z) and λ1′ τM(xj) = λ1(xj), 1 ≤ j ≤ q. Hence, we have shown Q1 ⊆ Q2.

The following proposition establishes a necessary condition for the containment problem of queries that satisfy a certain restriction.

Proposition 7. Let Q1 and Q2 be two queries such that Rel(neg(f2)) is disjoint from Rel(pos(f2)). The following condition is necessary for Q1 ⊆ Q2:
(a) There exists a mapping θ from FM0, where M0 = ∅, such that θSk+l(wk+l) ∈ neg(f1) for each l, 1 ≤ l ≤ n.

Proof. Assume the contrary of statement (a). It follows that (∀θ)(θ ∈ FM0)(∃l)(1 ≤ l ≤ n)[θSk+l(wk+l) ∉ neg(f1)]. Let us define a set M as follows: M = {θSk+l(wk+l) | θ ∈ FM0, θSk+l(wk+l) ∉ neg(f1)}. Using these notations, we have: pos(f1) ∪ M |= f1, pos(f1) ∪ M ⊭ θf2 for all θ ∈ FM0, and pos(f1) ∪ M ⊭ θ′f2 for all θ′ ∉ FM0. These statements contradict the hypothesis Q1 ⊆ Q2.

In [16], M. Leclere and M. L. Mugnier use graph homomorphisms to study necessary or sufficient conditions for the containment problem. They emphasize the restriction on f2 from Proposition 7 in their Property 4. Let us denote by P(M) the following predicate:

P(M): (∃θ)(θ ∈ FM)(∀l)(1 ≤ l ≤ n)[θSk+l(wk+l) ∈ ψM neg(f1)]

In the following proposition we establish that P(M0) implies P(M), where M0 = ∅ and M is any set of equality relations on Y.

Proposition 8. Let M be a set of equality relations on Y and M0 = ∅. Then P(M0) implies P(M).

The proof follows from the definition of the predicate P(M) and Lemma 1. In the following theorem we specify a necessary and sufficient condition for the containment problem in the case when no relational symbol occurs in both the positive part and the negated part of f2.

Theorem 2. Let Q1 and Q2 be two queries as in (1). If Rel(neg(f2)) ∩ Rel(pos(f2)) = ∅, then a necessary and sufficient condition for Q1 ⊆ Q2 is P(M0) = TRUE.
Proof. If P(M0) = TRUE, then from Proposition 8 we get P(M) = TRUE for each M, and therefore, by Theorem 1, we obtain Q1 ⊆ Q2. Conversely, from Proposition 7, if Q1 ⊆ Q2 then P(M0) = TRUE.

Example 4. Let Q1 and Q2 be the queries from Example 1. Considering the problem whether Q1 ⊆ Q2, we find the mapping θ from pos(f2) into pos(f1) with θ(z1, z2, z3, z4, p) = (y1, y2, y4, y5, p1). For this mapping we have θ neg(f2) = RESTR(p1, y1, y5), hence θ neg(f2) ∈ neg(f1). This means P(M0) = TRUE, and using Theorem 2 we get Q1 ⊆ Q2. For the problem whether Q2 ⊆ Q1 we have eight mappings in FM0. For every θ ∈ FM0 we have θRESTR(p1, y1, y5) ∉ {RESTR(p, z1, z4)} or θRESTR(p2, y7, y1) ∉ {RESTR(p, z1, z4)}. By Theorem 2, we obtain Q2 ⊈ Q1.

Now we consider Q2 without the restriction specified in Theorem 2. Hence, in the following we consider queries Q1 and Q2 such that Rel(neg(f2)) ∩ Rel(pos(f2)) ≠ ∅. In the following propositions we give some necessary conditions for the containment problem in this case.

Proposition 9. Let Q1 and Q2 be two queries as in (1) such that the query Q2 satisfies Rel(neg(f2)) ∩ Rel(pos(f2)) ≠ ∅. If Q1 ⊆ Q2, then the following statement holds: for each set of equality relations M defined on Y such that ψM f1(x, y) ≠ ⊥ we have one of the following:
(i) there exists a containment mapping θ from FM such that θSk+l(wk+l) ∈ ψM neg(f1) for each l, 1 ≤ l ≤ n, or
(ii) there exists a substitution θ′ from x ∪ z into Y such that θ′ ∉ FM, θ′(x) = ψM(x), θ′Sj(wj) ∈ ψM pos(f1) ∪ M2 for each j, 1 ≤ j ≤ k, and θ′Sk+l(wk+l) ∉ ψM pos(f1) ∪ M2 for each l, 1 ≤ l ≤ n, where M2 = {θSk+l(wk+l) | θ ∈ FM, θSk+l(wk+l) ∉ ψM neg(f1), 1 ≤ l ≤ n}.

Proof. Let Q1 ⊆ Q2 and let M be a set such that ψM f1(x, y) ≠ ⊥. If assertion (i) is satisfied, we are done. Assume that (i) is not true for M. That means the following statement holds:

(∀θ)(θ ∈ FM)[(∃l)(1 ≤ l ≤ n)(θSk+l(wk+l) ∉ ψM neg(f1))]    (5)

Consider the database D defined on Y as D = ψM pos(f1) ∪ M2. We have:

D |= ψM f1 and D ⊭ θf2, for each θ ∈ FM.    (6)

Since Q1 ⊆ Q2, it follows that there exists a mapping θ′ from x ∪ z into Y, θ′ ∉ FM, such that

D |= θ′f2 and θ′(x) = ψM(x).    (7)

The mapping θ′ from (7) satisfies assertion (ii).
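Returning to the test of Theorem 2, the following brute-force Python sketch (ours, not part of the paper; the data is taken from Example 1) enumerates the candidate mappings in FM0 and checks whether some mapping sends every negated atom of f2 into neg(f1). As the bound |FM0| ≤ h^k in Section 5 indicates, this enumeration is exponential in the worst case.

```python
from itertools import product

def unify(pattern, target, theta):
    """Extend substitution theta so that atom 'pattern' is mapped onto atom 'target'."""
    (rel_p, args_p), (rel_t, args_t) = pattern, target
    if rel_p != rel_t or len(args_p) != len(args_t):
        return None
    theta = dict(theta)
    for a, b in zip(args_p, args_t):
        if theta.setdefault(a, b) != b:
            return None
    return theta

def p_m0(pos1, neg1, pos2, neg2, head_vars):
    """P(M0): is there a theta in F_M0 (theta maps pos(f2) into pos(f1), theta(x) = x)
    whose image of every negated atom of f2 lies in neg(f1)?"""
    start = {x: x for x in head_vars}
    for targets in product(pos1, repeat=len(pos2)):
        theta = start
        for p_atom, t_atom in zip(pos2, targets):
            theta = unify(p_atom, t_atom, theta)
            if theta is None:
                break
        if theta is None:
            continue
        image = [(rel, tuple(theta.get(t, t) for t in args)) for rel, args in neg2]
        if all(atom in neg1 for atom in image):
            return True
    return False

# The queries of Example 1 (check of Q1 ⊆ Q2).
pos1 = [("COM", ("x", "y1")), ("PROD", ("p1", "y2")), ("PROD", ("p2", "y3")),
        ("COM", ("y4", "y5")), ("COM", ("y6", "y7")),
        ("SUPPLY", ("x", "y4", "p1")), ("SUPPLY", ("y6", "x", "p2"))]
neg1 = [("RESTR", ("p1", "y1", "y5")), ("RESTR", ("p2", "y7", "y1"))]
pos2 = [("COM", ("x", "z1")), ("PROD", ("p", "z2")),
        ("COM", ("z3", "z4")), ("SUPPLY", ("x", "z3", "p"))]
neg2 = [("RESTR", ("p", "z1", "z4"))]

print(p_m0(pos1, neg1, pos2, neg2, head_vars=["x"]))
# True: the mapping of Example 4 is found; since RESTR does not occur positively
# in f2, Theorem 2 yields Q1 ⊆ Q2.
```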
Remark 2. If in Proposition 9 we consider only equality relations Ms that are maximal, then we obtain a necessary condition for the Q1 ⊆ Q2 problem expressed with maximal sets Ms from Y × Y under the constraint ψMs f1(x, y) ≠ ⊥.

Propositions 5 and 6 and Remark 2 contain necessary conditions formulated with maximal sets Ms. In the following we focus on these necessary conditions and establish a method to construct these maximal sets.
4 Sets of Equality Relations
We are interested in calculating the sets of equality relations M on Y such that ψM f1(x, y) ≠ ⊥. Concerning two sets of equality relations M1, M2 on Y we have the following result.

Lemma 2. Let M1 and M2 be two sets of equality relations such that M1 ≤ M2. If ψM1 f1(x, y) = ⊥, then ψM2 f1(x, y) = ⊥.

First of all, we characterize the sets M for which ψM f1(x, y) = ⊥; the sets we need are then obtained by negating this characterization. To this end we consider the arguments of each literal from f1. Let l1 = R(t1, ..., tp) be an atom from pos(f1) and l2 = R(t1′, ..., tp′) an atom from neg(f1). Corresponding to the pair (l1, l2) we define an expression El1,l2 as follows:

El1,l2 = (t1 = t1′) ∧ ... ∧ (tp = tp′)    (8)
where “∧” is the logical conjunction. From El1,l2 we eliminate every conjunct (ti = ti′) for which ti and ti′ coincide. Let M denote the set of all pairs of literals (l1, l2) having the form specified above. We define the expression E as the disjunction of all expressions El1,l2 with (l1, l2) ∈ M. Now we take the negation of E, denoted ¬E. We have

¬E ≡ ∧{¬El1,l2 | (l1, l2) ∈ M}, where ¬El1,l2 ≡ ¬(t1 = t1′) ∨ ... ∨ ¬(tp = tp′)    (9)

Using the distributivity of conjunction over disjunction, we obtain an expression equivalent to ¬E but in disjunctive form, ¬E ≡ E1 ∨ ... ∨ Er, where each Ej has the form

Ej = ¬(tα1 = tβ1) ∧ ... ∧ ¬(tαs = tβs)    (10)

Let Ei and Ej be two disjunctive terms of ¬E, where Ej has the form as in (10) and Ei = ¬(tγ1 = tδ1) ∧ ... ∧ ¬(tγh = tδh). If each conjunct ¬(tγk = tδk) of Ei, 1 ≤ k ≤ h, also occurs in Ej, then we eliminate Ej from ¬E, because in this case Ei ∨ Ej = Ei. Let M0 be the set Y × Y = {(y, y′) | y, y′ ∈ Y}. For each Ej, 1 ≤ j ≤ r, we construct a graph Gj = (Y, Vj), where the vertex set is Y and the edge set Vj consists of all pairs (yl, yh), yl, yh ∈ Y, except the pairs (tαl, tβl) which occur in Ej; that is, Vj = M0 − {(tα1, tβ1), ..., (tαs, tβs)}. We need to compute all sets M from Y × Y having the property ψM f1 ≠ ⊥. First, we compute the maximal cliques of Gj. Let C1, ..., Cp be all maximal cliques of
the graph Gj. If Cl is a clique having vertex set Vl = {y1, ..., yh}, then we consider the set of edges induced by Vl, denoted Vl′, where Vl′ = {(yi, yj) | 1 ≤ i, j ≤ h}, 1 ≤ l ≤ p. Now we take all unions of pairwise disjoint subsets among the different Vl′, 1 ≤ l ≤ p. Let us denote by S1(Gj) the class of sets obtained in this manner, and by S2(Gj) all sets from S1(Gj) that are maximal. Let S1 = ∪{S1(Gj) | 1 ≤ j ≤ r} and S2 = ∪{S2(Gj) | 1 ≤ j ≤ r}. Concerning these classes of sets we have the following result.

Theorem 3. Let M be a set of equality relations on Y. The following two assertions hold.
(a) The set of equality relations M is a maximal set with the property ψM f1 ≠ ⊥ iff M* ∈ S2.
(b) The set of equality relations M is a set with the property ψM f1 ≠ ⊥ iff M* ∈ S1.

Remark 3. It is known that the problem of enumerating all maximal cliques in a graph is NP-hard [4, 19, 22].

Example 5. Let Q1 be the following query: Q1: H :− f1(y), where f1(y) = a(y1, y2), a(y2, y3), a(y3, y4), ¬a(y1, y4). The variables yi, 1 ≤ i ≤ 4, are considered existentially quantified in f1(y), and Y = {y1, ..., y4}. We must find all sets of equality relations M on Y such that ψM f1(y) ≠ ⊥. Let l1, l2, l3, l4 be the atoms of f1. We have: El1,l4 = (y2 = y4), El2,l4 = (y1 = y2) ∧ (y3 = y4), El3,l4 = (y1 = y3). The expression E has the form E = (y2 = y4) ∨ ((y1 = y2) ∧ (y3 = y4)) ∨ (y1 = y3). The expression ¬E has the form ¬E ≡ E1 ∨ E2, where E1 = ¬(y2 = y4) ∧ ¬(y1 = y2) ∧ ¬(y1 = y3) and E2 = ¬(y2 = y4) ∧ ¬(y3 = y4) ∧ ¬(y1 = y3). The maximal cliques of the graph G1, given by their edge sets, are {(y1, y4)}, {(y2, y3)}, {(y3, y4)}. The classes S1(G1) and S2(G1) are: S1(G1) = {{(y1, y4)}, {(y2, y3)}, {(y3, y4)}, {(y1, y4), (y2, y3)}}, S2(G1) = {{(y3, y4)}, {(y1, y4), (y2, y3)}}. The maximal cliques of the graph G2 are {(y1, y2)}, {(y2, y3)}, {(y1, y4)}, and the classes S1(G2) and S2(G2) are: S1(G2) = {{(y1, y2)}, {(y2, y3)}, {(y1, y4)}, {(y1, y4), (y2, y3)}}, S2(G2) = {{(y1, y2)}, {(y1, y4), (y2, y3)}}. The classes S1 and S2 are: S1 = {{(y1, y4)}, {(y2, y3)}, {(y3, y4)}, {(y1, y4), (y2, y3)}, {(y1, y2)}}, S2 = {{(y3, y4)}, {(y1, y4), (y2, y3)}, {(y1, y2)}}. Every set from S1 has the property ψM f1 ≠ ⊥, and every set from S2 is maximal with this property.
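The construction of this section can be prototyped as follows. The sketch (ours, not the paper's algorithm P1) takes the DNF terms E1 and E2 of Example 5 as input, builds the graphs Gj, and enumerates their maximal cliques with a basic Bron–Kerbosch procedure (cf. [22]); forming unions of vertex-disjoint cliques then yields S1 and S2 as listed in the example.

```python
from itertools import combinations

def bron_kerbosch(R, P, X, adj, out):
    """Enumerate all maximal cliques of the graph given by adjacency sets 'adj'."""
    if not P and not X:
        out.append(set(R))
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

def graphs_for_dnf(Y, dnf_terms):
    """For each DNF term Ej of ¬E (a set of forbidden equalities), build the
    graph Gj whose edges are all pairs over Y except the forbidden ones."""
    all_pairs = {frozenset(p) for p in combinations(Y, 2)}
    return [all_pairs - {frozenset(p) for p in term} for term in dnf_terms]

def maximal_cliques(Y, edges):
    adj = {v: {w for w in Y if frozenset((v, w)) in edges} for v in Y}
    out = []
    bron_kerbosch(set(), set(Y), set(), adj, out)
    return [c for c in out if len(c) >= 2]      # a singleton carries no equality

# Example 5: Y = {y1..y4}; ¬E = E1 ∨ E2 with the forbidden equality pairs below.
Y = ["y1", "y2", "y3", "y4"]
E1 = [("y2", "y4"), ("y1", "y2"), ("y1", "y3")]
E2 = [("y2", "y4"), ("y3", "y4"), ("y1", "y3")]

for j, edges in enumerate(graphs_for_dnf(Y, [E1, E2]), start=1):
    print(f"G{j}:", [sorted(c) for c in maximal_cliques(Y, edges)])
# The edge sets of these cliques, together with unions of vertex-disjoint ones,
# give S1(Gj); the maximal elements form S2, exactly as listed in Example 5.
```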
5 The Time Complexity of Some Necessary Conditions
We consider the queries Q1 and Q2 specified in (1). Among the necessary conditions specified in this paper, let us consider those expressed by sets of equality relations M that are maximal with the property ψM f1 ≠ ⊥ (Propositions 5 and 6, and Proposition 9 in the case when M is maximal). An algorithm P1 that computes all these maximal sets can be specified using the results of Section 4. Let us denote by |S| the number of elements of a set S. We obtain the following statements:

(a) |FM| ≤ h^k, and the time complexity to test the condition from Theorem 2 is in O(|FM0| · p · n). Hence, it is the same as for positive queries [5].

(b) Let us consider the algorithm P1 that computes all maximal sets Ms such that ψMs f1 ≠ ⊥. We have |pos(f1)| = h and |neg(f1)| = p. Let R1, ..., Rd be all relational symbols Rl occurring in both Rel(pos(f1)) and Rel(neg(f1)), 1 ≤ l ≤ d, and let nl be the arity of Rl. Let us denote by m1,l and m2,l the number of atoms having Rl as relational symbol in pos(f1) and in neg(f1), respectively. The number r of expressions Ej from (10) satisfies r ≤ Σ_{l=1}^{d} nl^(m1,l · m2,l). Clearly m1,l ≤ h and m2,l ≤ p. Denoting by nmax the maximum of the arities nl, 1 ≤ l ≤ d, the integer r satisfies r ≤ d · nmax^(h·p). Let C1, ..., Cpj be all maximal cliques of the graph Gj, 1 ≤ j ≤ r, and let sj = max{|Vl| | 1 ≤ l ≤ pj}. We have sj ≤ q + m and pj ≤ 3^((q+m)/3) (the time complexity to find all maximal cliques of a graph with m vertices is O(3^(m/3)) [22]). If we denote by c1,j and c2,j the time complexity to obtain the classes S1(Gj) and S2(Gj), respectively, then c1,j ≤ 2^(pj·sj) and c2,j ≤ c1,j². Hence, the time complexity to compute the sets from S1 is O(t1), where t1 = r · 2^(t3) and t3 = (q + m) · 3^((q+m)/3), and the time complexity to compute S2 is O(t2), where t2 = r · 2^(2·t3). The time complexity for condition (i) from Proposition 9, when M is a fixed maximal set, is in O(t4), where t4 = h^k · p · n. The number of elements of the set M2 from Proposition 9 is at most |FM| · n ≤ h^k · n. The number of all substitutions θ′ from Proposition 9 is in O(h^(2k) · n^k). The time complexity for condition (ii) from Proposition 9, for a fixed set M, is in O(t5), where t5 = h^(3k) · n^(k+1) · max{k, n}. The time complexity for the condition of Proposition 9 is in O(ne · max{t4, t5}), where ne is the number of sets from S1. If in Proposition 9 we consider only maximal sets, then the complexity expression is the same as above, with ne the number of sets from S2. N. Tamas and C. Gabor provide functions to compute all cliques or all maximal cliques of a graph [21].
6 Conclusion
In this paper we have discussed the problem of query containment, and some necessary conditions for the containment of queries were given. Some of these conditions are expressed by maximal sets associated with the first query. For a class of queries, a necessary and sufficient condition for the containment problem was specified, and some aspects of the time complexity of these conditions were given. Theorem 1 specifies a sufficient condition for the containment problem.
References

1. Afrati, F., Pavlaki, V.: Rewriting Queries Using Views with Negation. AI Communications 19, 229–237 (2006)
2. Afrati, F., Mielikainen, T.: Advanced Topics in Databases. University of Helsinki (2005)
3. Akkoyunlu, E.A.: The enumeration of maximal cliques of large graphs. SIAM Journal of Computing 2, 1–6 (1973)
4. Bomze, I.M., Budinich, M., Pardalos, P.M., Pelillo, M.: The maximum clique problem. In: Handbook of Combinatorial Optimization, vol. 4, pp. 1–74 (1999)
5. Chandra, A.K., Merlin, P.M.: Optimal implementations of conjunctive queries in relational databases. In: ACM Symp. on Theory of Computing (STOC), pp. 77–90 (1977)
6. Cohen, S.: Containment of Aggregate Queries. ACM SIGMOD 34(1), 77–85 (2005)
7. Cohen, S., Nutt, W., Sagiv, Y.: Containment of Aggregate Queries, http://www.macs.hw.ac.uk/~nutt/Publications/icdt03.pdf
8. Deutsch, A., Tannen, V.: XML queries and constraints, containment and reformulation. Theoretical Computer Science 336(1), 57–87 (2005)
9. Dong, X., Halevy, A.Y., Tatarinov, I.: Containment of Nested XML Queries, http://data.cs.washington.edu/papers/nest-vldb.pdf
10. Farre, C., Teniente, E., Urpi, T.: Checking query containment with the CQC method. Data and Knowledge Engineering 53(2), 163–223 (2005)
11. Felea, V.: A Strong Containment Problem for Queries in Conjunctive Form with Negation. In: Proceedings of the First DBKDA 2009, Cancun, Mexico, March 1-6 (2009), http://profs.info.uaic.ro/~felea/FeleaVictor-DB09.pdf
12. Florescu, D., Levy, A., Suciu, D.: Query containment for conjunctive queries with regular expressions. In: ACM Symp. on Principles of Database Systems (PODS), pp. 139–148 (1998)
13. Halevy, A.Y.: Answering Queries Using Views: A Survey. VLDB Journal 10(4), 270–294 (2001)
14. Huyn, N.: Efficient Complete Local Tests for Conjunctive Query Constraints with Negation, http://dbpubs.stanford.edu/pub/1966-26
15. Lausen, G., Wei, F.: On the containment of conjunctive queries. In: Klein, R., Six, H.-W., Wegner, L. (eds.) Computer Science in Perspective. LNCS, vol. 2598, pp. 231–244. Springer, Heidelberg (2003)
16. Leclere, M., Mugnier, M.L.: Some algorithmic improvements for the containment problem of conjunctive queries with negation. In: Schwentick, T., Suciu, D. (eds.) ICDT 2007. LNCS, vol. 4353, pp. 404–418. Springer, Heidelberg (2007)
17. Chen, L.: Testing Query Containment in the Presence of Binding Restrictions. Technical report (1999)
18. Ludascher, B., Nash, A.: Web service composition through declarative queries: the case of conjunctive queries with union and negation. In: Proc. 20th Intern. Conf. on Data Engineering, pp. 840–860 (2004)
19. Makino, K., Uno, T.: New algorithms for enumerating all maximal cliques. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, pp. 260–272. Springer, Heidelberg (2004)
20. Millstein, T., Levy, A., Friedman, M.: Query Containment for Data Integration Systems. In: Proc. of Symp. on Principles of Database Systems, pp. 67–75 (2000)
21. Tamas, N., Gabor, C.: http://www.cneurocvs.rmki.kfki.hu/igraph/doc/R/cliques.html
22. Tomita, E., Tanaka, A., Takahashi, H.: The worst-case time complexity for generating all maximal cliques. In: Proc. 10th Int. Computing and Combinatorics Conf. (2004); also in Theoretical Computer Science 363(1), 28–42 (2006)
23. Ullman, J.D.: Information integration using logical views. In: International Conference on Database Theory (ICDT), pp. 19–40 (1997)
24. Wei, F., Lausen, G.: Containment of Conjunctive Queries with Safe Negation. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 343–357. Springer, Heidelberg (2003)
Optimizing Maintenance of Constraint-Based Database Caches

Joachim Klein and Susanne Braun

Databases and Information Systems, Department of Computer Science, University of Kaiserslautern, P.O. Box 3049, 67653 Kaiserslautern, Germany
[email protected], s [email protected]
Abstract. Caching data reduces user-perceived latency and often enhances availability in case of server crashes or network failures. DB caching aims at local processing of declarative queries in a DBMS-managed cache close to the application. Query evaluation must produce the same results as if done at the remote database backend, which implies that all data records needed to process such a query must be present and controlled by the cache, i. e., that “predicate-specific” loading and unloading of such record sets is achieved. Hence, cache maintenance must be based on cache constraints such that “predicate completeness” of the caching units currently present can be guaranteed at any point in time. We explore how cache groups can be maintained to provide the data currently needed. Moreover, we design and optimize loading and unloading algorithms for sets of records keeping the caching units complete, before we empirically identify the costs involved in cache maintenance.
1 Motivation
Caching data in wide-area networks close to the application removes workload from the server DBMS and, in turn, enhances server scalability, reduces user-perceived latency, and often enhances data availability in case of server crashes or network failures. Simpler forms of caching, e. g., Web caching, keep a set of individual objects in the cache and deliver the object, if present, upon an ID-based request to the user [3]. In contrast to Web caching, DB caching is much more ambitious and aims at declarative, i. e., SQL query evaluation in the cache, which is typically allocated close to an application server at the edge of the Internet. If the cache holds enough data to evaluate a query, it can save response time for the client request, typically issued by a transaction program running under control of the application server. If only a partial or no query result can be derived, the remaining query predicates have to be sent to the backend database to complete query evaluation and to return the result back to the cache. Hence, DB caching substantially enhances the usability of cached data and is a promising solution for various important and performance-critical applications. However, local query evaluation must produce the same answer as if done at the remote DB backend, which implies that all data records needed to process a specific predicate must be present and controlled by the cache. In contrast to
conventional caching, the cache manager has to guarantee, at any point in time, “predicate-specific” loading or unloading of such record sets. The simplest way to accomplish such “completeness” is to cache the whole contents of frequently visited tables by full-table caching [13]. Because this static solution does not allow responding to the actual query workload, more flexible approaches are needed. Some of them are based on materialized views, but also limited to them, i. e., they only support single-table queries [2,4,6,9,10,11]. In contrast, constraint-based DB caching uses specific cache constraints, by which the cache manager can guarantee completeness, freshness, and correctness of cache contents and support multi-table queries. These constraints equip the cache manager with “semantic” knowledge to take care of “predicate completeness” and achieve effective cache maintenance, prerequisites for correct and efficient query evaluation. So far, some vendors provide DB caches based on similar implementation ideas [1,10,14]. Approaches to semantic caching [8] or view caching [2] control record sets in single tables, whereas DB caching supports cache consistency and predicate completeness across multiple tables. Because contribution [7], which defines its basics, only deals with query evaluation at a logical level, performance aspects and cache maintenance have not been considered so far. Hence, we complement this work by a DB-based SQL implementation and an empirical study of novel algorithms for loading and unloading groups of cache tables. For this reason, Sect. 2 briefly repeats the cornerstones of DB caching, whereas Sect. 3 explains the key problems, sketches the measurement environment, and outlines results achieved by known algorithms. The performance weaknesses observed lead to the new maintenance algorithms introduced in Sect. 4 and Sect. 5. After the presentation of our performance gains for the maintenance tasks, we conclude the paper in Sect. 6.
2 Constraint-Based Database Caching
Constraint-based DB caching (CbDBC) maintains a set of cache tables forming a cache group, where specific constraints control its content. Valid states of the cache are accomplished when all cache constraints are satisfied. But they are continuously challenged, because existing cache data has to be updated (due to modifications in the backend), unreferenced cache records have to be removed to save needless overhead for consistency preservation, and new records enter the cache due to locality-of-reference demands. Using the cache constraints, the cache manager is able to decide which data has to be kept and which queries can be answered. A complete description of the concepts of CbDBC can be found in [7]; here, we briefly repeat the most important concepts for comprehension. Each cache table corresponds to a backend table and contains, at any point in time, a subset of the related records of the backend table. For ease of management, cache tables have the same columns and column types as the respective backend tables, however, without adopting the conventional primary/foreign key semantics. Instead, cache table columns can be controlled by unique (U) and non-unique constraints (NU, for arbitrary value distributions). These columns gain their constraining effect by the value-completeness property.
Definition 1 (Value completeness). A value v is value complete (or complete for short) in a column S.c if and only if all records of σc=v SB are in S.

Here, S is the cache table, SB its corresponding backend table, and S.c a column c of S. Completeness of a value in a NU column requires, in general, loading of multiple records, whereas the appearance of a value in a U column automatically makes it value complete. Apparently, value completeness supports the evaluation of equality predicates on the related columns. A further mechanism enables the evaluation of equi-joins in the cache. A so-called referential cache constraint (RCC) links a source column S.a to a target column T.b (S and T not necessarily different) and enforces, for a value v appearing in S.a, value completeness in T.b. Therefore, values in S.a are called control values for T.b.

Definition 2 (Referential cache constraint, RCC). A referential cache constraint S.a → T.b from a source column S.a to a target column T.b is satisfied if and only if all values v in S.a are value complete in T.b.

Only records frequently referenced by queries, i. e., those having high locality in the cache, are beneficial for caching. Therefore, we have designed a special filling mechanism based on a so-called filling column, e. g., T1.a in Fig. 1a. For filling control, we define for it a fill table ftab(T1), an RCC ftab(T1).id → T1.a, and a set cand(T1) which contains the desired candidate values eligible to initiate cache filling.¹ Upon a query reference of a value v listed in cand(T1), e. g., by Q = σa=v T1, it is inserted into ftab(T1).id, if not already present, and hence called a fill value. Via the related RCC, such a value implies value completeness of v in T1.a and, therefore, loading of all records σa=v T1B into T1. To satisfy all cache constraints, RCCs emanating from T1 may trigger additional load operations. As a consequence of v’s completeness, values not present so far in their source columns (e. g., T1.a and T1.b) may have entered T1 and, in turn, imply completeness in their target columns. The newly inserted records in those target tables may again trigger, via outgoing RCCs (e. g., T2.a), further load operations, until all cache constraints are satisfied again. An example of a cache group together with its constituting components is shown in Fig. 1a.
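To illustrate Definitions 1 and 2, the following toy model (ours, not the ACCache implementation; table and column names only loosely follow Fig. 1) represents cache and backend tables as lists of records and checks value completeness and RCC satisfaction.

```python
# Toy model of Definitions 1 and 2: tables are lists of dicts (records).
def value_complete(v, cache_table, backend_table, column):
    """Definition 1: v is complete in S.c iff all backend records with c = v are cached."""
    needed = [r for r in backend_table if r[column] == v]
    return all(r in cache_table for r in needed)

def rcc_satisfied(source_table, source_col, target_cache, target_backend, target_col):
    """Definition 2: every value in the source column is value complete in the target column."""
    return all(
        value_complete(r[source_col], target_cache, target_backend, target_col)
        for r in source_table
    )

# Illustrative fragment: RCC T1.a -> T2.b, 'Jim' has entered the filling column T1.a.
T2_backend = [
    {"a": "p4711", "b": "Jim"},
    {"a": "p4810", "b": "Jim"},
    {"a": "p4810", "b": "Joe"},
]
T1_cache = [{"a": "Jim", "b": 7}]
T2_cache = [{"a": "p4711", "b": "Jim"}]          # only one of Jim's records cached so far

print(rcc_satisfied(T1_cache, "a", T2_cache, T2_backend, "b"))   # False: constraint violated
T2_cache.append({"a": "p4810", "b": "Jim"})      # load the missing record
print(rcc_satisfied(T1_cache, "a", T2_cache, T2_backend, "b"))   # True
```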
3 Cache Loading and Unloading
Fig. 1. Cache group: constraint specification (a), load effect of value ‘Jim’ (b)

To describe cache loading in detail, we need some further terminology. For any RCC S.a → T.b defined in a cache group, the set of records to be (recursively) cached due to the existence of v in S.a is called the closure of v in S.a. Refer to Fig. 1b: loading a record with value T1.a = ‘Jim’ into T1 implies satisfying T1.a → T2.b, which adds the records (‘p4711’, ‘Jim’, ...) and (‘p4810’, ‘Jim’, ...) to T2. In turn, these records insert the new values ‘p4711’ and ‘p4810’ into T2.a, which enforce satisfaction of RCC T2.a → T3.a. Hence, the new values have to be made value complete in T3.a, for which the example in Fig. 1b assumes that the records (‘p4810’, 111, ...) and (‘p4810’, 222, ...) have to be inserted into T3. Note that the closure of ‘Jim’ in T1.a contains the records in T2 controlled by ‘Jim’ and, in turn, all dependent closures recursively emanating from the control values included; e. g., the closure of ‘p4810’ in T2.a contains records in T3, as illustrated in Fig. 1b. Of special importance is the loading/unloading effect of a fill value, because it initiates cache loading or is subject to cache removal. The respective set of records is, therefore, called a caching unit (CU). A fill value (e. g., ‘Jim’ for the filling column T1.a) is managed by the id column of its fill table ftab(T1), which is the source column of a special RCC ftab(T1).id → T1.a. Hence, inserting/removing ‘Jim’ to/from ftab(T1).id implies loading/unloading of an entire caching unit CUT1(‘Jim’). The set of records addressed by CUT1(‘Jim’) is not necessarily the actual set to be considered by the cache manager for load/unload actions. Constraints of different CUs in a cache group may interfere and may refer to the same records such that record sets belong to more than one CU. Assume loading of CUT1(‘Joe’) causes the insertion of the record (‘p4810’, ‘Joe’, ...) into T2. Because value ‘p4810’ is already present in T2.a and, in turn, T3.a is already value complete for ‘p4810’, no further loading of T3 is necessary. On the other hand, the closure of ‘p4810’ must not be removed in case CUT1(‘Jim’) is unloaded by the cache manager. Hence, when loading/unloading a caching unit, only records exclusively addressed by this CU (also denoted as the CU difference) are subject to cache maintenance. This requirement, ensuring correct query evaluation in the cache, adds quite some complexity to cache management.

¹ In contrast, DBCache [1] uses the cache key concept, which implies caching of any value referenced in the related column.
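The closure and CU-difference notions can be made concrete with a small model (ours; the toy data merely mimics the ‘Jim’/‘Joe’ scenario of Fig. 1 and is not taken from the paper): a worklist of control values collects the records a fill value pulls into the cache, and the CU difference is obtained by subtracting the closures of the other currently loaded fill values.

```python
from collections import deque

# Minimal model (ours, not ACCache code): backend tables are lists of dicts,
# RCCs are (source table, source column, target table, target column) tuples.
BACKEND = {
    "T1": [{"a": "Jim", "b": 1}, {"a": "Joe", "b": 2}],
    "T2": [{"a": "p4711", "b": "Jim"}, {"a": "p4810", "b": "Jim"},
           {"a": "p4810", "b": "Joe"}],
    "T3": [{"a": "p4810", "b": 111}, {"a": "p4810", "b": 222}],
}
RCCS = [("T1", "a", "T2", "b"), ("T2", "a", "T3", "a")]

def closure(fill_table, fill_col, fill_value):
    """All records that a fill value pulls into the cache (its caching unit)."""
    loaded = {t: [] for t in BACKEND}
    queue = deque([(fill_table, fill_col, fill_value)])
    seen = set()
    while queue:
        table, col, value = queue.popleft()
        if (table, col, value) in seen:
            continue
        seen.add((table, col, value))
        for rec in BACKEND[table]:
            if rec[col] != value:
                continue
            if rec not in loaded[table]:
                loaded[table].append(rec)
            # every outgoing RCC turns the record's source value into a control value
            for (st, sc, tt, tc) in RCCS:
                if st == table:
                    queue.append((tt, tc, rec[sc]))
    return loaded

def cu_difference(fill_value, other_fill_values, table="T1", col="a"):
    """Records addressed exclusively by CU(fill_value): only these may be unloaded."""
    own = closure(table, col, fill_value)
    others = [closure(table, col, v) for v in other_fill_values]
    return {t: [r for r in own[t] if all(r not in o[t] for o in others)]
            for t in own}

print(closure("T1", "a", "Jim"))                        # CU_T1('Jim') as in Fig. 1b
print(cu_difference("Jim", other_fill_values=["Joe"]))  # the 'p4810' records in T3 are shared
```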
3.1 Key Problems
The structure of a cache group can be considered as a directed graph with cache tables as nodes and RCCs as edges (see Fig. 1a). Handling of cycles in such graphs is the main problem and is, for that reason, considered separately below. To gain a directed acyclic graph (DAG), we isolate cycles in so-called atomic zones (AZ) and manage them separately. Hence, in the simplest case, every cache table is a single atomic zone (trivial atomic zone). Otherwise, tables belonging to a cycle are assigned to the same atomic zone (non-trivial atomic zone). Fig. 2 shows a cache group example with its segmentation into atomic zones. Separation into atomic zones allows us to consider cache group maintenance in the resulting DAG from a higher level of abstraction [5]. Each atomic zone has to be loaded in a single atomic step, i. e., under transaction protection, to guarantee consistent results of concurrent queries. Reconsider Fig. 2: when a caching unit CUnew is loaded, top-down filling, i. e., AZ1 before AZ2 and AZ3, would imply that all affected atomic zones had to be locked until loading of CUnew is finished, because use of AZ1 while the related AZ2 and AZ3 are unavailable would risk inconsistent query results. In contrast, a bottom-up approach allows consistent access to AZ2 or AZ3 for records in CUnew (e. g., when evaluation of a query predicate is confined to the atomic zone), although loading of the corresponding AZ1 is not yet finished. The reversed sequence can be used during unloading. After having removed its fill value from ftab(T1), AZ1 can be “cleaned” before AZ2 and AZ3 (within three transactions).

Fig. 2. Separation of a cache group into atomic zones

Cycles. By encapsulating cycles in atomic zones, we are now ready to consider their specific problems. An RCC cycle is said to be homogeneous² if it involves only a single column per table, for example, T2.c → T3.a, T3.a → T2.c in Fig. 2. Loading of a homogeneous cycle is safe, because it stops after the first pass through the cycle is finished [7]. Unloading, however, may be complicated in homogeneous cycles due to interdependencies of records, as shown in the following example.

Fig. 3. Internal vs. external dependencies within a homogeneous cycle

Example 3.1 (Dependencies in homogeneous cycles). Fig. 3 represents a homogeneous cycle where ‘Jim’ should be deleted from AZ3. If we now try to find out whether or not ‘Jim’ can be removed from the cycle, we have to resolve the cyclic dependency in T1.a → T2.a → T3.a → T4.a → T1.a. A standard solution is to mark records to identify those already visited. However, records cannot only be involved in an internal dependency within the cycle, but also in an external dependency. Such a dependency would exist if value ‘Jim’ were present in S.b. Then, due to the RCC S.b → T2.a in Fig. 3, value ‘Jim’ would be kept in T2.a and no records would be deletable in this example. But a table in a cycle may have no matching records. For example, if records such as (‘Jim’, ‘18/337’) did not exist in T4B, the cycle would be broken for this specific value. Assume a broken cycle for ‘Jim’ in T4.a and simultaneously the existence of ‘Jim’ in S.b; then only (‘Jim’, ‘Agent’) could be deleted from T1. Due to the illustrated problems, contribution [7] recommends deletion of the complete cache content, which implies that caching units with high locality would be reloaded immediately. Therefore, selective unloading, executed as an asynchronous thread, can save refill work and provide more flexible options to maintain the cache content. Sect. 5 provides concepts for proper unloading of caching units and describes some implementation details used in our prototype ACCache.

² Heterogeneous cycles may provoke recursive loading and are, therefore, not recommended in [7].
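The decomposition into atomic zones corresponds to computing the strongly connected components of the RCC graph. The following sketch (ours; the example edges are hypothetical and do not reproduce Fig. 2) uses Tarjan's algorithm, whose output order is already the bottom-up order needed for loading.

```python
from collections import defaultdict

def atomic_zones(tables, rccs):
    """Group cache tables into atomic zones = strongly connected components of the
    RCC graph (Tarjan's algorithm), returned in bottom-up order: zones without
    outgoing RCCs to other zones come first."""
    graph = defaultdict(list)
    for src, tgt in rccs:
        graph[src].append(tgt)
    index, low, on_stack, stack = {}, {}, set(), []
    zones, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            zone = set()
            while True:
                w = stack.pop(); on_stack.discard(w); zone.add(w)
                if w == v:
                    break
            zones.append(zone)

    for t in tables:
        if t not in index:
            strongconnect(t)
    # Tarjan emits SCCs in reverse topological order of the condensation,
    # which is exactly the bottom-up loading order required here.
    return zones

tables = ["T1", "T2", "T3", "T4"]
rccs = [("T1", "T2"), ("T1", "T3"), ("T2", "T3"), ("T2", "T4"), ("T4", "T2")]
print(atomic_zones(tables, rccs))
# e.g. [{'T3'}, {'T2', 'T4'}, {'T1'}]: the cycle T2<->T4 forms one non-trivial zone
```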
3.2 Measurements
The main objective of our empirical experiments is to gain some estimates of the maintenance cost of some basic cache group structures. In all measurements, we have stepwise increased the amount of records to be loaded or deleted, where the given number corresponds to the size of the CU difference caused by the related fill value. In all cases, we have distributed the affected records as uniformly as possible across the cache tables involved.

Important cache group types. We have previously argued that the lengths of RCC chains and homogeneous cycles are interesting for practical cache management. Using this directive, we have measured the maintenance costs of some basic cache groups, as illustrated in Fig. 4.

Fig. 4. Important cache group types

Data generator. To provide suitable data distributions and cache group contents, a data generator, tailor-made for our experiments, analyses the specific cache group, generates records corresponding to a CU difference of a given size, and assigns them uniformly to the backend DB tables involved. All tables have seven columns; if a column is used in an RCC definition, data type INTEGER is used, whereas all other columns have data type VARCHAR(300).

Measurement environment. For all measurements, we use the ACCache system [5] based on an existing DBMS in front- and backend (i. e., DB2), which was extended by the functionality described in Sect. 4 and 5. Applications participating in a test run are specified by means of three worker nodes: the simulated client application triggering loading of CUs by sending SQL queries, the ACCache system, and the backend DBMS. We implemented a tailor-made measurement environment which enabled controlled test runs: each test run was repeated 6 times with newly generated CU data, where the sizes of the CU differences remained stable, but with data of varying content. Hence, all results presented are average response times of dedicated measurements. Because we wanted to explore the performance and quality of the load methods separately from network latency, we ran the applications in a local-area network where data transmission costs are marginal. In the Internet, these costs would be dominated by possibly substantial network delays.
4 Loading of Caching Units
To preserve cache consistency, entire caching units, i. e., all records implied by the insertion of a fill value, must be loaded at a time. Of course, duplicates are removed if records, also belonging to other CUs, are already present. Because SQL insertions are always directed to single tables, records to be loaded are separately requested for each participating table (which coincides with an atomic zone in non-cyclic cases) from the backend.
4.1 Direct Loading
Fig. 5. Results for direct loading: single node, chains, trees, homogeneous cycles (panels: single table; chains with 2–5 tables; trees with 2 outgoing RCCs per table, height 2 (3 tables) and height 3 (7 tables); homogeneous cycles with 2–5 tables; response time [s] over the number of loaded records, up to 3000)

The first method directly inserts the records into the cache tables, where the atomic zones are loaded bottom-up. The quite complex details are described in [5]. In principle, the cache manager requests the data by table-specific predicates,
which reflect the RCC dependencies of the table in the cache group, from the backend DBMS. For each table involved, the record set delivered is inserted observing the bottom-up order, thereby dropping duplicate records. While cache group CG1 in Fig. 4 can be loaded by a simple backend request, i. e., Q1: select * from T1B where T1B.a = ‘v’, CG2 and CG3 obviously need three load requests. Although CG4 consists of only a single atomic zone, up to three requests are necessary to load all tables participating in the cycle. Essentially, the table maintenance cost is caused by the predicate complexity required to specify the records to be inserted. While insertion into T1 of CG1 in Fig. 4 is very cheap, the records to be inserted into CG2, e. g., have to be determined by three queries: filling T1 is similar to Q1 above, whereas the queries Q2 and Q3 for T2 and T3, respectively, are more complex:

Q2: select * from T2B where T2B.a in (select T1B.a from T1B where T1B.a = ‘v’)
Q3: select * from T3B where T3B.a in (select T2B.c from T2B where T2B.a in (select T1B.a from T1B where T1B.a = ‘v’))

In the example, this inherent complexity is needed to determine all join partners for the CU. When inserting records with value T1B.a = ‘v’ into T1 of CG2, Q2 delivers all join partners needed in T2 for T1 (to satisfy RCC T1.a → T2.a) and, in turn, Q3 those in T3 for T2. Apparently, an RCC chain of length n requires n − 1 joins and one selection.
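The construction of these nested load queries can be automated for linear RCC chains; the following sketch (ours; table and column names follow the CG2 example above) regenerates Q1–Q3 and makes the “n − 1 joins plus one selection” pattern explicit.

```python
def direct_load_queries(chain, fill_value):
    """Generate the Q1-Q3 style backend queries for a linear RCC chain.
    Each chain element is (backend table, constrained column, outgoing source column);
    the outgoing column of element i is the RCC source feeding element i+1.
    For CG2 of Fig. 4: [("T1B", "a", "a"), ("T2B", "a", "c"), ("T3B", "a", None)]."""
    table0, col0, out0 = chain[0]
    queries = [f"select * from {table0} where {table0}.{col0} = '{fill_value}'"]
    inner = f"select {table0}.{out0} from {table0} where {table0}.{col0} = '{fill_value}'"
    for table, col, out in chain[1:]:
        queries.append(f"select * from {table} where {table}.{col} in ({inner})")
        if out is not None:
            inner = f"select {table}.{out} from {table} where {table}.{col} in ({inner})"
    return queries

for q in direct_load_queries(
        [("T1B", "a", "a"), ("T2B", "a", "c"), ("T3B", "a", None)], "v"):
    print(q)
# Reproduces Q1-Q3; a chain of length n yields n queries, the last one nesting n-1 joins.
```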
Measurement results. Our experiments reported in Fig. 5 correspond to the cache group types sketched in Sect. 3.2 and primarily address the scalability of the load method. In each case, we continuously increased the number k of records to be loaded up to 3000. Because of the simple selection predicate in CG1 and the missing need for duplicate elimination, Fig. 5a scales very well, and the cost involved in selecting, comparing, and inserting the data was hardly recognizable in the entire range explored. The remaining experiments were coined by the counter-effect of smaller result sets per table and more load queries with more complex predicates to be evaluated. When n tables were involved, the load method had to select ∼k/n records per table. In Fig. 5b, e. g., the experiments for k = 3000 and chains of length 2, 3, 4, and 5 were supplied with 1500, 1000, 750, and 600 records per table, respectively. While the load times quickly entered a range unacceptable for online transaction processing, the existence of cycles augmented this negative performance impact once more. In summary, if the amount of data to be loaded is higher than several hundred records, direct loading cannot be applied. Hence, a new method called indirect loading was designed to avoid the problems encountered.
4.2 Indirect Loading
Fig. 6. New concepts: propagation tables (a), shadow tables (b)

Fig. 7. Improvements achieved by indirect loading (panels: “Comparison (direct/indirect): chain, 3 tables” and “Comparison (direct/indirect): homogeneous cycle, 3 tables”; direct vs. indirect loading, response time [s] over the number of loaded records)

Indirect loading reduces the high selection costs of direct loading using so-called shadow tables. Before making a requested CU available for query evaluation, it is entirely constructed at the cache side and then merged with the actual cache content. This proceeding allows arbitrary CU construction asynchronously to normal cache query processing. Therefore, it implies much simpler predicates of load queries, because the CU fractions of the participating atomic zones can be loaded top-down, for which simple selections on single backend tables are sufficient. For each cache table, a corresponding shadow table (indicated through a subscript S) with identical column definitions is created, which holds the collected records of a requested CU (see Fig. 6b). Before these records are merged bottom-up, the preceding top-down collection is implemented through a simple recursive algorithm based on so-called propagation tables (PT). These tables, defined for each RCC, consist
of only a single column and control the propagation of distinct RCC source values (also denoted as control values) to be loaded into the shadow tables.³ We denote the values propagated through PTs as propagation values. To load a CU (see Fig. 6a), its fill value v is inserted into ftab(T1).id. The PT attached to RCC ftab(T1).id → T1.b obtains value v and triggers value completeness for it in T1.b. In turn, newly loaded control values in T1 are again propagated along the PTs of outgoing RCCs. As long as propagation values are present in PTs, the respective records are collected according to the principles described in Sect. 3. The process stops if all control values are satisfied, i. e., if all propagation values are processed/consumed. Subsequently, the freshly loaded CU is merged in bottom-up fashion with the related cache tables, thereby eliminating duplicate records and observing all RCCs.

Comparison: direct/indirect loading. Fig. 7 compares the performance of direct and indirect loading for two cache group types. We have empirically compared again those cache group types (chains and cycles) which achieved the worst performance with the previous load method. Indirect loading was primarily designed to solve such performance problems and, indeed, the results are quite clear: in both cases, the costs involved for indirect loading were often lower than one second and did not exceed 3 seconds. Note that, because CU preparation in shadow tables is asynchronous, only short locks are necessary for the merging phase. Therefore, the timings are acceptable, because concurrent queries are not severely hindered. The performance reached for loading seems to be further improvable: so far, both methods are executed by the cache DBMS. This means that record selection needs multiple requests to the backend DBMS and, in turn, is burdened by multiple latencies between cache and backend. Therefore, the so-called prepared loading tries to avoid these disadvantages.

³ In Sect. 5, PTs are also used to control propagation of unloading from cache tables.
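A compact way to picture indirect loading is the following Python model (ours, not the SQL-based ACCache implementation; tables are plain lists and the data is illustrative): propagation tables are queues of control values, records are first collected in shadow tables, and only the merge phase touches the cache tables.

```python
from collections import deque

def indirect_load(fill_value, fill_rcc, rccs, backend, cache):
    """Sketch of indirect loading: collect into shadow tables via PTs, then merge."""
    shadow = {t: [] for t in backend}           # one shadow table per cache table
    pts = {rcc: deque() for rcc in rccs}        # one propagation table per RCC
    pts[fill_rcc].append(fill_value)            # ftab(T1).id -> T1.a receives the fill value
    pending = True
    while pending:
        pending = False
        for (src_t, src_c, tgt_t, tgt_c), pt in pts.items():
            while pt:                           # consume propagation values
                pending = True
                value = pt.popleft()
                for rec in backend[tgt_t]:
                    if rec[tgt_c] != value or rec in shadow[tgt_t]:
                        continue
                    shadow[tgt_t].append(rec)   # collect into the shadow table
                    for rcc in rccs:            # propagate new control values onward
                        st, sc, _, _ = rcc
                        if st == tgt_t:
                            pts[rcc].append(rec[sc])
    # merge phase: bottom-up, dropping records already present in the cache
    for table in reversed(list(backend)):       # assume dict order = top-down order
        for rec in shadow[table]:
            if rec not in cache[table]:
                cache[table].append(rec)
    return cache

backend = {
    "T1": [{"a": "Jim", "b": 1}],
    "T2": [{"a": "p4711", "b": "Jim"}, {"a": "p4810", "b": "Jim"}],
    "T3": [{"a": "p4810", "b": 111}, {"a": "p4810", "b": 222}],
}
cache = {"T1": [], "T2": [], "T3": [{"a": "p4810", "b": 111}]}
rccs = [("ftab(T1)", "id", "T1", "a"), ("T1", "a", "T2", "b"), ("T2", "a", "T3", "a")]
print(indirect_load("Jim", rccs[0], rccs, backend, cache))
```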
4.3 Prepared Loading
This method entirely delegates the collection of CU records to the backend DBMS. As a prerequisite, the backend DBMS needs to maintain additional
metadata about the cache groups supplied. The cache manager requests the data for a new caching unit by sending the corresponding fill value to the backend. The way the data is collected is similar to indirect loading, but happens at the backend. A prepared CU is then packaged and transferred to the cache; cache merging only has to observe the uniqueness of records. Because the goal of caching is usually to off-load the server DBMS, this “optimization” partly achieves the opposite and requires the server to maintain cache-sided metadata and to perform extra work. This method, however, may be very useful in the case of cache groups having n atomic zones and high latency between cache and backend, because only a single data transfer to the cache is needed instead of n (and even more in the presence of cycles). Effects of latency, however, are not considered in this paper.
5 Unloading of Caching Units
After having explored various options for cache loading, we now consider selective unloading of cached data. Note that keeping unused data in the cache increases the maintenance costs needed to preserve consistency and freshness without bringing benefits in terms of reduced query response times. Therefore, it is important to monitor data references in the cache and to react by removing fill values and their implied CUs whose reference locality has degraded. Of course, replacement algorithms in cache groups are more complicated than those for normal DB buffers. To unload a fill value together with its CU, the atomic zones involved are traversed top-down (forward-directed unloading), as sketched in Sect. 3.1. The control values to be deleted are propagated using the same PTs already introduced in Sect. 4.2. Note that, because records in a CU may be shared by other CUs, actually only the CU difference must be removed, which implies checking whether or not the records considered for replacement exclusively belong to the CU to be unloaded. In the following, we outline our replacement provisions before we describe unloading in a trivial atomic zone and the more complex procedure for non-trivial atomic zones.
5.1 Replacement Policy
We use the well-known LRU-k algorithm as replacement policy, for which we record the timestamps of the k most recent references to a CU in the related control table (see [12]). The replacement decision for CUs refers to extra information recorded in each control table. The first is a high-water mark concerning the number of related caching units to be simultaneously present in the cache. The second characterizes the minimum fill level observed for CU unloading. The current fill level is approximated by the number of rows in the control table (the number of CUs actually loaded) divided by the number of candidate values. When the fill level reaches the high-water mark, a delete daemon has to remove records to make room in the cache. Such a strategy allows us to separately control the cache space dedicated to each filling column, which enables fine-tuning of locality support and is much more adaptive than a single occupancy factor assigned to the whole cache group.

Fig. 8. Unloading in trivial atomic zones
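A minimal sketch of the replacement bookkeeping described in Sect. 5.1 (ours; the parameters k, the high-water mark, and the candidate count are illustrative, and the real system keeps this information in the control tables) could look as follows.

```python
import time

class ControlTable:
    """Sketch of the per-filling-column bookkeeping: LRU-k reference timestamps
    per fill value plus a high-water mark on the approximated fill level."""
    def __init__(self, k=2, high_water=0.8, num_candidates=100):
        self.k = k
        self.high_water = high_water            # maximum tolerated fill level
        self.num_candidates = num_candidates    # |cand(T)| of the filling column
        self.refs = {}                          # fill value -> k most recent timestamps

    def reference(self, fill_value):
        ts = self.refs.setdefault(fill_value, [])
        ts.append(time.time())
        del ts[:-self.k]                        # keep only the k most recent references

    def fill_level(self):
        return len(self.refs) / self.num_candidates

    def victim(self):
        """Fill value whose k-th most recent reference is oldest; values with fewer
        than k recorded references are evicted first (usual LRU-k convention)."""
        if self.fill_level() < self.high_water:
            return None                         # below the high-water mark: keep everything
        return min(self.refs,
                   key=lambda v: self.refs[v][0] if len(self.refs[v]) == self.k else 0.0)

ct = ControlTable(k=2, high_water=0.02, num_candidates=100)
for v in ("Jim", "Joe", "Jim"):
    ct.reference(v)
print(ct.victim())   # 'Joe': referenced only once, so it is the LRU-2 victim
```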
5.2 Unloading in Trivial Atomic Zones
Consider the cache group fragment shown in Fig. 8. The given processing sequence for atomic zones (see Sect. 3.1) ensures that the related PTs of incoming RCCs obtain all propagation values whenever control values are removed from preceding atomic zones. In the example, deletion in T1 is initiated by a value v propagated through PT1. Hence, value v defines the starting point for the deletion process in table T1. To determine the deletable set of records, each record with T1.b = ‘v’ has to be checked whether or not it is eligible, i. e., whether other control values do not enforce its presence in the cache. In our example, all records can be deleted which are not restrained by control values of RCC2. Therefore, if only the control value 1000 is present in S.a (the source column of RCC2), all records σ(b=‘v’ ∧ d≠1000) T1 can be deleted. Thus, deletion within a trivial atomic zone can be performed with a single delete statement. The following statement Q4 removes all deletable records from AZ1, observing all incoming PTs (in our case PT1 and PT2):

Q4: delete from T1 where (b in (select CV from PT1) or d in (select CV from PT2)) and (b not in (select R.a from R) and d not in (select S.a from S))

As indicated by our cache group examples in Fig. 4, most tables are encapsulated in trivial atomic zones. Because unloading of them can be achieved by considering each atomic zone in isolation (thereby observing the top-down processing sequence), this maintenance task remains simple and can be performed effectively. In rare cases, however, removal of records becomes more complicated if they are involved in cyclic RCC references. Such a case will be discussed in the following.
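Because the delete statement for a trivial atomic zone is fully determined by its incoming RCCs and their propagation tables, it can be generated mechanically; the following sketch (ours; PT, table, and column names follow the Fig. 8 example) reproduces the Q4 pattern.

```python
def trivial_zone_delete(table, incoming):
    """Build a Q4-style delete statement for a trivial atomic zone.
    'incoming' lists, per incoming RCC, the constrained cache column, the
    propagation table feeding it, and the RCC's source table/column."""
    drop = [f"{col} in (select CV from {pt})" for col, pt, _, _ in incoming]
    keep = [f"{col} not in (select {src_t}.{src_c} from {src_t})"
            for col, _, src_t, src_c in incoming]
    return (f"delete from {table} where ({' or '.join(drop)}) "
            f"and ({' and '.join(keep)})")

# Incoming RCCs of T1 in Fig. 8: RCC1 (via PT1) on column b from R.a,
# RCC2 (via PT2) on column d from S.a.
print(trivial_zone_delete("T1", [("b", "PT1", "R", "a"),
                                 ("d", "PT2", "S", "a")]))
# -> delete from T1 where (b in (...) or d in (...)) and (b not in (...) and d not in (...))
```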
5.3 Unloading in Non-trivial Atomic Zones
We now consider in detail the problems sketched in Example 3.1, where we have differentiated internal from external dependencies. To explain their effects and to resolve them, we analyze unloading in the homogeneous cycle shown in Fig. 9. The algorithm proceeds in two phases: global deletion and internal deletion. We denote the values in columns which form a homogeneous cycle as cycle values. To resolve their dependencies as fast as possible, the key idea is to initially find all cycle values whose records are not involved in external dependencies. In Fig. 9, these are all records having value 1000, because values 2000 and 3000 are assumed to be externally referenced through R.a = 'x' and S.a = 'j'. After deletion of all records with cycle value 1000 (in phase global deletion), the atomic zone only holds records which have no more dependencies (neither internal nor external) or records which could not be deleted due to an existing external dependency. Hence, the internal cyclic dependencies are eliminated where possible. The remaining records are deleted within the second phase, called internal deletion, which is also performed in forward direction (similar to trivial atomic zones) but only analyzes the tables within the non-trivial atomic zone. As a consequence, the records having value 3000 in T1 and T2 are deleted in our example; the corresponding record in T3, however, cannot be deleted due to the external dependency present. Subsequently, we consider the two phases in detail.

Global deletion. Refer to Fig. 9 and assume that the value v needs to be deleted as indicated. To find the cycle values whose records are globally deletable, a join between all tables having incoming external RCCs is performed. In our example, these are the tables T1 and T3. Hence, query Q5 returns the deletable cycle values:

Q5: select T1.b from T1, T3
    where T1.b = T3.a
      and T1.a in (select CV from PT1)
      and T1.a not in (select R.a from R)
      and T3.b not in (select S.a from S)

The example shows that it is sufficient to perform the dependency check via the control values of incoming external RCCs. Because cyclic internal dependencies cannot be violated, such cyclic dependencies are not observed in this phase. Hence, this approach exploits the fact that the cycle values can be joined within a homogeneous cycle. When control values are affected during the deletion of records which hold the corresponding cycle values (in our example, the records having value 1000), they have to be propagated only along external, outgoing RCCs (e.g., RCC3 in Fig. 9), using their PTs to continue deletion in subsequent atomic zones. In Fig. 9, this is necessary for value 11, which is completely removed from T2.a.

Internal deletion. Internal deletion is performed in a similar way as unloading in trivial atomic zones. Starting at a table with a PT value attached, all incoming RCCs (external and internal) are checked to find deletable records. Such records are removed, and all affected control values are propagated along all outgoing RCCs (using their PTs). The internal deletion ends when no PT holds propagation values for the related AZ anymore. In our example, the process stops in table T3, because the value 3000 still has external dependencies.

Fig. 9. Unloading in non-trivial atomic zones
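A possible driver for the two phases described in this section is sketched below in C++; the callbacks ExecuteSql and StatementFor are hypothetical stand-ins for the cache manager's actual execution interface and for the Q4-style statement generation, so the sketch only illustrates the control flow.

#include <functional>
#include <string>
#include <vector>

// Hypothetical interfaces: executeSql runs a statement against the cache DB and
// returns the number of affected rows; internalDeleteFor yields the Q4-style
// delete statement for one table of the zone (restricted to the zone's tables).
using ExecuteSql = std::function<long(const std::string&)>;
using StatementFor = std::function<std::string(const std::string& table)>;

// Two-phase unloading of a homogeneous cycle: first global deletion of all
// records whose cycle value is returned by the Q5-style join, then
// forward-directed internal deletion until no propagation values remain.
void unloadHomogeneousCycle(const ExecuteSql& executeSql,
                            const StatementFor& internalDeleteFor,
                            const std::string& q5,   // Q5-style subquery text
                            const std::vector<std::string>& cycleTables,
                            const std::string& cycleColumn) {
    // Phase 1: global deletion via the deletable cycle values of Q5.
    for (const std::string& table : cycleTables)
        executeSql("delete from " + table + " where " + cycleColumn + " in (" + q5 + ")");

    // Phase 2: internal deletion; one pass per table, repeated until a full
    // pass deletes nothing, i.e., no PT of the zone holds propagation values.
    long deleted = 0;
    do {
        deleted = 0;
        for (const std::string& table : cycleTables)
            deleted += executeSql(internalDeleteFor(table));
    } while (deleted > 0);
}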
5.4 Measurement Results
Fig. 10 illustrates the times needed to unload specific caching units. In all cache group types, the unloading process, using SQL statements already prepared, was very efficient (typically much faster than 200 ms). Only the initial statement preparation, included in the first measurements (selecting 100 records), caused comparatively high costs. These costs are also included in preceding measurement results (see Fig. 5) where, however, this minor cost factor is insignificant for the times measured. The execution time consumed to unload a CU within a homogeneous cycle is similar to that needed to unload chains, which illustrates that we are now also able to unload homogeneous cycles with acceptable performance.

Fig. 10. Unloading of cache units: elapsed time (in ms) as a function of the number of loaded records (up to 3000) for single tables, chains of 2 to 5 tables, trees with 2 outgoing RCCs per table (height 2 with 3 tables, height 3 with 7 tables), and homogeneous cycles of 2 to 5 tables
6 Conclusion
CbDBC supports declarative query processing close to applications. The cache constraints to be applied pose particular challenges for cache management and maintenance. With the help of the methods and algorithms presented, it is now possible to selectively load and unload caching units efficiently (also in homogeneous cycles). Starting from the performance problems caused by direct loading, we introduced a new method called indirect loading, which improves cache maintenance dramatically. When latency is too high, preparation of caching units within the backend DBMS could relieve the delays implied by the loading process. Finally, we presented a novel unloading mechanism, which is also able to handle unloading of homogeneous cycles. Supported by a variety of empirical measurements, we confirmed that acceptable maintenance efficiency can be reached for all important cache group types.
References

1. Altinel, M., Bornhövd, C., Krishnamurthy, S., Mohan, C., Pirahesh, H., Reinwald, B.: Cache tables: Paving the way for an adaptive database cache. In: VLDB Conf., pp. 718–729 (2003)
2. Amiri, K., Park, S., Tewari, R., Padmanabhan, S.: DBProxy: A dynamic data cache for web applications. In: ICDE Conf., pp. 821–831 (2003)
3. Anton, J., Jacobs, L., Liu, X., Parker, J., Zeng, Z., Zhong, T.: Web caching for database applications with Oracle Web Cache. In: SIGMOD Conf., pp. 594–599 (2002)
4. Bello, R.G., Dias, K., Downing, A., Feenan Jr., J.J., Finnerty, J.L., Norcott, W.D., Sun, H., Witkowski, A., Ziauddin, M.: Materialized views in Oracle. In: VLDB Conf., pp. 659–664 (1998)
5. Bühmann, A., Härder, T., Merker, C.: A middleware-based approach to database caching. In: Manolopoulos, Y., Pokorný, J., Sellis, T.K. (eds.) ADBIS 2006. LNCS, vol. 4152, pp. 184–199. Springer, Heidelberg (2006)
6. Goldstein, J., Larson, P.: Using materialized views: A practical, scalable solution. In: SIGMOD Conf., pp. 331–342 (2001)
7. Härder, T., Bühmann, A.: Value complete, column complete, predicate complete – Magic words driving the design of cache groups. The VLDB Journal 17(4), 805–826 (2008)
8. Keller, A., Basu, J.: A predicate-based caching scheme for client-server database architectures. The VLDB Journal 5(1), 35–47 (1996)
9. Larson, P., Goldstein, J., Guo, H., Zhou, J.: MTCache: Mid-tier database caching for SQL server. Data Engineering Bulletin 27(2), 35–40 (2004)
10. Larson, P., Goldstein, J., Zhou, J.: MTCache: Transparent mid-tier database caching in SQL server. In: ICDE Conf., pp. 177–189 (2004)
11. Levy, A.Y., Mendelzon, A.O., Sagiv, Y., Srivastava, D.: Answering queries using views. In: PODS Conf., pp. 95–104 (1995)
12. O'Neil, E.J., O'Neil, P.E., Weikum, G.: The LRU-K page replacement algorithm for database disk buffering. In: SIGMOD Conf., pp. 297–306 (1993)
13. Oracle Corporation: Internet application server documentation library (2008), http://www.oracle.com/technology/documentation/appserver.html
14. The TimesTen Team: Mid-tier caching: The TimesTen approach. In: SIGMOD Conf., pp. 588–593 (2002)
The Onion-Tree: Quick Indexing of Complex Data in the Main Memory
Caio César Mori Carélo1, Ives Renê Venturini Pola1, Ricardo Rodrigues Ciferri2, Agma Juci Machado Traina1, Caetano Traina-Jr.1, and Cristina Dutra de Aguiar Ciferri1
1 Departamento de Ciências de Computação, Universidade de São Paulo, 13.560-970, São Carlos - SP, Brazil {ccarelo,ives,agma,caetano,cdac}@icmc.usp.br
2 Departamento de Computação, Universidade Federal de São Carlos, 13.565-905, São Carlos - SP, Brazil [email protected]
Abstract. Searching for elements in a dataset that are similar to a given query element is a core problem in applications that use complex data, and has been carried out aided by a metric access method (MAM). A growing number of these applications require indices that can be built faster and for several times, in addition to providing smaller response times for similarity queries. Besides, the increase in the main memory capacity and its lowering costs also motivate using memory-based MAMs. In this paper, we propose the Onion-tree, a new and robust dynamic memory-based MAM that performs a hierarchical division of the metric space into disjoint subspaces. The Onion-tree is very compact, requiring a small fraction of the main memory (e.g., at most 4.8%). Comparisons of the Onion-tree, a memory-based version of the Slim-tree, and the memory-based MM-tree showed that the Onion-tree always produced the smallest elapsed time to build the index. Our experiments also showed that the Onion-tree produced the best query performance results, followed by the MM-tree, which in turn outperformed the Slim-tree. With regard to the MM-tree, the Onion-tree provided a reduction in the number of distance calculations that ranged from 1% to 11% in range queries and from 16% up to 64% in k-NN queries. The Onion-tree also significantly improved the required elapsed time, which ranged from 12% to 39% in range query processing and from 40% up to 70% in k-NN query processing, as compared to the MM-tree, its closest competitor. The Onion-tree source code is available at http://gbd.dc.ufscar.br/ download/Onion-tree. Keywords: metric access method, complex data, similarity search.
1
Introduction
A metric access method (MAM) is designed aiming at providing efficient access to the growing number of applications that demand to compare complex data,
such as images, audio and video. To improve complex data access, MAMs reduce the search space, leading the search to portions of the dataset where the stored elements probably have higher similarity with a given query element. A similarity measure between two elements can be expressed as a metric that becomes smaller as the elements are more similar [1]. Therefore, MAMs partition the metric space into subspaces so that queries do not have to access the complete dataset. Formally, a metric space is an ordered pair < S, d >, where S is the domain of data elements and d : S × S → R+ is a metric. For any s1 , s2 , s3 ∈ S, the metric must hold the following properties: (i) identity: d(s1 , s1 ) = 0; (ii) symmetry: d(s1 , s2 ) = d(s2 , s1 ); (iii) non-negativity: d(s1 , s2 ) ≥ 0; and (iv) triangular inequality: d(s1 , s2 ) ≤ d(s1 , s3 ) + d(s3 , s2 ) [2]. For instance, elements of a dataset S ⊂ S, which may be represented by numbers, vectors, matrices, graphs or even functions, can be indexed with a MAM using metrics such as the Manhattan (L1 ) or the Euclidean (L2 ) distances [3]. Searching for elements close to a given query element sq ∈ S is a core problem in applications that manage complex data. The two most useful types of similarity queries are the range and the k-nearest neighbor (k-NN) queries, which are defined as follows: • Range query: given a query radius rq , this query retrieves every element si ∈ S that satisfies the condition d(si , sq ) ≤ rq . An example is: “Select the images that are similar to the image P by up to five similarity units”. • k-NN query: given a quantity k ≥ 1, this query retrieves the k elements in S that are the nearest from the query center sq . An example is: “Select the three images most similar to the image P ”. There are disk-based [4,5,6,7,8] and memory-based MAMs [9,10,11,12]. Memorybased MAMs are useful for applications that require to build indices several times, in a very fast way. For instance, they are applied to optimize subqueries in the processing of complex queries. Furthermore, memory-based MAMs do not need to minimize disk accesses as disk-based MAMs do. Thus, memory-based MAMs can provide better partitioning of the metric space, allowing similarity queries to be answered faster. Moreover, the increase in the main memory capacity and its lowering costs motivate the use of memory-based MAMs. The MM-tree [12] is the fastest memory-based MAM to date. However, the partitioning of the MM-tree may generate subspaces of very different sizes, therefore producing highly unbalanced structures. Although Pola et al. [12] propose a policy to minimize this issue, it introduces an additional processing that takes quadratic time. This increases the cost of building the index, which is an important feature for memory-based MAMs. Furthermore, our preliminary experiments showed that the MM-tree is not suitable for high dimensional data. These drawbacks call for improvements on the MM-tree. In this paper, we propose a new and robust dynamic memory-based MAM that extends the MM-tree, called the Onion-tree. We validate the Oniontree through performance tests using datasets that exploits different properties that affect the general behavior of MAMs, such as data volume and data
dimensionality. In the experiments, we compare the proposed Onion-tree with a memory-based version of the Slim-tree [5] and with the MM-tree. The distinctive properties of the Onion-tree are as follows: • A new partitioning method that controls the number of disjoint subspaces generated. This method allows for the creation of shallower and wider structures and does not impair the cost of building the index. • A technique that prevents the creation of subspaces that are too small or too big. This ensures a better division of the metric space, improving both the cost of building the index and the cost of query processing. • Extensions of the MM-tree’s range and k-NN algorithms to support the new partitioning method of the Onion-tree, which include a proper visit order of the subspaces in k-NN queries. This paper is organized as follows. Section 2 reviews related work, while Section 3 summarizes the MM-tree. Section 4 overviews the main characteristics of the proposed Onion-tree, and Sections 5 to 7 detail its properties. Section 8 discusses the experimental results and Section 9 concludes the paper.
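As a point of reference for the two query types defined in this section, the following C++ sketch implements them by a sequential scan over the dataset using the Euclidean metric; this is the naive baseline that any MAM, including the Onion-tree proposed here, aims to outperform, and it is an illustration rather than part of the proposed method.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Euclidean (L2) metric over feature vectors of equal dimensionality.
double l2(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Range query: indices of all elements within distance rq of the query element sq.
std::vector<std::size_t> rangeQuery(const std::vector<std::vector<double>>& data,
                                    const std::vector<double>& sq, double rq) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (l2(data[i], sq) <= rq) result.push_back(i);
    return result;
}

// k-NN query: indices of the k elements closest to the query element sq.
std::vector<std::size_t> knnQuery(const std::vector<std::vector<double>>& data,
                                  const std::vector<double>& sq, std::size_t k) {
    std::vector<std::pair<double, std::size_t>> dist;
    for (std::size_t i = 0; i < data.size(); ++i)
        dist.emplace_back(l2(data[i], sq), i);
    std::sort(dist.begin(), dist.end());
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < std::min(k, dist.size()); ++i)
        result.push_back(dist[i].second);
    return result;
}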
2
Related Work
The pioneering work of Burkhard and Keller [13] introduces approaches to index data in metric spaces, while the works of Ch´avez et al. [2] and Hjaltason and Samet [1] survey existing MAMs. The work on MAMs is quite extensive. The GH-tree [9] is a static MAM that chooses recursively two pivots per node and defines a generalized hyperplane between them, creating two subspaces. Each remaining element is assigned to the subspace of the closest pivot. The GNAT [10] tree extends the GH-tree to choose m ≥ 2 pivots per node, creating m subspaces. Conversely, the partitioning of the VP-tree [11] selects a representative element and defines a ball with covering radius r, which is the average distance between all the elements. The remaining elements are associated to the left subtree if their distances are less or equal than r or to the right subtree otherwise. Differently from these static MAMs, the Onion-tree is a dynamic MAM. So, it supports further insertions without compromising its structure. The first dynamic disk-based MAM is the M-tree [4]. Its leaf nodes store all the elements, while its internal nodes store elements called representatives, which have a covering radius and are chosen by promotion algorithms. The Slim-tree [5] is the first index explicitly designed to reduce the degree of overlap between nodes in a metric tree, improving query performance. The OMNI concept [8] allows to store distances between indexed elements and strategically positioned elements (foci), which are used to improve the prunability during the query processing. The PM-tree [7] proposes the use of OMNI concept to reduce the M-tree node space search. The DBM-tree [6] minimizes the overlap between highdensity nodes by relaxing the height-balancing rule. Differently from these diskbased MAMs, the Onion-tree is a memory-based MAM that defines disjoint regions, allowing similarity queries to be answered without node overlapping.
To the best of our knowledge, the MM-tree [12] is the fastest memory-based MAM in the literature. The experiments described in [12] showed that the MM-tree always outperformed a memory-based version of the Slim-tree, which in turn outperformed the VP-tree. The MM-tree is summarized in Section 3, since the Onion-tree extends this structure.
3
The MM-Tree
The MM-tree [12] is a memory-based MAM that divides the metric space into four disjoint regions by selecting two pivots per node. Fig. 1a shows the structure of a MM-tree that indexes the elements of S = {s1 , s2 , s3 , s4 , s5 , s6 , s7 , s8 }, using {s1 , s2 } as pivots. The distance r between the pivots (i.e., d(s1 , s2 )) defines the ball radius of each pivot, creating four disjoint regions: I, II, III and IV. Each element of S − {s1 , s2 } is assigned to a specific region according to Table 1. For instance, the element s5 is assigned to region II, since d(s5 , s1 ) < r and d(s5 , s2 ) ≥ r. The MM-tree is built recursively, and requires only two distance calculations per level to determine the region of an element. The MM-tree may generate highly unbalanced structures. To overcome this issue, it uses a semi-balancing technique that is applied when a new element is assigned to a leaf node that is full, but has siblings with space to hold the new element. The technique attempts to replace the pivots on the parent node, avoiding creating a new level. Fig. 1b exemplifies the semi-balancing technique. The MM-tree’s algorithms for range and k-NN queries detect overlaps between the query ball and the balls of a node. Unlike the range query that already presents a query radius, the k-NN query uses an active radius that starts with the maximum distance and decreases as nearest elements are found.
Fig. 1. (a) Example of a MM-tree. (b) Example of the semi-balancing technique.
Table 1. Regions of space where the element si can be assigned

d(si, s1) θ r    d(si, s2) θ r    Region
     <                <            I
     <                ≥            II
     ≥                <            III
     ≥                ≥            IV

4
The Proposed Onion-Tree
In this section, we describe the Onion-tree, a new and robust dynamic memorybased MAM that extends the MM-tree. Like the MM-tree, the Onion-tree also divides the metric space into disjoint regions by using two pivots per node. However, the Onion-tree can divide the metric space into more than four disjoint regions per node. The Onion-tree introduces the following properties: • Expansion procedure: a method that increases the number of disjoint regions defined by the pivots of a node. Experimental evidence suggests that a disadvantage in the MM-tree’s partitioning is the size of region IV, which is bigger than the size of regions I, II and III. This generates unbalanced structures because many elements are assigned to region IV. The MM-tree’s semi-balancing policy minimizes this disadvantage, but it degrades the cost of building the index due to an additional processing that takes quadratic time to determine the best-suited to be the new pivots. Conversely, the Onion-tree’s expansion procedure divides region IV to generate more balanced structures. This procedure creates shallower and wider structures and does not impair the elapsed time to build the Onion-tree, since it does not require additional distance calculations. Fig. 2a shows the expansion procedure, which is detailed in Section 5. • Replacement technique: a policy that may replace the pivots of a leaf node in the insertion operation. In the Onion-tree, if the pivots are too close, many expansion procedures are applied to the node. Alternatively, if the pivots are too distant, no expansion procedure is done and most of the node’s subspace is assigned to region I. The Onion-tree’s replacement technique minimizes these cases, ensuring a hierarchical division of the metric space. This technique takes constant time and does not require additional distance calculations. Fig. 2b shows the replacement of the pivot s1 with the element si . The replacement technique is detailed in Section 6. • Extended query algorithms: extensions of the MM-tree’s range and k-NN algorithms to support the additional regions created by the expansion procedure. Furthermore, we added to the Onion-tree’s k-NN query a sequence order to visit the disjoint regions of a node (e.g., Fig. 2c), which improves the prunability. The extended algorithms are detailed in Section 7.
Fig. 2. Overview of the Onion-tree's distinctive properties: (a) the expansion procedure, (b) the replacement technique (replacing pivot s1 with the inserted element si), and (c) the visit order for a query element sq (VII, V, IV, VI, II, I, and III)

Table 2. The structure of an Onion-tree's node N

Symbol                    Description
N.s1                      first pivot
N.s2                      second pivot
N.r                       distance between the pivots (i.e., radius)
N.Expansion               number of expansions
N.Region                  number of regions
N.F                       link to the parent node
N.Son[1 . . . N.Region]   links to the node's regions
To support the aforementioned properties, the structure of each node N of the Onion-tree is composed of the attributes described in Table 2.
5
The Expansion Procedure
The expansion procedure divides recursively the external region of a node into four regions. Each expansion procedure adds three regions to the node, since the previous external region becomes the first region of the expansion. Fig. 3 shows two expansion procedures applied to an node N , which is initially divided into regions I, II, III and IV. The first procedure (i.e., expansion 1) generates a node N ’ with seven regions: regions IV, V, VI and VII in addition to regions I, II and III of expansion 0 (i.e., initial configuration without expansion). The previous external region IV, therefore, is divided into regions IV to VII. The second procedure (i.e., expansion 2) produces a node N ” with ten regions: regions VII, VIII, IX and X plus regions I, II and III of expansion 0 and regions IV, V and VI of expansion 1. Each expansion procedure increases the pivots radius by r. This ensures the division of only the external region of a node. In Fig. 3, the radius in the first expansion is 2N.r, while the radius in the second one is 3N.r.
Fig. 3. Nodes N, N', and N'', their pivots and regions (expansion 0: regions I to IV with radius N.r; expansion 1: regions I to VII with radius 2N.r; expansion 2: regions I to X with radius 3N.r)
Two additional aspects related to the use of expansions are defining how many expansions should be applied to a node (Section 5.1) and identifying the region to which an element should be assigned (Section 5.2).

5.1
Creating Expansions and Regions
The Onion-tree proposes two policies to determine the number of expansions that can be applied to its nodes: fixed and variable. The fixed expansion applies the same number of expansions to each node, according to an input parameter. For instance, the number of expansions equal to one establishes that all the nodes have seven regions. Conversely, the variable policy applies different numbers of expansions to each node. Therefore, a node can have ten regions (i.e., two expansions), while another can have four regions (i.e., zero expansion). The objective of the variable policy is to keep the external region small. Thus, the strategy adopted by this policy defines that expansions are needed only when the radius of the current node is less than half of the radius of the parent node. Otherwise, the node’s regions already cover its subspace and there is no need for further expansions. We call this approach keep-small strategy. The CreateRegions algorithm (Algorithm 1) determines the number of expansions that should be applied to a node N , as a function of an input integer E. If E > 0, the policy is set to fixed and the value of E is used as the number of expansions (line 1). Otherwise, the policy is set to variable and the number of expansions is calculated by the keep-small strategy (lines 2 to 4). Finally, the algorithm determines the number of regions of N (line 5).
Algorithm 1. CreateRegions (N, E)
1   if E > 0 then N.Expansion ← E;
2   else
3       if N.r ≥ N.F.r / 2 then N.Expansion ← 0;
4       else N.Expansion ← N.F.r / N.r;
5   N.Region ← (N.Expansion × 3) + 4;

5.2
Choosing a Region
The ChooseRegion algorithm (Algorithm 2) determines the region to which an element si is assigned to. Its inputs are a node N (as defined in Table 2), whose regions were defined by the CreateRegions algorithm, and d1 and d2 , which are the distances of si to the pivots of N . The algorithm analyzes each region to determine the one that encompasses d1 and d2 (lines 2 to 6). If no region is found, si is associated to the external region of N (lines 7 to 9).
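For illustration, the region-selection step can be transcribed into C++ as follows; the stripped-down node structure is an assumption of this sketch (the actual arboretum-based implementation differs), and falling through the loop plays the role of the decrement in the external-region case of Algorithm 2 below.

// Simplified node layout with only the fields needed here (cf. Table 2).
struct OnionNode {
    double r;        // distance between the two pivots (N.r)
    int expansions;  // N.Expansion
};

// Given the distances d1 and d2 of an element to the two pivots, return the
// number of the region the element belongs to, mirroring Algorithm 2.
int chooseRegion(const OnionNode& n, double d1, double d2) {
    double radius = 0.0;
    int expansion = 0;
    int region = 0;
    for (expansion = 0; expansion <= n.expansions; ++expansion) {
        radius += n.r;   // each expansion widens both pivot balls by r
        if (d1 < radius && d2 < radius) { region = 1; break; }
        if (d1 < radius && d2 >= radius) { region = 2; break; }
        if (d1 >= radius && d2 < radius) { region = 3; break; }
    }
    if (region == 0) {                 // external region of the last expansion
        expansion = n.expansions;
        region = 4;
    }
    return expansion * 3 + region;
}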
Algorithm 2. ChooseRegion (N, d1, d2)
1   Expansion ← 0; Region ← 0; R ← 0;   // auxiliary variables
2   for Expansion ← 0 to N.Expansion do
3       R ← R + N.r;
4       if d1 < R and d2 < R then Region ← 1; break;
5       if d1 < R and d2 ≥ R then Region ← 2; break;
6       if d1 ≥ R and d2 < R then Region ← 3; break;
7   if Region = 0 then
8       Expansion ← Expansion − 1;
9       Region ← 4;
10  return (Expansion × 3) + Region;

6
The Replacement Technique
The replacement technique is applied before the insertion of an element si into a full leaf node. It determines if the subspace of a node N is better partitioned if any of its pivots (e.g., s1 ) changes position with si . This technique also replaces the radius of N with the distance between si and the non-chosen pivot (e.g., s2 ). Fig. 4 shows an example of the replacement technique. In this example, si changes position with s1 , and r and d2 are replaced properly. The Replace algorithm (Algorithm 3) determines whether the insertion of an element si into a node N requires a replacement, using the distances between si and each node’s pivot (i.e., d1 and d2 ). First, the algorithm uses the keep-small strategy to determine the value of α, so that no expansion procedure will be applied if N.r = α (line 1). Then, it calculates the absolute values of the differences between α and the distances N.r, d1 and d2 (line 2). If d1 is the closest value to α, the pivot N.s2 is replaced with si (line 4). Alternatively, if d2 is the
closest value to α, N.s1 is replaced with si (line 6). When the algorithm ends, the distance between the pivots is the closest one to α.

Fig. 4. Example of the replacement technique

Algorithm 3. Replacement (N, si, d1, d2)
1   α ← N.F.r / 2;   // the keep-small strategy
2   λr ← |N.r − α|; λ1 ← |d1 − α|; λ2 ← |d2 − α|;
3   if λ1 < λ2 then
4       if λ1 < λr then Replace (N.s2, N.r) with (si, d1);
5   else
6       if λ2 < λr then Replace (N.s1, N.r) with (si, d2);
The replacement technique is used in Algorithm 4 to insert an element si into a node N as follows. If the node is empty, si is inserted as its first pivot (line 1), and if the node has only one element, si is inserted as its second pivot (line 3). Otherwise, the algorithm first calculates the distances between si and each pivot (lines 5 and 6). Then, it checks whether N is a leaf node, and if so, calculates the radius of N (line 8), verifies if si should be replaced with one of the pivots (line 9), and defines the number of expansions to be applied to N (line 10). Finally, the insertion continues recursively in the subtree defined by the ChooseRegion algorithm (lines 11 and 12).
Algorithm 4. Insert (N, si)
1   if N.s1 = null then N.s1 ← si;
2   else
3       if N.s2 = null then N.s2 ← si;
4       else
5           d1 ← d(si, N.s1);
6           d2 ← d(si, N.s2);
7           if isLeaf[N] then
8               N.r ← d(N.s1, N.s2);
9               Replacement (N, si, d1, d2);
10              CreateRegions (N);
11          Region ← ChooseRegion (N, d1, d2);
12          Insert (N.Son[Region], si);
7
Extended Query Algorithms
The MM-tree’s range and k-NN queries deal with only four regions per node. For joint use with the Onion-tree, these queries were adapted to allow the search for elements in all the regions created by the expansion procedures (line 8 of Algorithm 5 and line 13 of Algorithm 6).
Algorithm 5. Range (N, sq, rq)
1   if N = null then return;
2   else
3       d1 ← d(sq, N.s1);
4       d2 ← d(sq, N.s2);
5       if d1 ≤ rq then Add N.s1 to the Result;
6       if d2 ≤ rq then Add N.s2 to the Result;
7       for Region ← 1 to N.Region do
8           if Query radius intersects region N.Son[Region] then
9               Range (N.Son[Region], sq, rq);
Algorithm 6. KNN (N, sq, k, ra)
1   if N = null then return;
2   else
3       d1 ← d(sq, N.s1);
4       d2 ← d(sq, N.s2);
5       if Result.Size() < k then ra ← ∞;   // ra: active radius
6       else ra ← Result[k].Distance;
7       if d1 ≤ ra then Add N.s1 to the Result, keeping it sorted;
8       if d2 ≤ ra then Add N.s2 to the Result, keeping it sorted;
9       Order ← Visit order of sq on N;
10      for i ← 1 to N.Region do
11          Region ← Order.Next();
12          if Query radius intersects region N.Son[Region] then
13              KNN (N.Son[Region], sq, k, ra);
A novelty introduced by the Onion-tree in the KNN algorithm is a new policy to choose the visit order of the regions of a node (lines 11 to 13). Although the best possible visit order depends on data distribution, we assume that the datasets have clusters and the elements are inserted randomly. This reflects the most common scenario regarding real world datasets. The proposed policy works as follows. First, visit the region of a node N where the query element sq lies. Then, visit the remaining regions of N according to their proximity to sq . The policy determines both the visit order of the expansions of N and the visit order of their regions. The expansions are visited in the following order: (i) expansion E to which sq is assigned to; (ii) expansions E − 1
and E + 1; (iii) expansions E − 2 and E + 2; and so on. Visiting an internal expansion (e.g., E − 1) before an external one (e.g., E + 1) leads to a faster active radius reduction, as internal expansions are smaller and have higher probability to contain closer elements. This order improves the prunability. With regard to the regions of the expansions, they are visited in the order specified in Table 3. In this table, the regions of E are defined using the modulo 3 operator, since each expansion procedure adds three regions to the node, and R is the covering radius used in E (i.e., R = (E + 1) × N.r). Fig. 5 shows the visit order of a query element sq, which is assigned to region IV of a node with two expansions. According to the proposed policy, the visit order of the expansions is 1, 0 and 2. Furthermore, since sq is assigned to region IV and is closer to s2 (i.e., 4 modulo 3 = 1 and d1 > d2), the visit order of the regions is IV, VI and V (second line in Table 3).

Table 3. Visit order of the regions

Region of sq (mod 3)   Condition           1st      2nd      3rd      4th
        1              d1 ≤ d2            3E + 1   3E + 2   3E + 3   3E + 4
        1              d1 > d2            3E + 1   3E + 3   3E + 2   3E + 4
        2              d2 − R ≤ R − d1    3E + 2   3E + 1   3E + 4   3E + 3
        2              d2 − R > R − d1    3E + 2   3E + 4   3E + 1   3E + 3
        3              d1 − R ≤ R − d2    3E + 3   3E + 4   3E + 1   3E + 2
        3              d1 − R > R − d2    3E + 3   3E + 1   3E + 4   3E + 2
        4              d1 ≤ d2            3E + 4   3E + 2   3E + 1   3E + 3
        4              d1 > d2            3E + 4   3E + 3   3E + 1   3E + 2

Fig. 5. Visit order of the expansions and their regions for sq (expansion 0: regions I, III, II; expansion 1: regions IV, VI, V; expansion 2: regions VII, IX, VIII, X)
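The visit-order policy of Table 3 can also be expressed compactly in code. The following C++ sketch returns, for one expansion E, the four region offsets to be added to 3E, under the parameter conventions assumed in the table; it is an illustration written for these notes, not the authors' implementation.

#include <array>

// 'region' is the region number of the query element within expansion E
// (1..4, i.e. the value used in the "mod 3" column of Table 3), d1 and d2 are
// the distances of sq to the two pivots, and R = (E + 1) * N.r is the covering
// radius of E. The returned offsets are added to 3E to obtain region numbers.
std::array<int, 4> visitOrder(int region, double d1, double d2, double R) {
    switch (region) {
        case 1: return (d1 <= d2) ? std::array<int, 4>{1, 2, 3, 4}
                                  : std::array<int, 4>{1, 3, 2, 4};
        case 2: return (d2 - R <= R - d1) ? std::array<int, 4>{2, 1, 4, 3}
                                          : std::array<int, 4>{2, 4, 1, 3};
        case 3: return (d1 - R <= R - d2) ? std::array<int, 4>{3, 4, 1, 2}
                                          : std::array<int, 4>{3, 1, 4, 2};
        default: return (d1 <= d2) ? std::array<int, 4>{4, 2, 1, 3}
                                   : std::array<int, 4>{4, 3, 1, 2};
    }
}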
Table 4. Datasets used in the experiments

Dataset            Elements    D    Description
Brazilian cities       5507    2    Geographical coordinates of Brazilian cities (www.ibge.gov.br).
Color histograms      68025   32    Color image histograms from the KDD repository of the University of California at Irvine (kdd.ics.uci.edu).
KDD Cup 2008         102240  117    Training dataset containing breast cancer suspicious regions (www.kddcup2008.com).

8
Experimental Results
The Onion-tree was analyzed through performance tests using datasets that exploit different properties that affect the general behavior of MAMs, such as data volume and data dimensionality. We used three datasets with data volumes of different orders of magnitude (i.e., 10^3, 10^4 and 10^5 elements), and applied the metric L2 to compare the elements. Table 4 shows the characteristics of these datasets, where the dimensionality of the elements is represented by D. We compared two versions of the Onion-tree with a memory-based version of the Slim-tree and with the MM-tree. The first version of the Onion-tree used the variable policy, while the second one used the fixed policy (Section 5.1). We call the former version V-Onion-tree and the latter one F-Onion-tree. For the F-Onion-tree, we applied seven expansions to its nodes. We experimentally found that this number of expansions is the best value among the values of one, three, five, seven and nine expansions for the selected datasets. The optimum number of expansions depends on the characteristics of the dataset, especially data distribution. The Slim-tree was built using the min-occupation and the minimum spanning tree policies with 50 elements per node, and the MM-tree was built using the semi-balancing technique. These are the best configurations of these structures according to their authors. Due to space limitations, we only present here the results of the Onion-tree, the MM-tree and the Slim-tree, since Pola et al. [12] have already compared the MM-tree with the VP-tree and shown that the MM-tree always outperformed the VP-tree. The Onion-tree was implemented in C++ using the arboretum framework (gbdi.icmc.usp.br/arboretum). The source codes of the Slim-tree and the MM-tree were obtained from this framework and compiled under the same settings. The experiments were conducted on a computer with a 2.4 GHz Intel Core 2 Duo processor and 2 GB of 1067 MHz DDR3 main memory. In our tests, we analyzed the cost of building the index as well as the cost of the query processing. The number of distance calculations and the elapsed time were recorded for the former (Section 8.1). As for the latter, we recorded the average number of distance calculations and the elapsed time needed to process 500 random queries (Sections 8.2 and 8.3).
8.1
Performance Results for Building the Index
For dynamic memory-based MAMs, the cost of building the index is one of the most important aspects. These MAMs are frequently used to optimize subqueries. Therefore, the index is often built several times, such as when a user interacts with an application that searches for similarity. Table 5 and Fig. 6 show the performance results when building the indices. The results indicated that the Onion-tree is a very compact index. Both the F-Onion-tree and the V-Onion-tree required a very small fraction of the available main memory. The size of the Onion-tree was about 100 MB (i.e., 4.8% of the main memory) for the biggest dataset (KDD Cup 2008 dataset), against the size of 223 MB of the Slim-tree (i.e., 10.8% of the main memory). Regarding the number of distance calculations, the Onion-tree produced the best performance results for the Brazilian cities and the Color histograms datasets. Compared to the MM-tree, the performance gain was of 41.46% and 18.36%, respectively, and compared to the Slim-tree, it was of 85.33% and 78.49%, respectively. Therefore, the Onion-tree provided a great reduction in the number of distance calculations. For the KDD Cup 2008 dataset, which has the highest dimensionality, the MM-tree required the smallest number of distance calculations. The F-Onion-tree, however, required only a slightly increase of 7.63%. As for the elapsed time, the Onion-tree always produced the smallest overhead to build the index. The MM-tree took more time because its semi-balancing technique is very costly (see Section 4 for details). Comparing the F-Onion-tree and the V-Onion-tree, the F-Onion-tree required less distance calculations than the V-Onion-tree for all the datasets. The
difference in the performance results ranged from 9% to 26%. This gain is very important for high-cost metrics, for which the F-Onion-tree proved to be the best Onion-tree version. Similar to the number of distance calculations, the F-Onion-tree produced the best results for the elapsed time to index the Color histograms and the KDD Cup 2008 datasets. However, the difference in the elapsed time of these two versions was very slight: at most 9%. We conclude that the F-Onion-tree generates less overhead than the V-Onion-tree to be built. As discussed in Section 1, memory-based MAMs do not need to minimize disk accesses, so they can provide better partitioning of the metric space. This conclusion was confirmed in our experiments, since the Onion-tree required fewer distance calculations (from 60% to 85%) and was built faster (from 30% to 79%) than the Slim-tree. This is because the memory-based Slim-tree applies the same partitioning technique of the original disk-based Slim-tree.

Table 5. Number of distance calculations to build the indices

Dataset             Slim-tree    MM-tree    V-Onion-tree   F-Onion-tree
Brazilian cities      358,236     89,783         68,443         52,555
Color histograms    5,135,030  1,352,870      1,498,686      1,104,375
KDD Cup 2008        5,727,850  2,075,798      2,455,598      2,234,211

Fig. 6. Elapsed time to build the indices (in seconds): (a) Brazilian cities, (b) Color histograms, (c) KDD Cup 2008

8.2
Performance Results for Range Query Processing
The radius of range queries ranged from 1% to 10% of the dataset radius (i.e., half of the largest distance among all pairs of dataset elements) for the Brazilian cities and the Color histograms datasets. As for the KDD Cup 2008 dataset, the radius of range queries was set to higher values, due to the curse of dimensionality [2]. For this dataset, the radius ranged from 31% to 40% of the largest distance among all the pairs of the query elements, and the queries recovered from 1% to 10% of the number of elements. Both the F-Onion-tree and the V-Onion-tree outperformed the MM-tree and the Slim-tree for all the datasets, with regard to the number of distance calculations and the elapsed time in query processing. As the MM-tree outperforms the Slim-tree, table 6 shows the performance gain of the Onion-tree compared with only the MM-tree, i.e., it compares how much faster the Onion-tree is than the MM-tree. While the V-Onion-tree obtained a slightly gain (from 1.38% to 2.54%) in the number of distance calculations, the F-Onion-tree obtained a higher gain (from 4.01% to 11.3%). Regarding the elapsed time for query processing, the Onion-tree also produced better results: the V-Onion-tree obtained a performance gain ranging from 15% up to 33%, while the F-Onion-tree obtained a gain ranging from 12% up to 39%. Comparing the two versions of the Onion-tree, the F-Onion-tree produced a slightly better performance than the V-Onion-tree for datasets using high-cost metrics, such as the Color histograms and the KDD Cup 2008 datasets. For these datasets, the high data dimensionality increased the cost of the distance calculations, which impaired the elapsed time and benefited the F-Onion-tree. Fig. 7 shows the average number of distance calculations and the elapsed time for range queries, as a function of the radius. Note that Fig. 7a, 7c and 7d are in log scale for better results visualization. For all the datasets and the two performance measures, both versions of the Onion-tree outperformed both the MM-tree and the Slim-tree.
Table 6. The Onion-tree's performance gains (range queries)

                      Distance calculations           Elapsed time
Dataset             V-Onion-tree  F-Onion-tree   V-Onion-tree  F-Onion-tree
Brazilian cities        2.54%        8.16%          15.04%        12.59%
Color histograms        2.47%       11.32%          32.98%        39.01%
KDD Cup 2008            1.83%        4.01%          25.13%        26.69%

Fig. 7. Range queries results: average number of distance calculations and total elapsed time (in seconds) as a function of the query radius, for (a)-(b) Brazilian cities, (c)-(d) Color histograms, and (e)-(f) KDD Cup 2008
8.3
Performance Results for k-NN Query Processing
The gains achieved by the Onion-tree to execute k-NN queries are even better than to execute range queries. Table 7 shows the Onion-tree’s gains when compared to the MM-tree. The value of k ranged from 2 to 20, encompassing the most common values used when performing k-NN queries.
Table 7. The Onion-tree's performance gains (k-NN queries)

                      Distance calculations           Elapsed time
Dataset             V-Onion-tree  F-Onion-tree   V-Onion-tree  F-Onion-tree
Brazilian cities       16.17%       22.32%          40.15%        38.83%
Color histograms       43.82%       42.02%          45.55%        45.38%
KDD Cup 2008           64.26%       64.51%          70.24%        67.85%

Fig. 8. k-NN queries results: average number of distance calculations and total elapsed time (in seconds) as a function of k (2 to 20), for (a)-(b) Brazilian cities, (c)-(d) Color histograms, and (e)-(f) KDD Cup 2008
Regarding the number of distance calculations, the V-Onion-tree produced a performance gain from 16% up to 64% with respect to the MM-tree, while the F-Onion-tree produced a performance gain from 22% up to 64%. Comparing the two versions of the Onion-tree, the V-Onion-tree was more efficient than the F-Onion-tree, generating an impressive elapsed time gain that ranged from 40% up to 70% with regard to the MM-tree. This difference is related to the fact
that, although the number of distance calculations required by both versions are similar, there are more regions per node to be visited in the F-Onion-tree. Fig. 8 shows the average number of distance calculations and the elapsed time for different k values for k-NN queries. Fig. 8a is in log scale for better results visualization. The two versions of the Onion-tree outperformed the MM-tree and the Slim-tree. For the highest dimensionality dataset (i.e., the KDD Cup 2008 dataset), the MM-tree presented the worst results, surpassing the cost of the sequential search (i.e., more than 102,240 distance calculations). Using the expansion procedure, the replacement technique and the extended query algorithms allowed the F-Onion-tree and the V-Onion-tree to maintain good performance even over high dimensional data. Therefore, we conclude that the Onion-tree is a robust MAM to index complex data.
9
Conclusions and Future Work
In this paper, we propose the Onion-tree, a new and robust dynamic memory-based metric access method, which divides the metric space into several disjoint subspaces to index complex data. The Onion-tree introduces the following distinctive properties. It is based on an expansion procedure that controls the number of disjoint subspaces generated in the index building. It also applies a replacement technique that prevents the creation of subspaces that are too small or too big. Furthermore, the Onion-tree proposes similarity search algorithms to support its new partitioning method without overlap, in addition to a different visit order of the subspaces in k-NN queries. The Onion-tree was validated through performance tests that issued range and k-NN queries over datasets with different data volumes and dimensionalities. The results showed that the Onion-tree is very compact, requiring a small fraction of the main memory. Comparisons of the Onion-tree, a memory-based version of the Slim-tree and the MM-tree indicated that the Onion-tree always required the smallest elapsed time to build the index. The query processing results also showed that the Onion-tree obtained the best performance results, followed by the MM-tree, which in turn outperformed the Slim-tree in almost all the experiments. The Onion-tree also required fewer distance calculations in query processing. Compared with the MM-tree, the Onion-tree reduction in the number of distance calculations ranged from 1% to 11% in range query processing, and from 16% up to 64% to answer k-NN queries. The Onion-tree also significantly reduced the elapsed time in query processing. Compared with the MM-tree, the improvement ranged from 12% up to 39% in range query processing, and from 40% up to 70% in k-NN query processing. We also conducted experiments using two versions of the Onion-tree. The V-Onion-tree applied a variable expansion procedure policy, while the F-Onion-tree applied a fixed policy. In general, the F-Onion-tree slightly outperformed the V-Onion-tree. However, the use of different policies ensures flexibility to the Onion-tree. Moreover, the F-Onion-tree requires obtaining a proper value for the (fixed) number of expansions. Therefore, the fixed policy is indicated in situations where data are well known. On the other hand, the variable
policy is well-suited for different types of data distribution, since it automatically identifies the need for expansions and performs as expected. We are currently extending the Onion-tree to investigate different replacement policies. In order to complement our investigation into datasets that exploit different properties that affect the general behavior of MAMs, we are also planning to run new experiments using different metrics and other policies.

Acknowledgments. This work has been supported by the following Brazilian research agencies: FAPESP, CNPq, CAPES, INEP and FINEP. The third author also thanks the Web-PIDE Project (Observatory of the Education).
References

1. Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003)
2. Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)
3. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. J. of Artificial Intelligence Research 6, 1–34 (1997)
4. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: VLDB, pp. 426–435 (1997)
5. Traina Jr., C., Traina, A.J.M., Faloutsos, C., Seeger, B.: Fast indexing and visualization of metric data sets using slim-trees. IEEE TKDE 14(2), 244–260 (2002)
6. Vieira, M.R., Traina Jr., C., Chino, F.J.T., Traina, A.J.M.: DBM-tree: A dynamic metric access method sensitive to local density data. In: SBBD, pp. 163–177 (2004)
7. Skopal, T., Pokorný, J., Snášel, V.: PM-tree: Pivoting metric tree for similarity search in multimedia databases. In: ADBIS (Local Proceedings) (2004)
8. Traina Jr., C., Santos Filho, R.F., Traina, A.J.M., Vieira, M.R., Faloutsos, C.: The omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. VLDB J. 16(4), 483–505 (2007)
9. Uhlmann, J.K.: Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40(4), 175–179 (1991)
10. Brin, S.: Near neighbor search in large metric spaces. In: VLDB, pp. 574–584 (1995)
11. Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp. 311–321 (1993)
12. Pola, I.R.V., Traina Jr., C., Traina, A.J.M.: The MM-tree: A memory-based metric tree without overlap between nodes. In: Ioannidis, Y., Novikov, B., Rachev, B. (eds.) ADBIS 2007. LNCS, vol. 4690, pp. 157–171. Springer, Heidelberg (2007)
13. Burkhard, W.A., Keller, R.M.: Some approaches to best-match file searching. Communications of the ACM 16(4), 230–236 (1973)
Cost-Based Vectorization of Instance-Based Integration Processes Matthias Boehm1 , Dirk Habich2 , Steffen Preissler2 , Wolfgang Lehner2 , and Uwe Wloka1 1
Dresden University of Applied Sciences, Database Group {mboehm,wloka}@informatik.htw-dresden.de 2 Dresden University of Technology, Database Technology Group {dirk.habich,steffen.preissler,wolfgang.lehner}@tu-dresden.de
Abstract. The inefficiency of integration processes—as an abstraction of workflow-based integration tasks—is often reasoned by low resource utilization and significant waiting times for external systems. With the aim to overcome these problems, we proposed the concept of process vectorization. There, instance-based integration processes are transparently executed with the pipes-and-filters execution model. Here, the term vectorization is used in the sense of processing a sequence (vector) of messages by one standing process. Although it has been shown that process vectorization achieves a significant throughput improvement, this concept has two major drawbacks. First, the theoretical performance of a vectorized integration process mainly depends on the performance of the most cost-intensive operator. Second, the practical performance strongly depends on the number of available threads. In this paper, we present an advanced optimization approach that addresses the mentioned problems. Therefore, we generalize the vectorization problem and explain how to vectorize process plans in a cost-based manner. Due to the exponential complexity, we provide a heuristic computation approach and formally analyze its optimality. In conclusion of our evaluation, the message throughput can be significantly increased compared to both the instance-based execution as well as the rule-based process vectorization. Keywords: Cost-Based Vectorization, Integration Processes, Throughput Optimization, Pipes and Filters, Instance-Based.
1 Introduction Integration processes—as an abstraction of workflow-based integration tasks—are typically executed with the instance-based execution model [1]. Here, each incoming message conceptually initiates a new instance of the related integration process. Therefore, all messages are serialized according to their incoming order. This order is then used to execute single-threaded process plans. Example system categories for that execution model are EAI (enterprise application integration) servers, WfMS (workflow management systems) and WSMS (Web service management systems). Workflow-based integration platforms usually do not reach high resource utilization because of (1) the existence of single-threaded process instances in parallel processor architectures, (2) J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 253–269, 2009. c Springer-Verlag Berlin Heidelberg 2009
significant waiting times for external systems, and (3) IO bottlenecks (message persistence for recovery processing). Hence, the message throughput is not optimal and can be significantly optimized using a higher degree of parallelism. Other system types use the so-called pipes-and-filters execution model, where each operator is conceptually executed as a single thread and each edge between two operators contains a message queue. In order to overcome the problem of low resource utilization, in [2], we introduced the vectorization of instance-based integration processes. This approach describes the transparent rewriting of integration processes from the instance-based execution model to the pipes-and-filters execution model. In that context, different problems such as the assurance of serialized execution and different data flow semantics were solved. We use the term vectorization in analogy to the area of computational engineering because in the pipes-and-filters model, each operator executes a sequence (vector) of messages. Although full vectorization can significantly increase the resource utilization and hence improve the message throughput, the two following major shortcomings exist: Problem 1. Work-Cycle Domination: The work-cycle of a whole data-flow graph is dominated by the work-cycle of the most cost-intensive operator because all queues after this operator are empty while operators in front reach the maximum constraint of the queues. Hence, the theoretical performance mainly depends on that operator. Problem 2. Overload Resource Utilization: The practical performance strongly depends on the number of available threads. For full vectorization, the number of required threads is determined by the number of operators. Hence, if the number of required threads is higher than the number of available threads, performance will degenerate. In order to overcome those two drawbacks, in this paper, we propose the cost-based vectorization of integration processes. The core idea is to group the m operators of a process plan into k execution groups, then execute not each operator but each execution group with a single thread and hence, reduce the number of required threads. This approach is a generalization of the specific cases of instance-based (k = 1) and vectorized (k = m) integration processes. Therefore, here, we make the following contributions that also reflect the structure of the paper: – First, in Section 2, we revisit the vectorization of integration processes, explain requirements and problems, and define the integration process vectorization problem. – Subsequently, in Section 3, we introduce the sophisticated cost-based optimization approach. This approach overcomes the problem of possible inefficiency by applying the simple rule-based rewriting techniques. – Based on the details in Sections 2 and 3, we provide conceptual implementation details and the results of our exhaustive experimental evaluation in Section 4. Finally, we analyze the related work from different perspectives in Section 5 and conclude the paper in Section 6.
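To make the idea of execution groups concrete before revisiting the formal problem, the following self-contained C++ sketch connects two execution groups (covering six placeholder operators) by message queues and runs each group in one thread; the Message type, the operators and the queue implementation are placeholders chosen for this illustration and do not correspond to the engine described in this paper.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder payload and operator signature for the sketch.
using Message = int;
using Operator = std::function<Message(Message)>;

// Minimal blocking message queue between neighbouring execution groups.
class MessageQueue {
public:
    void push(Message m) {
        { std::lock_guard<std::mutex> lock(mtx_); q_.push(m); }
        cv_.notify_one();
    }
    Message pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Message m = q_.front(); q_.pop();
        return m;
    }
private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<Message> q_;
};

// One execution group: applies its operators in sequence to every message
// taken from the inbound queue and forwards the result to the outbound queue.
void runGroup(const std::vector<Operator>& ops, MessageQueue& in, MessageQueue& out,
              int messages) {
    for (int i = 0; i < messages; ++i) {
        Message m = in.pop();
        for (const Operator& op : ops) m = op(m);
        out.push(m);
    }
}

int main() {
    const int n = 100;  // number of messages (process plan instances)
    // Six placeholder operators grouped into k = 2 execution groups.
    std::vector<std::vector<Operator>> groups = {
        { [](Message m) { return m + 1; }, [](Message m) { return m * 2; },
          [](Message m) { return m - 3; } },
        { [](Message m) { return m * m; }, [](Message m) { return m + 7; },
          [](Message m) { return m / 2; } } };

    std::vector<MessageQueue> queues(groups.size() + 1);  // inter-group queues
    std::vector<std::thread> workers;
    for (std::size_t g = 0; g < groups.size(); ++g)
        workers.emplace_back(runGroup, std::cref(groups[g]),
                             std::ref(queues[g]), std::ref(queues[g + 1]), n);

    for (int i = 0; i < n; ++i) queues.front().push(i);   // inbound adapter
    for (int i = 0; i < n; ++i) queues.back().pop();      // outbound adapter
    for (std::thread& t : workers) t.join();
    return 0;
}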
2 Process Plan Vectorization Revisited In this section, we recall the core concepts from the process plan vectorization approach. Therefore, we explain assumptions and requirements, define the vectorization problem and give an overview of the rule-based vectorization approach. 2.1 Process Vectorization Overview Figure 1 illustrates an integration platform architecture for instance-based integration processes. Here, the key characteristics are a set of inbound adapters (passive listeners), Operational Datastore (ODS) several message queues, a central process engine, and Fig. 1. Generalized Integration Platform Architecture a set of outbound adapters (active services). The message queues are used as logical serialization elements within the asynchronous execution model. However, the synchronous as well as the asynchronous execution of process plans is supported. Further, the process engine is instance-based, which means that for each message, a new instance (one thread) of the specified process plan is created and executed. In contrast to traditional query optimization, in the area of integration processes, the throughput maximization is much more important than the optimization of the execution time of single process plan instances. Due to the requirement of logical serialization of messages, those process plan instances cannot be executed in a multi-threaded way. As presented in the SIR transaction model [3], we must make sure that messages do not outrun other messages; for this purpose, we use logical serialization concepts such as message queues. The requirement of serialized execution of process plan instances is not always necessary. We can weaken this requirement to serialized external behavior of process plan instances, which allows us to apply a more fine-grained serialization concept. Finally, also the transactional behavior must be ensured using compensation- or recovery-based transaction models. Based on the mentioned assumptions and requirements, we now formally define the integration process vectorization problem. Figure 2(a) illustrates the temporal aspects of a typical instance-based integration process. Semantically, in this example, a message is received from the inbound message queue (Receive), then a schema mapping (Translation) is processed and finally, the message is sent to an external system (Invoke). In this case, different instances of this process plan are serialized in incoming order. Such an instance-based process plan is the input of our vectorization problem. In contrast to this, Figure 2(c) shows the temporal aspects of a vectorized integration process. Here, only the external behavior (according to the start time T0 and the end time T1 of instances) must be serialized. Such a vectorized process plan is the output of the vectorization problem. This general problem is defined as follows. External System
Inbound Adapter 1
External System
...
External System
Inbound Adapter n
Outbound Adapter 1
External System
...
External System
...
External System
Outbound Adapter k
External System
Process Engine
Scheduler
256
p1
M. Boehm et al.
Receive
Translation
Message Queue
P => p1, p2, … pn
Invoke Receive
p2
Translation
time t T0(p1)
Process plan instance P1 Receive
Invoke
T1(p1) T0(p2)
T1(p2)
(a) Instance-Based Process Plan P Receive
p2 T0(p1)
Translation
Invoke
Receive
Translation
T0(p2)
T1(p1)
Invoke msg2
Standing process plan P’
P => p1, p2, … pn
Receive improvement due to vectorization
Invoke
Outbound Adapter 1
(b) Instance-Based Execution of P Message Queue
p1
Translation
Process context msg1 ctx_P1
time t
(c) Fully Vectorized Process Plan P
Invoke
Outbound Adapter 1
inter-bucket message queue execution bucket bi (thread)
T1(p2)
Translation
(d) Fully Vectorized Execution of P
Fig. 2. Overview of Vectorization of Integration Processes
Definition 1. Integration Process Vectorization Problem (IPVP): Let P denote a process plan and pi with pi = (p1 , p2 , . . . , pn ) denotes the implied process plan instances with P ⇒ pi . Further, let each process plan P comprise a graph of operators oi = (o1 , o2 , . . . , om ). For serialization purposes, the process plan instances are executed in sequence with T1 (pi ) ≤ T0 (pi+1 ). Then, the integration process vectorization problem describes the search for the derived process plan P that exhibits the highest degree of parallelism for the process plan instances pi such that the constraint conditions (T1 (pi , oi ) ≤ T0 (pi , oi+1 )) ∧ (T1 (pi , oi ) ≤ T0 (pi+1 , oi )) hold and the semantic correctness is ensured. Based on the IPVP, we now recall the static cost analysis, where in general, cost denotes the execution time. If we assume an operator sequence o with constant operator costs C(oi ) = 1, clearly, the following costs exist C(P ) = n · m C(P ) = n + m − 1
// instance-based // fully vectorized
Δ(C(P ) − C(P )) = (n − 1) · (m − 1) where n denotes the number of process plan instances and m denotes the number of operators. Clearly, this is an idealized model only used for illustration purposes. In practice, the improvement depends on the most time-consuming operator ok with C(ok ) = maxm i=1 C(oi ) of a vectorized process plan P , where the costs can be specified as follows: m C(P ) = n · C(oi ) i=1
C(P ) = (n + m − 1) · C(ok ) m Δ(C(P ) − C(P )) = n · C(oi ) − (n + m − 1) · C(ok ) . i=1∧i=k
Obviously, the performance improvement can even be negative in case of a very small number of process plan instances n. However, over time—and hence, with an increasing n—the performance improvement grows linearly.
Cost-Based Vectorization of Instance-Based Integration Processes
257
The general idea is to rewrite the instance-based process plan—where each instance is executed as a thread—to a vectorized process plan, where each operator is executed as a single execution bucket and hence, as a single thread. Thus, we model a standing process plan. Due to different execution times of the single operators, inter-bucket queues (with max constraints1 ) are required for each data flow edge. Figures 2(b) and 2(d) illustrate those two different execution models. As already shown, this offers high optimization potential but this exclusively addresses the process engine, while all other components can be reused without changes. However, at the same time, major challenges have to be solved when transforming P into P . 2.2 Message Model and Process Model As a precondition for vectorization, our formal foundation—the Message Transformation Model (MTM) [4]—was extended in order to make it applicable also in the context of vectorized integration processes (then we refer to it as VMTM). Basically, the MTM consists of a message model and an instance-based process model. We model a message m of a message type M as a quadruple with m = (M, S, A, D), where M denotes the message type, S denotes the runtime state, and A denotes a map of atomic name-value attribute pairs with ai = (n, v). Further, D denotes a map of message parts, where a single message part is defined with di = (n, t). Here, n denotes the part name and t denotes a tree of named data elements. In the VMTM, we extend it to a quintuple with m = (M, C, S, A, D), where the context information C denotes an additional map of atomic name-value attribute pairs with ci = (n, v). This extension is necessary due to parallel message execution within one process plan. A process plan P is defined with P = (o, c, s) as a 3-tuple representation of a directed graph. Let o with o = (o1 , . . . , om ) denote a sequence of operators, let c denote the context of P as a set of message variables msgi , and let s denote a set of services s = (s1 , . . . , sl ). Then, an instance pi of a process plan P , with P ⇒ pi , executes the sequence of operators once. Each operator oi has a specific type as well as an identifier N ID (unique within the process plan) and is either of an atomic or of a complex type. Complex operators recursively contain sequences of operators with oi = (oi,1 , . . . , oi,m ). Further, an operator can have multiple input variables msgi ∈ c, but only one output variable msgj ∈ c. Each service si contains a type, a configuration and a set of operations. Further, we define a set of interaction-oriented operators iop (Invoke, Receive and Reply), control-flow-oriented operators cop (Switch, Fork, Iteration, Delay and Signal) and data-flow-oriented operators dop (Assign, Translation, Selection, Projection, Join, Setoperation, Split, Orderby, Groupby, Window, Validate, Savepoint and Action). Furthermore, in the VMTM, the flow relations between operators oi do not specify the control flow but the explicit data flow in the form of message streams. Additionally, the Fork operator is removed because in the vectorized case, operators are modelinherently executed in parallel. Finally, we introduce the additional operators AND and XOR (for synchronization) as well as the COPY operator (for data flow splits). 1
Due to different execution times of single operators, queues in front of cost-intensive operators include larger numbers of messages. In order to overcome the problem of high memory requirements, we constrained the maximal number of messages per queue.
258
M. Boehm et al.
2.3 Rewriting Algorithm Basically, we distinguish between unary (one input) and binary (multiple input) operators. Both unary and binary operators of an instance-based process plan can be rewritten with the same core concept (see [2] for the Algorithm) that contains the following four steps. First, we create a queue instance for each data dependency between two operators (the output message of operator oi is the input message of operator oi+1 ). Second, we create an execution bucket for each operator. Third, we connect each operator with the referenced input queue. Clearly, each queue is referenced by exactly one operator, but each operator can reference multiple queues. Fourth, we connect each operator with the referenced output queues. If one operator must be connected to n output queues with n ≥ 2 (its results are used by multiple following operators), we insert a Copy operator after this operator. This Copy operator simply gets a message from one input queue, then copies it n − 1 times and puts those messages into the n output queues. Although this rewriting algorithm is only executed once for all process instances, it is important to notice the cubic complexity with O(m3 ) = O(m3 +m2 ), according to the number of operators m. This complexity is dominated by dependency checking when connecting operators and queues. Based on the standard rewriting concept, specific rewriting rules for context-specific operators (e.g., Switch) and for serialization and recoverability are required. Those rules and the related cost analysis are given in [2].
3 Cost-Based Vectorization During rule-based process plan vectorization, an instance-based process plan (one execution bucket for all operators) is completely vectorized (one execution bucket for each operator). This solves the integration process vectorization problem. However, the two major weaknesses of this approach are (1) that the theoretical performance of a vectorized integration process mainly depends on the performance of the most cost-intensive operator and (2) that the practical performance also strongly depends on the number of available threads (and hence, on the number of operators). Thus, the optimality of process plan vectorization strongly depends on dynamic workload characteristics. For instance, the full process plan vectorization can also hurt performance due to additional thread management if the instance-based process plan has already caused a 100-percent resource consumption. In conclusion, we extend our approach and introduce a more generalized problem description and an approach for the cost-based vectorization of process plans. Obviously, the instance-based process plan and the fully vectorized process plan are specific cases of this more general solution. 3.1 Problem Generalization The input (instance-based process plan) and the output (vectorized process plan) of the IPVP are extreme cases. In order to introduce awareness of dynamically changing workload characteristics, we generalize the IPVP to the Cost-Based IPVP as follows: Definition 2. Cost-Based Integration Process Vectorization Problem (CBIPVP): Let P denote a process plan and pi with pi = (p1 , p2 , . . . , pn ) denotes the implied process plan instances with P ⇒ pi . Further, let each process plan P comprise a graph
Cost-Based Vectorization of Instance-Based Integration Processes
259
Table 1. Example Operator Distribution
1 2 3 4 5 6 7 8
k=1 k=2 k=3 k=4
b1 o1 , o2 , o3 , o4 o1 o1 , o2 o1 , o2 , o3 o1 o1 o1 , o2 o1
b2 o2 , o3 , o4 o3 , o4 o4 o2 o2 , o3 o3 o2
b3 o3 , o4 o4 o4 o3
b4 o4
of operators oi = (o1 , o2 , . . . , om ). For serialization purposes, the process plan instances are executed in sequence with T1 (pi ) ≤ T0 (pi+1 ). The CBIPVP describes the search for the derived cost-optimal (minimal execution time of a message sequence) process plan P with k ∈ N execution buckets bi = (b1 , b2 , . . . , bk ), where each bucket contains l operators oi = (o1 , o2 , . . . , ol ). Here, the constraint conditions (T1 (pi , bi ) ≤ T0 (pi , bi+1 )) ∧ (T1 (pi , bi ) ≤ T0 (pi+1 , bi )) and (T1 (bi , oi ) ≤ T0 (bi , oi+1 )) ∧ (T1 (bi , oi ) ≤ T0 (pi+1 , bi )) must hold. We define that (lbi ≥ 1) ∧ (lbi ≤ |bi | m) and i=1 lbi = m and that each operator oi is assigned to exactly one bucket bi . If we reconsider the IPVP, on the one hand, an instance-based process plan P is a specific case of the cost-based vectorized process plan P , with k = 1 execution bucket. On the other hand, also the fully vectorized process plan P is a specific case of the costbased vectorized process plan P , with k = m execution buckets, where m denotes the number of operators oi . The following example illustrates that problem. Example 1. Operator distribution across buckets: Assume a simple process plan P with a sequence of four operators (m = 4). Table 1 shows the possible process plans for the different numbers of buckets k. We can distinguish eight different (24−1 = 8) process plans. Here, plan 1 is the special case of an instance-based process plan and plan 8 is the special case of a fully vectorized process plan. Theorem 1. The cost-based integration process vectorization problem exhibits an exponential complexity of O(2m ). Proof. The distribution function D of the number of possible plans over k is a symmetric distribution function according to Pascal’s Triangle, where the condition lbi = lbk−i+1 with i ≤ m 2 does hold. Based on Definition 1, a process plan contains m operators. Due to Definition 2, we search for k execution buckets bi with lbi ≥ 1 ∧ lbi ≤ m |bi | and i=1 lbi = m. Hence, m (k = 1, ..., k = m) different numbers of buckets have to be evaluated. From now on, we fix m as m = m − 1 and k as k = k − 1. In fact, there is only one possible plan for k = 1 (all operators in one bucket) and k = m (each operator in a different bucket), respectively. m m |P |k =0 = = 1 and |P |k =m = =1. 0 m Now, fix a specific m and k. Then, the number of possible plans is computed with
260
M. Boehm et al.
k m + 1 − i m m −1 m −1 |P |k = = + = . k k − 1 k i i=1
In order to compute the number of possible plans, we have to sum up the possible plans for each k, with 1 ≤ k ≤ m: m m |P | = with k = k − 1 and m = m − 1 . k k =0
n n Finally, k=0 is known to be equal to 2n . Hence, by changing the index k from k k = 0 to k = 1 we can write: m m m m−1 |P | = = = 2(m−1) . k k−1 k =0
k=1
In conclusion, there are 2(m−1) possible process plans that must be evaluated. Due to the linear complexity of O(m) for determining the costs of a plan, the cost-based integration process vectorization problem exhibits an exponential overall complexity with O(2m ) = O(m · 2(m−1) ). Note that we have a recursive algorithm because we need to include complex operators as well. For understandability, we simplified this to a sequence of atomic operators. 3.2 Heuristic Approach Due to the exponential complexity of the CBIPVP, a search space reduction approach for determining the cost-optimal solution for the CBIPVP is strongly needed. Here, we present a heuristic-based algorithm that solves the CBIPVP with linear complexity of O(m). The core heuristic of our approach is illustrated in Figure 3. Basically, we set k = m, where each operator is executed in a single execution bucket. Then, we merge those execution buckets in a cost-based fashion. Typically, the improvements achieved by vectorization mainly depend on the most time-consuming operator ok with C(P, ok ) = maxm i=1 C(P, oi ) of a process plan P . The reason is that the costs of a vectorized process plan are computed with C(P ) = (n+m−1)·C(ok ), and hence, the work cycle of the vectorized process plan is given by C(ok ). Thus, the time period of the start of two subsequent process plan instances is given by T0 (pi+1 ) − T0 (pi ) = C(ok ). It would be more efficient to leverage the queue waiting times and merge execution buckets. Hence, we use this heuristic to solve the constrained problem. Definition 3. Constrained CBIPVP: According to the CBIPVP, find the minimal number of buckets k and an assignment of operators oi with i = 1, .., m to those execution buckets bi with i = 1, ..k such that ∀bi : C(bi ) ≤ maxm i=1 C(oi ) + λ, and make sure that the assignment preserves the order with respect to the operator sequence o. Here, λ is a user-defined parameter (in terms of execution time) to control the cost constraint.
Cost-Based Vectorization of Instance-Based Integration Processes p1
o1
o2
o3
261
o4
p2
o1
o2
o3
o4
p3
o1
T0(p1)
T1(p1) T0(p2)
o2
o3
o4 time t T1(p3)
T1(p2) T0(p3)
(a) Instance-Based Process Plan P p1
o1
o2
p2
o3
o1
o4
o2
p3
o3
o1
T0(p1)
T0(p2)
o4
o2
T0(p3)
o3
T1(p1)
T1(p2)
o4
(b) Fully Vectorized Process Plan P p1
o1
p2
o2
o3 o1
p3 T0(p1)
o4 o2
o3 o1
T0(p2)
time t
T1(p3)
T0(p3) T1(p1)
o4 o2
o3 T1(p2)
o4 time t
T1(p3)
(c) Cost-Based Vectorized Process Plan P
Fig. 3. Work Cycle Domination by Operator o3
Figure 4 illustrates the influence of λ and the core idea of solving the constrained problem. Each operator oi has assigned costs C(oi ). In our example, it is maxm i=1 C(oi ) = C(o3 ) = 5. The Constrained CBIPVP describes the search for the minimal number of execution buckets, where the cumulative costs of each bucket must not be larger than the determined maximum plus a user-defined λ. Hence, in our example, we search for the k buckets, where the cumulative costs of each bucket are, at most,
Algorithm 1. Cost-Based Bucket Determination Require: operator sequence o 1: A ← , B ← , k ← 0 2: max ← maxm i=1 C(P , oi ) + λ 3: for i = 1 to |o| do 4: if oi ∈ A then 5: continue 3 6: end if 7: k ←k+1 8: bk (oi ) ← create bucket over oi 9: for j = i + 1 to |o| do k| 10: if |b c=1 C(oc ) + C(oj ) ≤ max then 11: bk ← add oj to bk 12: A ← A ∪ oj 13: else 14: break 9 15: end if 16: end for 17: B ← B ∪ bk 18: end for 19: return B
// foreach operator
// foreach following operator
262
M. Boehm et al.
equal to five. If we increase λ, we can reduce the number of buckets by increasing the allowed maximum and hence, the work cycle of the vectorized process plan. given operator Algorithm 2 illustrates the o o o o o o sequence o C(o )=1 C(o )=4 C(o )=5 C(o )=2 C(o )=3 C(o )=1 concept of the cost-based bucket determination algo- Ȝ=0 (max C(b )=5) o o o o o o k=4 rithm. Here, the operator 2 Ȝ=1 (max C(b )=6) o o o o o o sequence o is required. First, k=3 we initialize two sets A and Ȝ=2 (max C(b )=7) o o o o o o k=3 B as empty sets. After that, we compute the maximal Fig. 4. Bucket Merging with Different λ costs of a bucket max with max = maxm i=1 C(oi ) + λ. Then, there is the main loop over all operators. If the operator oi belongs to A (operators already assigned to buckets), we can proceed with the next operator. Otherwise, we create a new bucket bk and increment k (number of buckets) accordingly. After that, we execute the inner loop in order to assign operators |bk | to this bucket such that the constraint c=1 C(oc ) ≤ max holds. This is done by adding oj to bk and to A. Here, we can ensure that each created bucket has at least one operator assigned. Finally, each new bucket bk is added to the set of buckets B. 1
1
i
i
i
2
2
3
3
4
4
5
5
6
6
1
2
3
4
5
6
1
2
3
4
5
6
1
2
3
4
5
6
Theorem 2. The cost-based bucket determination algorithm solves the constrained cost-based integration process vectorization problem with linear complexity of O(m). Proof. Assume a process plan that comprises a sequence of m operators. First, the maximum of a value list (line 2) is known to be of complexity O(m) (m operator evaluations). Second, we can see that the bucket number is at least 1 (all operators assigned to one bucket) and at most m (each operator assigned to exactly one bucket). Third, in the case of k = 1 there are at most 2m − 1 possible operator evaluations. Also, in the case of k = m there are at most 2m − 1 possible operator evaluations. If we assume that the operations ∈ and ∪—in our case—exhibit constant time complexity of O(1), we now can conclude that the cost-based bucket determination algorithm exhibits a linear complexity with O(m) = O(3m − 1) = O(m) + O(2m − 1). 3.3 Optimality Analysis As already mentioned, the optimality of the vectorized process plan depends on (1) the costs of the single operators, (2) the resource consumption of each operator and (3) the available hardware resources (possible parallelism). However, the cost-based bucket determination algorithm only takes the costs from (1) into consideration. Nevertheless, we show that optimality guarantees can be given using this heuristic approach. The algorithm can be parameterized with respect to the hardware resources (3). If m we want to force a single-threaded execution, we simply set λ to λ ≥ i=1 C(oi ) − maxm i=1 C(oi ). If we want to force the highest meaningful degree of parallelism (this is not necessarily a full vectorization), we simply set λ = 0. 2
Note that each process plan is a sequence of atomic and complex operators. Due to those complex operators, the cost-based bucket determination algorithm is a recursive algorithm. However, we transformed it to a linear one in order to show the core concept.
Cost-Based Vectorization of Instance-Based Integration Processes
263
Now, assuming the given λ configuration, the question is, which optimality guarantee can we give for the solution of the cost-based bucket determination algorithm. For this purpose, Re (oi ) denotes the empirical resource consumption (measured with a specific configuration) of an operator oi with 0 ≤ Re (oi ) ≤ 1, and Ro (oi ) denotes the maximal resource consumption of an operator oi with 0 ≤ Ro (oi ) ≤ 1. Here, Ro (oi ) = 1 means that the operator moi exhibits an average resource consumption of 100 percent. In fact, the condition i=1 Re (oi ) ≤ 1 must hold. Obviously, for an instance-based plan P , we can write Re (oi ) = Ro (oi ) because all operators are executed in sequence. When we vectorize P to a fully vectorized plan 1 P , with a maximum of Re (oi ) = m , we have to compute the costs with C(oi ) = Ro (oi ) Re (o ) · C(oi ). When we merge two execution buckets bi and bi+1 during cost-based i
bucket determination, we compute the effective resource consumption Re (bi ) = C(bi )·Ro (bi )+C(bi+1 )·Ro (bi+1 ) C(bi )+C(bi+1 )
1 |b|−1 ,
the maximal resource consumption Ro (bi ) = , and the cost ⎧ ) ⎨ Re (bi ) · C(b ) + Re (bi+1 · C(bi+1 ) Re (bi ) ≤ Ro (bi ) i Re (bi ) Re (b i+1 ) C(bi ) = R (b ) . ) ⎩ o i · C(b ) + Ro (bi+1 · C(b ) otherwise i i+1 Ro (b ) Ro (b ) i
i+1
Obviously, we made the assumption that each execution bucket gets the same maximal resource consumption Re (bi ) and that resources are not exchanged between those buckets. We do not take the temporal overlap into consideration. However, we can give the following optimality guarantee. Theorem 3. The cost-based bucket determination algorithm solves the cost-based integration process vectorization problem with an optimality guarantee of (C(P ) ≤ C(P )) ∧ (C(P ) ≤ C(P )) under the restriction of λ = 0. Proof. As a precondition it is important to notice that the cost-based bucket determination algorithm cannot result in a plan with k = 1 (although this is a special case of the k| CBIPVP) due to the maximum rule of |b c=1 C(oc ) + C(oj ) ≤ max (Algorithm 2, line 10). Hence, in order to prove the theorem, we only need to prove the two single claims of C(P ) ≤ C(P ) and C(P ) ≤ C(P ). For the proof of C(P ) ≤ C(P ), assume the worst case where ∀oi ∈ Ro (oi ) = 1. R (b ) If we vectorize this to P , we need to compute the costs by C(bi ) = Roe (bi ) · C(oi ) i
1 m with Re (b i ) = m . Due to the vectorized execution, C(P ) = maxi=1 C(bi ), while m C(P ) = i=1 C(oi ). Hence, we can write C(P ) = C(P ) if the condition ∀oi ∈ Ro (oi ) = 1 holds. This is the worst case. For each Ro (oi ) < 1, we get C(P ) < C(P ). In order to prove C(P ) ≤ C(P ), we set λ = 0. If we merge two buckets bi 1 1 and bi+1 , we see that Re (bi ) is increased from |b| to |b|−1 . Thus, we re-compute the costs C(bi ) as mentioned before. In the worst case, C(bi ) = C(bi ), which is true iff Re (bi ) = Ro (bi ) because then we also have Re (bi ) = Re (bi ). Due to C(P ) = maxm i=1 C(bi ), we can state C(P ) ≤ C(P ). Hence, the theorem holds.
In conclusion, we cannot guarantee that the result of the cost-based bucket determination algorithm is the global optimum because we cannot evaluate the effective resource
264
M. Boehm et al.
consumption in an efficient way. However, we can guarantee that each merging of execution buckets when solving the CBIPVP improves the performance of the process plan P . Hence, we follow a best-effort optimization approach. 3.4 Dynamic Process Plan Rewriting Due to the cost-based bucket determination approach, the dynamic process plan rewriting is required. The major problem when rewriting a vectorized process plan during runtime is posed by loaded queues. The used queues can be stopped using the stopped flag. If we—for example—want to merge two execution buckets bi and bi+1 , we need to stop the queue qi that is represented by the edge between bi−1 and bi . Then, we wait until the queue qi+1 (the queue just between bi and bi+1 ) contains 0 messages. Now, we can merge the execution buckets to bi and simply remove qi+1 . This concept can be used for bucket merging and splitting, respectively. Finally, the rewriting algorithm is triggered only if a different plan than the current one has been determined by the costbased bucket determination algorithm. In fact, we need to compare two plans P1 and P2 by graph matching. However, this is known to be of linear complexity with O(m).
4 Experimental Evaluation In this section, we provide selected experimental results. In general, we can state that the vectorization of integration processes leads to a significant performance improvement. In fact, we can show that Theorem 4 (C(P ) ≤ C(P ) ∧ C(P ) ≤ C(P ) under the restriction of λ = 0) does also hold during experimental performance evaluation. 4.1 Experimental Setup We implemented the presented approach within our so-called WFPE (workflow process engine) using Java 1.6 as the programming language. Here, we give a brief overview of the WFPE and discuss some implementation details. In general, the WFPE uses compiled process plans (a java class is generated for each integration process type). Furthermore, it follows an instance-based execution model. Now, we integrated components for the static vectorization of integration processes (we call this VWFPE) and for the costbased vectorization (we call this CBVWFPE). For that, new deployment functionalities were introduced (those processes are executed in an interpreted fashion) and several changes in the runtime environment were required. Finally, all three different runtime approaches can be used alternatively. We ran our experiments on a standard blade (OS Suse Linux) with two processors (each of them a Dual Core AMD Opteron Processor 270 at 1,994 MHz) and 8.9 GB RAM. Further, we executed all experiments on synthetically generated XML data (using our DIPBench toolsuite [5]) because the data distribution of real data sets has only minor influence on the performance of the integration processes used here. However, there are several aspects that influence the performance improvement of the vectorization and hence, these should be analyzed. In general, we used the following five aspects as scale factors for all three execution approaches: data size d of a message, the number
Cost-Based Vectorization of Instance-Based Integration Processes
265
of nodes m of a process plan, the time interval t between two messages, the number of process instances n and the maximal number of messages q in a queue. Here, we measured the performance of different combinations of those. For statistical correctness, we repeated all experiments 20 times and computed the arithmetic mean. As base integration process for our experiments, we modeled a simple sequence of six operators. Here, a message is received (Receive) and then an archive writing is prepared (Assign) and executed with the file adapter (Invoke). After that, the resulting message (contains Orders and Orderlines) is translated using an XML transformation (Translation) and finally sent to a specific directory (Assign, Invoke). We refer to this as m = 5 because the Receive is removed during vectorization. When scaling m up to m = 35, we simply copy the last five operators and reconfigure them. 4.2 Performance and Throughput Here, we ran a series of five experiments according to the already introduced influencing aspects. The results of these experiments are shown in Figure 5. Basically, the five experiments correlate to the mentioned scale factors. In Figure 5(a), we scaled the data size d of the XML input messages from 100 kb to 700 kb and measured the processing time for 250 process instances (n = 250) needed by the three different runtimes. There, we fix m = 5, t = 0, n = 250 and q = 50. We can observe that all three runtimes exhibit a linear scaling according to the data size and that significant improvements can be reached using vectorization. There, the absolute improvement increases with increasing data size. Further, in Figure 5(d), we illustrated the variance of this sub-experiment. The variance of the instance-based execution is minimal, while the variances of both vectorized runtimes are worse because of the operator scheduling. Note that the cost-based vectorization exhibits a significantly lower variance than in the fully vectorized case because of a lower number of threads. Now, we fix d = 100 (lowest improvement in 5(a)), t = 0, n = 250 and q = 50 in order to investigate the influence of m. We vary m from 5 to 35 nodes as already mentioned for the experimental setup. Interestingly, not only the absolute but also the relative improvement of vectorization increases with increasing number of operators. In comparison to full vectorization, for cost-based vectorization, a constant relative improvement is observable. Figure 5(c) shows the impact of the time interval t between the initiation of two process instances. For that, we fix d = 100, m = 5, n = 250, q = 50 and vary t from 10 ms to 70 ms. There is almost no difference between the full vectorization and the cost-based vectorization. However, the absolute improvement between instance-based and vectorized approaches decreases slightly with increasing t. An explanation is that the time interval has no impact on the instance-based execution. In contrast to that, the vectorized approach depends on t because of resource scheduling. Further, we analyze the influence of the number of instances n as illustrated in Figure 5(e). Here, we fix d = 100, m = 5, t = 0, q = 50 and vary n from 100 to 700. Basically, we can observe that the relative improvement between instance-based and vectorized execution increases when increasing n, due to parallelism of process instances. However, it is interesting to note that the fully vectorized solution performs
266
M. Boehm et al.
(a) Scalability over d
(b) Scalability over m
(c) Scalability over t
(d) Variance over d
(e) Scalability over n
(f) Scalability over q
Fig. 5. Experimental Performance Evaluation Results
slightly better for small n. However, when increasing n, the cost-based vectorized approach performs optimal. Figure 5(f) illustrates the influence of the maximal queue size q, which we varied from 10 to 70. Here, we fix d = 100, m = 5, t = 0 and n = 250. In fact, q slightly affects the overall performance for a small number of concurrent instances n. However, at n = 250, we cannot observe any significant influence with regard to the performance. 4.3 Deployment and Maintenance The purpose of this experiment was to analyze the deployment overhead of three different runtimes. We measured the costs for the process plan vectorization (PPV) algorithm and the periodically invoked cost-based bucket determination (CBBD) algorithm. Figure 6 shows those results. Here, we varied the number of nodes m because all other scale factors do not influence the deployment and maintenance costs. In general, there is a huge performance Fig. 6. Vectorization Overhead Analysis improvement using vectorization with a factor of up to seven. It is caused by the different deployment approaches. The WFPE uses a compilation approach, where java classes are generated from the integration process specification. In contrast to this, the
Cost-Based Vectorization of Instance-Based Integration Processes
267
VWFPE as well as the CBVWFPE use interpretation approaches, where process plans are built dynamically with the PPV algorithm. In fact, VWFPE always outperforms CBVWFPE because both use the PPV algorithm but CBVWFPE additionally uses the CBBD algorithm in order to find the optimal k. Note that the additional costs for the CBBD algorithm (that cause a break-even point with the standard WFPE) occur periodically during runtime (period is a parameter). In conclusion, the vectorization of integration processes allows for better runtime as well as better deployment time performance. Hence, this approach can be used under all circumstances.
5 Related Work We review related work from the perspectives of computational engineering, database management systems, data stream management systems, streaming service and process execution, and integration process optimization. Computational Engineering. Based on Flynn’s classification, vectorization can be classified as SIMD (single instruction, multiple data) and in special cases as MIMD (multiple instruction, single data). Here, we use the term vectorization only as analogy. Database Management Systems. In the context of DBMS, throughput optimization has been addressed with different techniques. One significant approach is data sharing across common subexpressions of different queries or query instances [6,7,8,9]. However, in [10] it was shown that sharing can also hurt performance. Another inspiring approach is given by staged DBMS [11]. Here, in the QPipe Project [12,13], each relational operator was executed as a so-called micro-engine (one operator, many queries). Further, throughput optimization approaches were introduced also in the context of distributed query processing [14,15]. Data Stream Management Systems. Further, in the context of data stream management systems (DSMS) and ETL tools, the pipes-and-filters execution model is widely used. Examples for those systems are QStream [16], Demaq [17] and Borealis [18]. Surprisingly, the cost-based vectorization has not been used so far because the operator scheduling [19,20,21,22] in DSMS is not realized with multiple processes or threads but with central control strategies (assuming high costs for switching the process context). Furthermore, there is one interesting approach [23], where operators are distributed across a number of threads in a query-aware manner. However, this approach does not compute the cost-optimal distribution. Streaming Service and Process Execution. In service-oriented environments, throughput optimization has been addressed on different levels. Performance and resource issues, when processing large volumes of XML documents, lead to message chunking on service-invocation level. There, request documents are divided into chunks and services are called for every single chunk [24]. An automatic chunk-size computation using the extremum-control approach was addressed in [25]. On process level, pipeline scheduling was incorporated in [26] into a general workflow model to show the valuable benefit of pipelining in business processes. Further, [1] adds pipeline semantics to classic step-by-step workflows by extending available task states and utilizing a one-item queue between two consecutive tasks. None of those approaches deals with cost-based rewriting of instance-based processes to pipeline semantics.
268
M. Boehm et al.
Integration Process Optimization. Optimization of integration processes has not yet been explored sufficiently. There are platform-specific optimization approaches for the pipes-and-filters execution model, like the optimization of ETL processes [27], as well as numerous optimization approaches for instance-based processes like the optimization of data-intensive decision flows [28], the static optimization of the control flow, the use of critical path approaches [29] and SQL-supporting BPEL activities and their optimization [30]. Further, we investigated the optimization of message transformation processes [4] and the cost-based optimization of instance-based integration processes [31]. Finally, the rule-based vectorization approach—presented in [2]—was the foundation of our cost-optimal solution to the vectorization problem.
6 Conclusions In order to optimize the throughput of integration platforms, in this paper, we revisited the concept of automatic vectorization of integration processes. Due to the dependence on the dynamic workload characteristics, we introduced the cost-based process plan vectorization, where the costs of single operators are taken into account and operators are merged to execution buckets. Based on our experimental evaluation, we can state that significant throughput improvements are possible. In conclusion, the concept of process vectorization is applicable in many different application areas. Future work can address specific optimization techniques for the cost-based vectorization.
References 1. Biornstad, B., Pautasso, C., Alonso, G.: Control the flow: How to safely compose streaming services into business processes. In: IEEE SCC (2006) 2. Boehm, M., Habich, D., Lehner, W., Wloka, U.: Vectorizing instance-based integration processes. In: ICEIS (2009), http://wwwdb.inf.tu-dresden.de/team/archives/2007/04/ dipl wirtinf ma.php 3. Boehm, M., Habich, D., Lehner, W., Wloka, U.: An advanced transaction model for recovery processing of integration processes. In: ADBIS (2008) 4. Boehm, M., Habich, D., Wloka, U., Bittner, J., Lehner, W.: Towards self-optimization of message transformation processes. In: ADBIS (2007) 5. Boehm, M., Habich, D., Lehner, W., Wloka, U.: Dipbench toolsuite: A framework for benchmarking integration systems. In: ICDE (2008) 6. Dalvi, N.N., Sanghai, S.K., Roy, P., Sudarshan, S.: Pipelining in multi-query optimization. In: PODS (2001) 7. Hasan, W., Motwani, R.: Optimization algorithms for exploiting the parallelismcommunication tradeoff in pipelined parallelism. In: VLDB (1994) 8. Roy, P., Seshadri, S., Sudarshan, S., Bhobe, S.: Efficient and extensible algorithms for multi query optimization. In: SIGMOD (2000) 9. Wilschut, A.N., van Gils, S.A.: A model for pipelined query execution. In: MASCOTS (1993) 10. Johnson, R., Hardavellas, N., Pandis, I., Mancheril, N., Harizopoulos, S., Sabirli, K., Ailamaki, A., Falsafi, B.: To share or not to share? In: VLDB (2007) 11. Harizopoulos, S., Ailamaki, A.: A case for staged database systems. In: CIDR (2003)
Cost-Based Vectorization of Instance-Based Integration Processes
269
12. Gao, K., Harizopoulos, S., Pandis, I., Shkapenyuk, V., Ailamaki, A.: Simultaneous pipelining in qpipe: Exploiting work sharing opportunities across queries. In: ICDE (2006) 13. Harizopoulos, S., Shkapenyuk, V., Ailamaki, A.: Qpipe: A simultaneously pipelined relational query engine. In: SIGMOD (2005) 14. Ives, Z.G., Florescu, D., Friedman, M., Levy, A.Y., Weld, D.S.: An adaptive query execution system for data integration. In: SIGMOD (1999) 15. Lee, R., Zhou, M., Liao, H.: Request window: an approach to improve throughput of rdbmsbased data integration system. In: VLDB (2007) 16. Schmidt, S., Berthold, H., Lehner, W.: Qstream: Deterministic querying of data streams. In: VLDB (2004) 17. Boehm, A., Marth, E., Kanne, C.C.: The demaq system: declarative development of distributed applications. In: SIGMOD (2008) 18. Abadi, D.J., Ahmad, Y., Balazinska, M., C ¸ etintemel, U., Cherniack, M., Hwang, J.H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.B.: The design of the borealis stream processing engine. In: CIDR (2005) 19. Babcock, B., Babu, S., Datar, M., Motwani, R., Thomas, D.: Operator scheduling in data stream systems. VLDB J. 13(4) (2004) 20. Carney, D., C ¸ etintemel, U., Rasin, A., Zdonik, S.B., Cherniack, M., Stonebraker, M.: Operator scheduling in a data stream manager. In: VLDB (2003) 21. Koch, C., Scherzinger, S., Schweikardt, N., Stegmaier, B.: Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In: VLDB (2004) 22. Schmidt, S., Legler, T., Schaller, D., Lehner, W.: Real-time scheduling for data stream management systems. In: ECRTS (2005) 23. Cammert, M., Heinz, C., Kr¨amer, J., Seeger, B., Vaupel, S., Wolske, U.: Flexible multithreaded scheduling for continuous queries over data streams. In: ICDE Workshops (2007) 24. Srivastava, U., Munagala, K., Widom, J., Motwani, R.: Query optimization over web services. In: VLDB (2006) 25. Gounaris, A., Yfoulis, C., Sakellariou, R., Dikaiakos, M.D.: Robust runtime optimization of data transfer in queries over web services. In: ICDE (2008) 26. Lemos, M., Casanova, M.A., Furtado, A.L.: Process pipeline scheduling. J. Syst. Softw. 81(3) (2008) 27. Simitsis, A., Vassiliadis, P., Sellis, T.: Optimizing etl processes in data warehouses. In: ICDE (2005) 28. Hull, R., Llirbat, F., Kumar, B., Zhou, G., Dong, G., Su, J.: Optimization techniques for data-intensive decision flows. In: ICDE (2000) 29. Li, H., Zhan, D.: Workflow timed critical path optimization. Nature and Science 3(2) (2005) 30. Vrhovnik, M., Schwarz, H., Suhre, O., Mitschang, B., Markl, V., Maier, A., Kraft, T.: An approach to optimize data processing in business processes. In: VLDB (2007) 31. Boehm, M., Habich, D., Lehner, W., Wloka, U.: Workload-based optimization of integration processes. In: CIKM (2008)
Empowering Provenance in Data Integration* Haridimos Kondylakis, Martin Doerr, and Dimitris Plexousakis Information Systems Laboratory FORTH-ICS Computer Science Department, University of Crete {kondylak,martin,dp}@ics.forth.gr
Abstract. The provenance of data has recently been recognized as central to the trust one places in data. This paper presents a novel framework in order to empower provenance in a mediator based data integration system. We use a simple mapping language for mapping schema constructs, between an ontology and relational sources, capable to carry provenance information. This language extends the traditional data exchange setting by translating our mapping specifications into source-to-target tuple generating dependencies (s-t tgds). Then we define formally the provenance information we want to retrieve i.e. annotation, source and tuple provenance. We provide three algorithms to retrieve provenance information using information stored on the mappings and the sources. We show the feasibility of our solution and the advantages of our framework. Keywords: Data Integration, Provenance, Mappings.
1 Introduction Despite the continuous work on data integration systems, the results of the research have been slow to come to market. Buneman [1] suggests that this because three reasons: a) the complexity of the tools and principles created, b) the difficulty of the domain experts to understand the schemas, the terminologies etc. and c) that the narrow view of data integration, has made the problem more difficult and ignored several other important and related issues. The traditional view of database integration is that we have one or more source databases Si and one wants to issue queries on them as if the queries were issued in a new database T which represents the combined information in Si. The end user typically knows almost nothing about the source databases. While in some cases the answer speaks for itself, in other cases the user will not be confident in the answer, unless s/he can trust the reasons why such an answer has been produced and where the data has come from. Users want to know about the reliability or the accuracy of the data they see. Thus, a data integration system, to gain the trust of a user must be able, if required, to provide an explanation or justification for an answer i.e. to provide provenance information [2]. Since the answer is usually the result of a reasoning process, the justification can be given as well as a derivation of the conclusion with the sources of information for the various steps. *
This work was partially supported by the EU project plutIt (ICT-231430).
J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 270–285, 2009. © Springer-Verlag Berlin Heidelberg 2009
Empowering Provenance in Data Integration
271
According to [3] there are two basic views of provenance. The first one describes the provenance of data as the processes that lead to its creation and the other one focuses on the source data from which the data item is derived from. Buneman et al. moreover distinguish [4] between why- and where- provenance. While whyprovenance describes why a data item is in the database, where-provenance describes where a piece of data comes from. In this paper we present a framework that is capable of presenting both why and where explanations to users for the answers it computes in response to their queries. More specific: 1. We present shortly a mapping language for information integration (Section 2) that is practical yet powerful enough and that we can use to annotate source schemata. 2. We extend the well-studied relational-to-relational data exchange setting to ontologyto-relational taking advantage of the polynomial complexity for query answering. This is done using an algorithm for translating our ontology-to-relational mappings into source-to-target tuple generating dependencies (s-t tgds). 3. Then, we define formally (Section 3) the provenance information we are trying to capture. Specifically, we define tuple, annotation and source provenance and we present three algorithms with polynomial complexity to retrieve such information. To the best of our knowledge no other provenance framework is capable of presenting source, tuple and annotation provenance information. Our system does not require additional information to be stored on sources; neither changes the underlying engine, nor uses a specialized query language.
2 The BABEL Framework The problem of encoding information in a standardized way has been a challenge addressed in different ways by different communities. The database community has adopted entity-relationship models, inheritance and declarative views, whereas the knowledge representation community has been adopted ontologies and description logics.The largest quantity of existing data is stored using conventional relational database technology. Ontologies on the other side provide conceptual domain models, which are understandable to both human beings and machines as a shared conceptualization of a given specific domain [5]. Both approaches have a role today in encoding and sharing information and in fact the two worlds are increasingly interconnected. The key in bringing legacy data with formal semantic meaning has been recognized to be the inclusion of mediation between traditional database contents and ontologies [6]. Several attempts have been made in order to integrate these disparate formalisms such as mediators or RDBMS extensions in order to handle ontologies [7]. Our thesis in this paper is that the former approach is the most useful since we cannot usually change the RDBMS that handles the existing data. Following the terminology from [8], a data integration system is a triple (G, S, M), where G is the global schema, S is the source schema, and M is the set of assertions relating elements of the global schema with elements of the source schema. In many cases the term target schema is used instead of global schema and in this paper we
272
H. Kondylakis, M. Doerr, and D. Plexousakis
will adopt that terminology as well. Generally speaking, the goal of a data integration system is to provide a common interface to various data sources, so as to enable users to focus on specifying what they want. In this paper we use as example ontology the CIDOC CRM Conceptual Model which is a core ontology for information integration [9] mainly used for cultural heritage documentation. A part of the ontology is shown in Fig. 1 and describes persons and biological objects belonging to those persons. The source databases shown belong to two hospitals and are used to capture medical information. More specific, in the first hospital he have one table that stores information about patients and one table that stores samples of that patients. In the second database we have a table that has information about tumors participating in biological experiments. SOURCE INSTANCE I HOSPITAL 2 Tumors
HOSPITAL 1 Patient SSN s1: 1048 s2: 4356 s3: 5325
Name John Smith Angela Samiou Cristine Dallas
City Heraklion Athens Heraklion
Birthday 25-09-1981 18-05-1965 23-04-1956
SampleID s6: SURG1
Owner Quality John Malkovits 2
Samples SName Owner s4: PAGNH1 4356 s5: PAGNH2 4356
Description DateTaken Cancer tissue 22-05-2007 Frozen sample 13-04-2006
GLOBAL SCHEMA
Fig. 1. The source Instance and the ontology used
In order to link similar concepts or relationships from different sources to a target schema, by way of an equivalent relation, a specific task is required. This is the mapping definition process and output of this task is the mapping, i.e. a collection of mapping rules [10]. A mapping is the specification of a mechanism for transforming the elements of a model conforming to a particular meta-model into elements of another model that conforms to another meta-model. Despite its pervasiveness and the substantial amount of work done in this area, the mapping definition process remains a very difficult problem. In practice it is done manually with the help of graphical user interfaces and it is a labour-intensive and error-prone activity even for humans. It is common that 60-80% of the resources in a data sharing problem are spent on reconciling semantic heterogeneity [11].
Empowering Provenance in Data Integration
273
The process of creating mapping rules is not a straightforward task since several conditions and cases may exist and multiple steps may be involved. An efficient mapping mechanism should be simple and intuitive enough so that the domain expert can understand it, use it, or at least verify it and should be expressive enough to capture the underlying schema semantics and the cases of heterogeneity encountered. Furthermore, an efficient mapping mechanism should be independent of the various implementation formats and of actual data transformation or mediation algorithms, and suitable to be used by subsequent data transformation procedures. 2.1 The Mapping Definition Several mapping languages have been employed for that purpose. Mapping between elements in schemas are usually expressed either as instances in an ontology of mappings, either as bridging axioms in first order logic to represent transformations, or using views to describe mappings from a target schema to local ones. An overview of the most representative mapping languages can be found in [12], [13]. However most of them are too complex to be understood by a domain expert and none of them provides annotation capability in order to store provenance information. In this paper we present shortly a simple and intuitive mapping format. For a full description please refer to [14]. In order to produce a simple and expressive language the basic mapping construct we use is proposition. The idea is that we convert the target and the source schemata into propositions and we map those to each other. Every model can be represented as class-role-class triples (c-r-c) i.e. a set of propositions, using nodes and links. The conversion into propositions is a trivial procedure that should be done automatically by a proper tool. The main arguments for mappings triples instead of single classes/fields are: a) the simplicity of the approach and b) that problems occur when trying to map single classes/fields. Those problems appear since usually we can map a field to several partially overlapping classes (-depending on data instance values) from the target schema and we cannot make a clear choice. Matching discovery tools, as long as the mapping rules are based on simple class correspondences, are really difficult to identify to which level of abstraction a correspondence needs to be found. Therefore, schema mapping cannot be based exclusively on classes.
Fig. 2. The basic mapping scheme (on the left) and an annotated mapping (on the right)
274
H. Kondylakis, M. Doerr, and D. Plexousakis
However, using the additional information that the relationships could provide we can decide the appropriate level of abstraction. Since, there is no proposition without a relationship we take advantage of those relationships to define the mapping correspondences. The mapping allows creating sets of propositions equivalent to the meaning of each source proposition, but in terms of the target schema. As the propositions are self-explanatory, they can be merged into huge knowledge pools, ignoring the boundaries of the source documents they were derived from. The basic mapping schema is shown in Fig. 2 (on the left). In some cases, we may need to combine paths sharing the same instances. Each c-r-c of the source schema is mapped individually to the target schema. Each class-role-class can be seen as selfexplanatory, context independent proposition. Having that in mind we need to define: 1. 2. 3. 4.
The mapping between the source domain classes and the target domain classes. The mapping between the source range classes and the target range classes. The proper source and target path. The mappings between source path and target path
The conversion of the source and the target schema into propositions and their corresponding graphical mappings are shown in Fig. 3. For example the Patient table maps to the “E21 Person” class and the patient’s SSN maps to “E43 Object Identifier”.
Fig. 3. The graphical mappings of our example
Empowering Provenance in Data Integration
275
An interesting thing about this specific mapping language is that we can annotate classes with additional information. This information can be either terms from a specific domain ontology/ taxonomy or just user comments. For example consider that we know that the biological objects from the second hospital are all breast cancer tumors but this information cannot be stored using the ontology shown in Fig. 1. We could annotate the class “E20 Biological Object” with the term “Breast Cancer tissue” coming from a cancer domain ontology/terminology. Moreover, we might want to say that the date format of the field “DateTaken” is “dd-mm-YYYY”. The final mapping is shown on Fig. 2 (on the right). Note that there is no restriction on the type of information that can be stored in the mappings. They can be either terms from a terminology or a whole path from another ontology or just comments from a domain expert. Moreover, there is no restriction on the “bag element” that will be used to transfer that information as long as it is agreed upon. Here we used the “E55 Type” class but it could be “Notes”, “Annotation element” or something else. Furthermore, since only the domain expert can fully understand the specific semantics of each source database he is capable to annotate the specific mappings with the information he may think important for the final user and the right time to perform that annotation is during the mapping process. 2.2 A Formal Definition of Our Language The mapping rules presented are simple enough to be produced and understood by an IT expert and this has been verified in practice1. In order to be used by a mediation/translation algorithm they should be further translated internally into a set of constraints. Before going further we shall introduce various concepts from the data exchange framework [15] that will be used here. The specification of a data exchange setting is given by a schema mapping M=(S, T, Σst ,Σt ), where S is a source schema, T is a target schema, Σst is a set of source to target dependencies (s-t dependencies) and Σt is a set of target dependencies. In the relational-to-relational data exchange framework as presented in [15], Σst is a finite set of s-t tuple generating dependencies (tgds) and the Σt is the union of a finite set of target tgds with a finite set of target equality generating dependencies (egds). A s-t tgd has the form ∀ xφ(x) Æ ∃ yψ(x, y) where φ(x) is a conjunction of atomic formulas over S and ψ(x, y) is a conjunction of atomic formulas over T. A target tgd has a similar form, except that φ(x) is a conjunction of atomic formulas over T. A target egd is of the form ∀ xφ(x) Æ x1=x2 where φ(x) is a conjunction of atomic formulas over T, and x1 and x2 are variables that occur in x. A data exchange setting (S, T, Σst, Σt ) can be thought of as a data integration system in which S is the source schema, T and Σt form the target schema and the sourceto-target dependencies in Σst are the assertions of the data integration system. In this paper we extend traditional relational-to-relational data exchange setting in a relational-to-ontology one. This is done by the translation of the mappings presented before to s-t tgds. In our case we don’t have target schema dependencies (Σt = {}), thus our system corresponds to a LAV data integration system [15]. In Fig. 4 we present the general algorithm that does not handle conditions but can be easily extended to cover those as well. 1
http://cidoc.ics.forth.gr/crm_mappings.html
276
H. Kondylakis, M. Doerr, and D. Plexousakis
At first the algorithm finds all the mappings for a specific source table. For every source table a new tgd is constructed, having as fields the corresponding source ranges. For example, using the mappings shown in Fig. 3, we can identify that the first four (the left ones) are for the table Patients, for the fields SSN, Name, City, and Birthday. So we add an s-t tgd with left-hand side (LHS) Patients(ssn, nm, ct, bd) according to line 6. Then we construct the right-hand side (RHS) of the tgd by adding the relational atom E21Person(X), since it is the target domain of all the graphical mappings (line 8). Next we add to the RHS all classes and properties of all target paths as relational atoms (line 10). Finally, we add the relational atoms named after the corresponding target ranges (line 11). For example, for the first mapping we add the conjunct E42ObjectIdentifier(ssn) to the RHS of the first tgd. The translation of the mappings shown in Fig. 3 is presented in Fig. 5.

Algorithm 3.1: ComputeTgds(S)
Input: The set of mappings S
Output: The set T with the corresponding tgds
1. Initialize T = {}
2. Let sdi, spi, sri, tdi, tpi and tri denote the source domain, the source path, the source range, the target domain, the target path and the target range of a mapping mi, respectively.
3. Let M1, …, Mi (i < |S|) be the sets of mappings for each table (using the joined_on attribute)
4. For every Mi add a tgd ∀x φ(x) → ∃y ψ(x, y) to T so that:
5.    a. the left-hand side is the relation sdi and
6.       For each sri in Mi add sri as a field of the relation sdi
7.    b. Construct the right-hand side as follows:
8.       i. Add the relation tdi
9.       ii. For each class C in tpi add a relational atom C(x)
10.      iii. For each role P in tpi connecting C(x) with C(y) add the relation P(x, y)
11.      iv. For each class C in tri add the relation tri(tri)
12. Return T
Fig. 4. The general algorithm for transforming our mapping expressions to tgds

Source-to-target dependencies Σst:
m1: Patient(ssn, nm, ct, bd) → ∃X ∃Y E21Person(X) ∧ P1isidentifiedby(X, ssn) ∧ E42ObjectIdentifier(ssn) ∧ P47isidentifiedby(X, nm) ∧ E41Appellation(nm) ∧ P76hascontactpoint(X, ct) ∧ E45Address(ct) ∧ P98broughtintolife(X, Y) ∧ E67Birth(Y) ∧ P4hastimespan(Y, bd) ∧ E52Timespan(bd)
m2: Sample(snm, ow, d, dt) → ∃Q ∃V ∃W ∃P E20BiologicalObject(Q) ∧ P47isidentifiedby(Q, snm) ∧ E42ObjectIdentifier(snm) ∧ P3hasnote(Q, d) ∧ E62String(d) ∧ P108Bhasproduced(Q, V) ∧ E8012SampleTaking(V) ∧ P4hastimespan(V, dt) ∧ E52Timespan(dt) ∧ P108Bfromlocation(V, W) ∧ A3BodyPart(W) ∧ P0Bears(W, P) ∧ E21Person(P) ∧ P1isidentifiedby(P, ow) ∧ E41ObjectIdentifier(ow)
m3: Tumors(i, o, q) → ∃L ∃H ∃B ∃S E20BiologicalObject(L) ∧ P1isidentifiedby(L, i) ∧ E42ObjectIdentifier(i) ∧ P108Bhasproduced(L, H) ∧ E8012SampleTaking(H) ∧ P108Bfromlocation(H, B) ∧ A3BodyPart(B) ∧ P0Bears(B, S) ∧ E21Person(S) ∧ P47isidentifiedby(S, o) ∧ E42Appellation(o) ∧ P43hasdimension(L, q) ∧ E54Dimension(q)

Fig. 5. The tgds produced and a solution J
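To make Algorithm 3.1 more concrete, the following Python sketch groups hypothetical mapping records by their source table and emits one textual tgd per table. The Mapping record layout, the restriction to single-property target paths, and all names are our own simplifications for illustration and not part of the original implementation.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Mapping:
    source_domain: str   # source table, e.g. "Patient"
    source_range: str    # source column, e.g. "ssn"
    target_domain: str   # ontology class of the target domain, e.g. "E21Person"
    target_path: list    # ontology properties on the path, e.g. ["P1isidentifiedby"]
    target_range: str    # ontology class of the target range, e.g. "E42ObjectIdentifier"

def compute_tgds(mappings):
    by_table = defaultdict(list)
    for m in mappings:                       # step 3: group mappings per source table
        by_table[m.source_domain].append(m)
    tgds = []
    for table, ms in by_table.items():
        lhs = f"{table}({', '.join(m.source_range for m in ms)})"   # steps 5-6
        rhs = [f"{ms[0].target_domain}(X)"]                         # step 8
        for m in ms:
            for prop in m.target_path:       # steps 9-10 (single-step paths only)
                rhs.append(f"{prop}(X, {m.source_range})")
            rhs.append(f"{m.target_range}({m.source_range})")       # step 11
        tgds.append(lhs + " -> " + " AND ".join(rhs))
    return tgds

if __name__ == "__main__":
    ms = [Mapping("Patient", "ssn", "E21Person", ["P1isidentifiedby"], "E42ObjectIdentifier"),
          Mapping("Patient", "nm", "E21Person", ["P47isidentifiedby"], "E41Appellation")]
    for tgd in compute_tgds(ms):
        print(tgd)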
Having established the s-t tgds, we need to identify what a solution for a source instance I under M is. We say that J is a solution for I under M if J is a finite target instance such that (I, J) satisfies Σst ∪ Σt; in other words, (I, J) satisfies the schema mapping M. One solution J for our source I presented in Fig. 1, under the source-to-target dependencies Σst of Fig. 5, is shown in Fig. 6. The solution J may contain labeled nulls. In Fig. 6 the xi, yi, qi, vi, wi, li, hi, bi, si are labeled nulls. Distinct labeled nulls are used to denote possibly different unknown values in the target instance. Let K and K′ be two instances. We say that h is a homomorphism from K to K′, denoted as h: K → K′, if h maps the constants and labeled nulls of K to the constants and labeled nulls of K′ such that h(c) = c for every constant c, and for every tuple R(t) of K we have that R(h(t)) is a tuple of K′. We also say that h is a homomorphism from a formula φ(x) to an instance K, denoted as h: φ(x) → K, if h maps the variables of x to constants or labeled nulls in K such that for every relational atom R(y) that occurs in φ(x), R(h(y)) is a tuple in K.
[Figure 6 lists the instance data of one solution J: relations such as E21Person, P1isidentifiedby, E42ObjectIdentifier, P4hastimespan, E52Timespan, E41Appellation, P76hascontactpoint, E20BiologicalObject, P0Bears, P47isidentifiedby, E45Address, P98broughtintolife, E67Birth, E54Dimension, P108Bfromlocation, A3BodyPart, E62String, P43hasdimension, P108Bhasproduced, P3hasnote and E8012SampleTaking, populated with source constants and labeled nulls.]

Fig. 6. One solution J
In general, there are many possible solutions for I under M. A universal solution J for I under M has the property that it is a solution and that it is the most general one, in that there is a homomorphism from J to every solution for I under M. It was shown in [15] that the result of chasing I with Σst ∪ Σt is a universal solution. Moreover, the notion of certain answers from indefinite databases is adopted as the semantics for query answering in data exchange.
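The following small Python sketch makes the notion of homomorphism concrete: it checks whether a candidate mapping h is a homomorphism between two toy instances. The representation of atoms and the naming convention for labeled nulls (a letter followed by digits) are assumptions of this illustration only.

def is_null(value):
    # Convention of this sketch: labeled nulls look like "x1", "y2", "p7".
    return (isinstance(value, str) and len(value) > 1
            and value[0].isalpha() and value[1:].isdigit())

def is_homomorphism(h, K, K_prime):
    """Check h: K -> K_prime, i.e. constants are fixed and every mapped tuple of K occurs in K_prime."""
    def image(v):
        return h.get(v, v) if is_null(v) else v
    return all((rel, tuple(image(v) for v in args)) in K_prime for rel, args in K)

if __name__ == "__main__":
    K = {("E21Person", ("x1",)), ("P1isidentifiedby", ("x1", "1048"))}
    J = {("E21Person", ("p7",)), ("P1isidentifiedby", ("p7", "1048"))}
    print(is_homomorphism({"x1": "p7"}, K, J))   # True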
3 Retrieving Provenance Information

Having established the formal framework, we proceed to define the provenance information we want to retrieve. Consider, for example, that a doctor wants to retrieve the identifiers of all biological objects as well as their owners' names. The answer to that question is {“Angela Samiou – PAGNH1”, “Angela Samiou – PAGNH2”, “John Malkovits – SURG1”}. Since we notice two different identifier schemes for the hospitals, we would like to know where each specific datum comes from, i.e., the source provenance of each datum. Moreover, we would like to know how the answer was computed from the source database, in order to ensure that no wrong answer was produced. This is what we call tuple provenance. Finally, we would like to know any additional information a domain expert might have added concerning the SURG1 biological object. This is what we call annotation provenance. In order to retrieve that information, users/tools need to issue a question in an appropriate language, such as SPARQL, capable of retrieving ontology instances. Whether SPARQL or some other query language is used, the query should contain a set of triple patterns called a basic graph pattern. Triple patterns are like RDF triples except that each of the subject, predicate and object may be a variable. A basic graph pattern matches a subgraph of the ontology when terms from that subgraph may be substituted for the variables and the result is a graph equivalent to this subgraph. The basic graph pattern for the previous question is: “E41Appellation(x) – P1isidentifiedby(x, y) – E21Person(y) – P0Bears(y, z) – A3BodyPart(z) – P108Bfromlocation(z, w) – E8012SampleTaking(w) – P108Bhasproduced(w, u) – E20BiologicalObject(u) – P47isidentifiedby(u, s) – E42ObjectIdentifier(s)”. Let us now try to identify how the result “Angela Samiou – PAGNH1” has been produced using only the tgds from Fig. 5 and the source instance I from Fig. 1. The “Angela Samiou” instance has been produced via the step s2 →(m1) E41Appellation(“Angela Samiou”) and “PAGNH1” via s4 →(m2) E42ObjectIdentifier(“PAGNH1”), so the result has actually been produced from the tuples s2 and s4 of the source database Hospital 1. In particular, the previous example shows that s2 satisfies the tgd m1 and that s4 satisfies the tgd m2. More specifically, we use and extend the notion of a “satisfaction step” as defined in [16] and then define tuple provenance.
Definition 1 (Satisfaction step). Let σ be a tgd ∀x φ(x) → ∃y ψ(x, y). Let K and K1 be instances such that K contains K1 and K satisfies σ. Let h be a homomorphism from φ(x) ∧ ψ(x, y) to K such that h is also a homomorphism from φ(x) to K1. We say that σ can be satisfied on K1 with homomorphism h and solution K, or simply that σ can be satisfied on K1 with homomorphism h if K is understood from the context. The result of satisfying σ on K1 with homomorphism h is K2, where K2 = K1 ∪ h(ψ(x, y)) and h(ψ(x, y)) = {R(h(z)) | R(z) is a relational atom in ψ(x, y)}. We denote this step as K1 →(σ, h) K2.

In the example described earlier, where s2 →(m1) E41Appellation(“Angela Samiou”), the first satisfaction step is ({s2}, ∅) →(m1, h1) ({s2}, {E41Appellation(“Angela Samiou”)}), where h1 = {ssn→4356, nm→“Angela Samiou”, ct→“Athens”, bd→18-05-1965, X→x2, Y→y2}.
The result of satisfying m1 on the instance ({s2}, ∅) with homomorphism h1 and solution J of Fig. 6 is ({s2}, {E41Appellation(“Angela Samiou”)}).

Now we are ready to describe the ComputeH algorithm shown in Fig. 7. ComputeH tries to find possible assignments, i.e., possible (σ, h) pairs, for a tuple t. As an example, consider that we call ComputeH(I, J, “Angela Samiou”, m1), where I, J and m1 are taken from Figs. 1, 6 and 5, respectively. Using the relational atom E41Appellation(“Angela Samiou”) in line 5, ComputeH defines u1 as {nm→“Angela Samiou”}. When u1 is applied to the left-hand side of m1, we obtain the partially instantiated relational atom Patient(ssn, “Angela Samiou”, ct, bd). Hence the assignment u2 from line 6 is {ssn→4356, ct→“Athens”, bd→18-05-1965}. With u1 ∪ u2, the left-hand side of m1 corresponds to the tuple s2 in the Patients relation, and the right-hand side of m1 is the conjunction of the tuples E21Person(x2), P1isidentifiedby(x2, 4356), E42ObjectIdentifier(4356), P47isidentifiedby(x2, “Angela Samiou”), E41Appellation(“Angela Samiou”), P76hascontactpoint(x2, “Athens”), E45Address(“Athens”), P98broughtintolife(x2, y2), E67Birth(y2), P4hastimespan(y2, 18-05-1965), E52Timespan(18-05-1965). So line 7 returns u3 as {X→x2, Y→y2}. The algorithm then returns u1 ∪ u2 ∪ u3 (line 8).

Algorithm 4.1: ComputeH(I, J, t, σ)
Input: A source instance I, a solution J for I under M, a tuple t ∈ J of the form R(a), and a tgd σ (∀x φ(x) → ∃y ψ(x, y)) in Σst.
Output: An assignment h such that h(φ(x)) ⊆ I, h(ψ(x, y)) ⊆ J and t ∈ h(ψ(x, y))
1. Let R(z) be a relational atom of ψ(x, y).
2. If no such relational atom can be found
3.    Return failure
4. else
5.    Let u1 be a mapping that assigns the i-th variable of z to the i-th value of a in R(a).
6.    Let u2 be an assignment of variables in u1(φ(x)) to values in I so that u2(u1(φ(x))) ⊆ I.
7.    Let u3 be an assignment of variables in u2(u1(ψ(x, y))) to values in J so that u3(u2(u1(ψ(x, y)))) ⊆ J.
8.    Return u1 ∪ u2 ∪ u3
Fig. 7. The ComputeH algorithm

Algorithm 4.2: TupleProvenance(I, J, ri)
Input: A source instance I, a solution J for I under M, a result ri of a question with values (instances) v1, …, vj.
Output: A set of tuples S = {s1, s2, …, sn} from the source instance I.
1. Initialize S = {}, MP = {}, V = {v1, …, vj}
2. While V ≠ {}
3.    W = v1, V = V − v1
4.    For every s-t tgd σ not in MP and assignment h such that h is a possible assignment returned by ComputeH(I, J, W, σ)
5.       MP = MP ∪ σ
6.       S = S ∪ LHS(h(σ))
7.       V = V ∪ RHS(h(σ)) − v1   // add the conjunct instances to V
8. Return S

Fig. 8. The TupleProvenance algorithm
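The following Python sketch mimics Algorithms 4.1 and 4.2 on a toy data model: an instance is a set of ground atoms (relation, values), a tgd has one source atom pattern and a list of target atom patterns, and variables are strings prefixed with "?". The data structures and names are our own illustration under these assumptions, not the authors' implementation.

def unify(pattern_args, ground_args, assignment):
    """Extend `assignment` so that the pattern arguments match the ground ones, or return None."""
    if len(pattern_args) != len(ground_args):
        return None
    out = dict(assignment)
    for term, value in zip(pattern_args, ground_args):
        if isinstance(term, str) and term.startswith("?"):   # variable
            if term in out and out[term] != value:
                return None
            out[term] = value
        elif term != value:                                   # constant mismatch
            return None
    return out

def match_rhs(rhs, J, assignment):
    """Ground the remaining target patterns against J, backtracking if necessary."""
    if not rhs:
        return assignment
    rel, args = rhs[0]
    for rel2, vals in J:
        if rel2 != rel:
            continue
        ext = unify(args, vals, assignment)
        if ext is not None:
            result = match_rhs(rhs[1:], J, ext)
            if result is not None:
                return result
    return None

def compute_h(I, J, t, tgd):
    """Algorithm 4.1: an assignment h with h(lhs) in I, h(rhs) in J and t in h(rhs)."""
    t_rel, t_args = t
    for rel, args in tgd["rhs"]:
        if rel != t_rel:
            continue
        u1 = unify(args, t_args, {})                          # line 5
        if u1 is None:
            continue
        lhs_rel, lhs_args = tgd["lhs"]
        for rel2, vals in I:                                  # line 6: ground lhs in I
            if rel2 != lhs_rel:
                continue
            u2 = unify(lhs_args, vals, u1)
            if u2 is None:
                continue
            h = match_rhs(tgd["rhs"], J, u2)                  # line 7: ground rhs in J
            if h is not None:
                return h
    return None                                               # failure

def apply(atom, h):
    rel, args = atom
    return (rel, tuple(h.get(a, a) for a in args))

def tuple_provenance(I, J, result_atoms, tgds):
    """Algorithm 4.2: collect the source tuples that contributed to a result."""
    S, MP, V = set(), set(), list(result_atoms)
    while V:
        w, V = V[0], V[1:]
        for name, tgd in tgds.items():
            if name in MP:
                continue
            h = compute_h(I, J, w, tgd)
            if h is None:
                continue
            MP.add(name)
            S.add(apply(tgd["lhs"], h))                       # instantiated source tuple
            V += [apply(a, h) for a in tgd["rhs"] if apply(a, h) != w]
    return S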
Now it is time to formally describe what tuple provenance is.

Definition 2 (Tuple provenance). Let M = (S, T, Σst, Σt) be a schema mapping, I a source instance, and J a solution for I under M. Let ri ⊆ J (ri ≠ ∅) be the i-th result of the answer to a question Q with basic graph pattern G. The tuple provenance for ri with M, I and J (in short, the tuple provenance for ri) is a set of tuples s1, s2, …, sn, where each si is the beginning of a satisfaction step (si, ∅) →(mi, hi) (si, ti), with ti ⊆ G and mi ∈ Σst.
We show next an algorithm for computing the tuple provenance for a result ri. The algorithm, named TupleProvenance, is shown in Fig. 8 and makes use of the ComputeH algorithm shown in Fig. 7. It returns the source tuples that contributed to a specific answer by trying to reconstruct the basic graph pattern of the query using the tgds. Consider, for example, that we want to find the tuple provenance for the result “Angela Samiou – PAGNH1”. In line 1 we initialize S = {}, MP = {}, V = {E41Appellation(Angela Samiou), E42ObjectIdentifier(PAGNH1)}. Then, in line 3, we set W = E41Appellation(Angela Samiou), so V becomes V = {E42ObjectIdentifier(PAGNH1)}, and we look in M for a tgd σ and an assignment h such that h is a possible assignment returned by ComputeH. Such a tgd is m1, so in lines 5, 6, 7 we get MP = MP ∪ {m1}, S = S ∪ {Patient(4356, Angela Samiou, Athens, 18-05-1965)} and V = {E42ObjectIdentifier(PAGNH1)} ∪ {E21Person(x2), P1isidentifiedby(x2, 4356), E42ObjectIdentifier(4356), P47isidentifiedby(x2, Angela Samiou), P76hascontactpoint(x2, Athens), E45Address(Athens), P98broughtintolife(x2, y2), E67Birth(y2), P4hastimespan(y2, 18-05-1965), E52Timespan(18-05-1965)}, respectively. Then we go to line 3 again; now W = E42ObjectIdentifier(PAGNH1) and V = {E21Person(x2), P1isidentifiedby(x2, 4356), E42ObjectIdentifier(4356), P47isidentifiedby(x2, Angela Samiou), P76hascontactpoint(x2, Athens), E45Address(Athens), P98broughtintolife(x2, y2), E67Birth(y2), P4hastimespan(y2, 18-05-1965), E52Timespan(18-05-1965)}. Then we look in M for a tgd σ and an assignment h such that h is a possible assignment returned by ComputeH. Such a tgd is m2, so we get MP = {m1, m2}, S = S ∪ {Samples(PAGNH1, 4356, Cancer Tissue, 22-05-2007)} and V = V ∪ {E20BiologicalObject(q1), P47isidentifiedby(q1, PAGNH1), P3hasnote(q1, Cancer Tissue), E62String(Cancer Tissue), …}. Then the algorithm tries to find assignments for each of the remaining elements in V using the remaining tgds, but ComputeH always returns failure. So S = {Patient(4356, Angela Samiou, Athens, 18-05-1965), Samples(PAGNH1, 4356, Cancer Tissue, 22-05-2007)} is returned to the user.

Consider now that we want to retrieve any annotation available for the result “Angela Samiou – PAGNH1” returned by the example query. We first define what annotation provenance is and then present the corresponding algorithm.

Definition 3 (Annotation provenance). Let M = (S, T, Σst, Σt) be a schema mapping, I a source instance, and J a solution for I under M. Let ri ⊆ J (ri ≠ ∅) be the i-th result of the answer to a question Q with basic graph pattern G. The annotation provenance for ri with M, I and J (in short, the annotation provenance for ri) is a set of strings st1, st2, …, stn, where each sti is the annotation stored in the XML mappings of
the corresponding mapping mi. Each mi is a mapping used in a satisfaction step (si, ∅) →(mi, hi) (si, ti), with ti ⊆ G and mi ∈ Σst.
In order to retrieve the annotation information we use the AnnotationProvenance algorithm, shown in Fig. 9, which is similar to the TupleProvenance algorithm. The basic idea is the same, but AnnotationProvenance does not try to find the source tuples; instead, it finds the corresponding tgds.
Algorithm 4.3: AnnotationProvenance(I, J, ri)
Input: A source instance I, a solution J for I under M, a result ri of a question with values (instances) v1, …, vj.
Output: The set ST with the annotations.
1. Initialize S = {}, MP = {}, V = {v1, …, vj}
2. While V ≠ {} and M ≠ {}
3.    W = v1, V = V − v1
4.    For every s-t tgd σ not in MP and assignment h such that h is a possible assignment returned by ComputeH(I, J, W, σ)
5.       MP = MP ∪ σ, M = M − σ
6.       S = S ∪ LHS(h(σ))
7.       V = V ∪ RHS(h(σ)) − v1   // add the conjunct instances to V
8. ST = {}
9. For each σ in MP
10.   Let MK = {M1, …, Mk} be the set of the XML mappings that produced σ
11.   For every Mi in MK
12.      ST = ST ∪ {Annotation information in Mi}
13. Return ST
Fig. 9. The AnnotationProvenance algorithm
Algorithm 4.4: SourceProvenance(I, J, ri)
Input: A source instance I, a solution J for I under M, a result ri of a question with values (instances) v1, …, vj.
Output: The set DB with the databases contributing to ri (lines 1–7 are the same as in Algorithm 4.3)
8. DB = {}
9. For each σ in MP
10.   Let db be the name of the source database from which σ was produced.
11.   DB = DB ∪ {db}
12. Return DB
Fig. 10. The SourceProvenance algorithm
Thus, the first seven lines of the algorithm are executed in the same way as in TupleProvenance. Continuing our example, when the algorithm reaches line 8 we have m1 and m2 in MP. We then search the corresponding XML mappings to retrieve the annotation provenance as it is stored there. To be more specific, we retrieve all the annotation information of the elements participating in the basic graph pattern G. For example, we find that the sample “PAGNH1” is a “Breast Cancer tissue” biological object.
The previous algorithm can also be extended to provide source provenance. We simply change lines 9–12 of the AnnotationProvenance algorithm (Fig. 9): instead of retrieving the annotations from the XML documents, we just collect the names of the source databases that participated in the mappings. The resulting algorithm is shown in Fig. 10, and in our example it reports that the “Hospital 1” database has produced the specific results.

Definition 4 (Source provenance). Let M = (S, T, Σst, Σt) be a schema mapping, I a source instance, and J a solution for I under M. Let ri ⊆ J (ri ≠ ∅) be the i-th result of the answer to a question Q with basic graph pattern G. The source provenance for ri with M, I and J (in short, the source provenance for ri) is a set of databases db1, …, dbk, where each dbj, 1 ≤ j ≤ k, is the database from which the corresponding mi has been produced. Each mi is a mapping used in a satisfaction step (si, ∅) →(mi, hi) (si, ti), with ti ⊆ G and mi ∈ Σst.
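Under the same toy model as the earlier sketch, Algorithms 4.3 and 4.4 differ from TupleProvenance only in what is collected for the matched tgds. The sketch below assumes, purely for illustration, that each tgd record carries the annotations and the source database name taken from its XML mapping, and that the set MP of matched tgd names is obtained by lines 1–7.

def annotation_provenance(matched_tgds, tgd_catalog):
    """Algorithm 4.3, lines 8-13: collect the annotations of the matched tgds."""
    return {ann for name in matched_tgds
                for ann in tgd_catalog[name].get("annotations", [])}

def source_provenance(matched_tgds, tgd_catalog):
    """Algorithm 4.4, lines 8-12: collect the source databases of the matched tgds."""
    return {tgd_catalog[name]["source_db"] for name in matched_tgds}

if __name__ == "__main__":
    catalog = {   # hypothetical metadata attached to the tgds
        "m1": {"source_db": "Hospital 1", "annotations": []},
        "m2": {"source_db": "Hospital 1",
               "annotations": ["sample annotation from the domain expert"]},
    }
    print(annotation_provenance({"m1", "m2"}, catalog))
    print(source_provenance({"m1", "m2"}, catalog))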
4 Complexity Evaluation

In this section we demonstrate the feasibility of our solution in terms of complexity. We start by showing that the conversion of our XML mappings to tgds can be done in polynomial time in the size of the XML mappings. The algorithm ComputeTgds takes as input the XML file with the mappings of a relational database to an ontology and produces the corresponding tgds. It has to check each mapping once in order to identify to which table it corresponds. Having grouped all the mappings for each table, we can easily produce the corresponding tgd. Thus, the algorithm runs in polynomial time in the size of the input XML mappings. The next algorithm is ComputeH, which tries to find an assignment h such that h(φ(x)) ⊆ I, h(ψ(x, y)) ⊆ J and t ∈ h(ψ(x, y)). The input given to the algorithm is a source instance I, a solution J for I under M, a tuple t ∈ J of the form R(a), and a tgd σ (∀x φ(x) → ∃y ψ(x, y)) in Σst. To this end, the algorithm tries to find a combination of mapping and source tuple that produces the specific result from the solution; as described in the algorithm, this is calculated in three simple steps. It remains to show the complexity of our provenance algorithms. All three of them try to rebuild the basic graph pattern of a query Q using the information from the result, the tgds, the solution J and the source instance I. The algorithms have to check for every mapping whether an assignment h can be found using the result, the mapping, the solution J and the source instance I. If one such mapping exists, we add the instantiated conjuncts to the result and start over. Each algorithm is guaranteed to terminate, since for every σ there are finitely many conjuncts and finitely many other mappings to check; in the worst case it terminates after checking all the conjuncts and all the mappings. Hence it runs in polynomial time in the size of the mappings and the conjuncts.
5 Related Work

From the early 1990s, Wang et al. [17] used the pre-existing polygen model and algebra to track provenance as a form of annotations. The results of queries could carry along
source attributions in each column of each tuple. Then, Woodruff et al. [18] first proposed the idea of retrieving fine-grained provenance from a database without using annotations. Their idea was to define weak inverses for the functions defined in their code (a weak inverse, when applied to some data element in the result of a function, returns some approximation of the provenance associated with that function). A fundamental drawback of this technique was that the user was required to provide the inverse functions and their corresponding verification functions. In [19] the data transformation engine was reengineered so that additional information about which source schema elements and mappings contributed to the creation of a target datum was propagated and stored with the target data. This information could later be queried using the MXQL query language. In our approach we do not have to reengineer any data transformation engine; we just use the information returned by the pre-existing data transformation engines. Subsequently, various theories and systems were developed to propagate annotations from the sources to the output, such as [20] and DBNotes [21], but several issues remain open in that direction. In general, query answering in these settings involves generalizing the relational algebra to perform corresponding operations on the annotations. The first attempt at a general theory of relations with annotations appears to be [22], where axiomatized label systems are introduced in order to study containment. Lee et al. [23] present a framework for explicitly storing the provenance (which they call attribution) of data items in query results based on a mediation architecture. Coarse-grained provenance information is stored as queries are computed; it identifies which source data items results were derived from, along with additional information such as timestamps and source quality. Provenance semirings [24] seem to be really promising in that direction. However, in our approach we do not store annotation information on the sources; rather, we store annotation information for the sources using the mappings produced. Besides annotation propagation, several systems try to identify where- and why-provenance without trying to retrieve annotations. More specifically, why-provenance in data warehouses was studied by Cui et al. [25] and Buneman et al. [4]. In [25] a definition was given for relational views, which also shows how to compute why-provenance for queries in the relational algebra. Their work was then re-examined in [4], where the authors give a syntactic characterization of why-provenance and show that it is invariant under query rewriting. They also show that where-provenance is problematic and cannot in general be expected to be invariant under query rewriting. However, their approach to computing provenance is via syntactic analysis of the query. In our approach, by contrast, we do not try to return provenance information based on query rewritings but based on the result returned by the query. Moreover, in SPIDER [16], provenance information called a route is used to describe relationships between source and target data in a data integration scenario. However, the purpose there is to understand and debug the specifications of the integration system and to correct the schema mappings, and the information produced is calculated on a tuple from a solution J. Our proposal is not about debugging schema mappings but about returning provenance information.
We can compute provenance based on a result of some query instead of on some tuples from a solution J. Moreover, we can also return annotation information, which is stored in the schema mappings.
6 Conclusion and Future Work

We tried to respond to all three problems Buneman reported for data integration. We started by describing a simple and intuitive mapping language, and we showed that the mappings produced can be transformed into s-t tgds, extending the traditional data exchange setting. The mapping language is capable of storing annotation information for the source schemata, which can be retrieved efficiently in polynomial time. Moreover, we showed that we can use those mappings to retrieve the source tuples that contributed to a specific answer. To the best of our knowledge, no other provenance framework is capable of presenting source, tuple and annotation provenance information. Our system does not require additional information to be stored at the sources; it neither changes the underlying engine nor uses a specialized query language. Furthermore, none of the previous solutions presents a complete provenance tracking solution for the general relational-to-ontology case. An interesting future extension would be to extend our framework to include web-annotated databases. Moreover, it would be useful to check whether and how we can apply our algorithms to other scenarios such as security, dataflow, and extract-transform-load. Finally, providing archive provenance is a widely recognized future direction [26]. Databases and schemas evolve over time. A complete record of provenance should entail archiving all past states of the evolving database, so that it becomes possible to trace the provenance of data to the correct version of the database or to trace the flow of data in a version of the database that is not necessarily the most recent.
References

1. Buneman, P.: Information Integration Needs a History Lesson. University of Edinburgh, Edinburgh (2006)
2. Buneman, P., Cheney, J.: On the Expressiveness of Implicit Provenance in Query and Update Languages. ACM Transactions on Database Systems V, 1–45 (2008)
3. Glavic, B., Dittrich, K.R.: Data Provenance: A Categorization of Existing Approaches. In: BTW (2007)
4. Buneman, P., Khanna, S., Tan, W.C.: Why and Where: A Characterization of Data Provenance. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, p. 316. Springer, Heidelberg (2001)
5. Uschold, M., Gruninger, M.: Ontologies: Principles, methods and applications. Knowledge Engineering Review 11, 93–155 (1996)
6. Konstantinou, N., Spanos, D.-E., Mitrou, N.: Ontology and database mapping: A survey of current implementations and future directions. Journal of Web Engineering 7, 1–24 (2008)
7. Auer, S., Ives, Z.G.: Integrating Ontologies and Relational Data. University of Pennsylvania, Department of Computer and Information Science, Technical Report No. MS-CIS-07-24 (2007)
8. Lenzerini, M.: Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, Madison (2002)
9. Doerr, M., Ore, C.-E., Stead, S.: The CIDOC conceptual reference model: a new standard for knowledge sharing. In: Tutorials, posters, panels and industrial contributions at the 26th International Conference on Conceptual Modeling, vol. 83. Australian Computer Society, Inc., Auckland (2007)
10. Klein, M.: Combining and relating ontologies: an analysis of problems and solutions. In: IJCAI (2001)
11. Doan, A., Noy, N.F., Halevy, A.Y.: Introduction to the special issue on semantic integration. ACM SIGMOD Record 33, 11–13 (2004)
12. Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. Knowl. Eng. Rev. 18, 1–31 (2003)
13. Choi, N., Song, I.-Y., Han, H.: A survey on ontology mapping. SIGMOD Record 35, 34–41 (2006)
14. Kondylakis, H., Doerr, M., Plexousakis, D.: Mapping Language for Information Integration. Technical Report 385, ICS-FORTH (December 2006)
15. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. Theoretical Computer Science 336, 89–124 (2005)
16. Chiticariu, L., Tan, W.-C.: Debugging schema mappings with routes. In: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, Seoul (2006)
17. Wang, Y.R., Madnick, S.E.: A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. In: Proceedings of the 16th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco (1990)
18. Woodruff, A., Stonebraker, M.: Supporting Fine-grained Data Lineage in a Database Visualization Environment. In: Proceedings of the Thirteenth International Conference on Data Engineering. IEEE Computer Society, Los Alamitos (1997)
19. Velegrakis, Y., Miller, R.J., Mylopoulos, J.: Representing and Querying Data Transformations. In: Proceedings of the 21st International Conference on Data Engineering. IEEE Computer Society, Los Alamitos (2005)
20. Buneman, P., Khanna, S., Tan, W.-C.: On propagation of deletions and annotations through views. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, Madison (2002)
21. Tan, W.C.: Containment of relational queries with annotation propagation. In: Workshop on Database and Programming Languages, pp. 37–53 (2003)
22. Ioannidis, Y.E., Ramakrishnan, R.: Containment of conjunctive queries: beyond relations as sets. ACM Trans. Database Syst. 20, 288–324 (1995)
23. Lee, T., Bressan, S., Madnick, S.E.: Source Attribution for Querying Against Semi-structured Documents. In: Workshop on Web Information and Data Management (1998)
24. Green, T.J., Karvounarakis, G., Tannen, V.: Provenance semirings. In: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, Beijing (2007)
25. Cui, Y., Widom, J.: Practical Lineage Tracing in Data Warehouses. In: Proceedings of the 16th International Conference on Data Engineering. IEEE Computer Society, Los Alamitos (2000)
26. Tan, W.C.: Provenance in Databases: Past, Current, and Future. IEEE Data Eng. Bull. 30, 3–12 (2007)
Detecting Moving Objects in Noisy Radar Data Using a Relational Database

Andreas Behrend¹, Rainer Manthey¹, Gereon Schüller², and Monika Wieneke²

¹ University of Bonn, Institute of Computer Science III, Römerstraße 164, 53117 Bonn, Germany
{behrend,manthey}@cs.uni-bonn.de
² FGAN e.V., Dept. FKIE-SDF, Neuenahrer Straße 20, 53343 Wachtberg, Germany
{schueller,wieneke}@fgan.de
Abstract. In moving object databases, many authors assume that the number and position of the objects to be processed are always known in advance. Detecting an unknown moving object and pursuing its movement, however, is usually left to tracking algorithms outside the database in which the sensor data needed is actually stored. In this paper we present a solution to the problem of efficiently detecting targets over sensor data from a radar system based on database techniques. To this end, we implemented the recently developed probabilistic multiple hypothesis tracking approach using materialized SQL views and techniques for their incremental maintenance. We present empirical measurements showing that incremental evaluation techniques are indeed well-suited for efficiently detecting and tracking moving objects from a high-frequency stream of sensor data in this particular context. Additionally, we show how to efficiently simulate the aggregate function product, which is fundamental for combining independent probabilistic values but not yet supported by the SQL standard.
1 Introduction
The research area moving object databases is concerned with the management of objects that continuously change their spatiotemporal extent [28]. In particular, the tracking of object movements and the representation of uncertainty with respect to object positions are fundamental problems in spatiotemporal applications. The detection of an unknown moving object and its tracking, however, is usually left to external tools outside the database in which radar data is actually stored. The reason for this is twofold: On the one hand, SQL queries seem to be inappropriate for implementing probabilistic computations occurring in tracking algorithms based on hypothesis testing. On the other hand, tracking algorithms are usually applied to fast changing sensor data which would imply a costly reevaluation of SQL queries employed for track analysis. In this paper, we present an efficient solution to implementing a tracking algorithm directly in SQL applied to a stream of sensor data stored in a conventional database. We have chosen the probabilistic multiple hypothesis tracking
approach (PMHT) [21] for pursuing multiple targets. These military targets are moving objects whose tracks are to be determined in a cluttered environment by using a time-series of (potentially inaccurately) measured positions coming from a radar system. For differentiating false from true tracks of an unknown number of moving targets, we consider an extended version of the PMHT framework including a sequential likelihood test as proposed in [26,27]. A description of PMHT is given in Section 2. Since the sensor data to be analyzed are stored and managed by a relational database system, it seems reasonable to perform a large amount of the underlying computation using SQL queries directly defined over the stream of measured data. The resulting statements can be considered continuous queries of considerable complexity which have to be re-evaluated as soon as new data arrive. It is widely believed that conventional relational database systems are not well-suited for dynamically processing continuous queries, and various specialized stream processing engines have been proposed for their evaluation in the literature [1,2,3,17]. We believe, however, that even conventional SQL queries can be efficiently employed for analyzing a wide spectrum of data streams. In our approach, continuous queries are represented as materialized SQL views, and the recomputation of probability values is carried out by using specialized update statements activated by triggers referring to changes in the underlying tables (cf. Section 5). Since a great portion of the materialized view content remains unchanged, the application of delta views considerably enhances the efficiency of view maintenance. Update propagation (UP) is not a new research topic but has been intensively studied for many years, mainly in the context of integrity checking and materialized view maintenance [9,13,14,16]. The application of UP for analyzing data streams, however, has not attracted much attention so far, but there is first practical evidence for its feasibility in this context [6]. The importance of this approach is underlined by the fact that UP plays an important role in the context of the continuous query extension proposal currently announced for Oracle [22]. This paper demonstrates the feasibility of an SQL-based implementation of probabilistic hypothesis testing. In contrast to external approaches, our flexible declarative solution is easily modifiable, allows for synchronized multi-user access, and is robust against system failures because of the integrated transaction management. Another goal of the paper is to present the performance gain to be achieved by using incremental techniques for analyzing data streams. We show that the combination of conventional database techniques represents an elegant and feasible way towards solving a considerable spectrum of practical stream analysis problems. The achieved performance gain, shown in Section 6, fully scales with the amount of data stored and allows a considerable increase of the update frequency. Additionally, UP allows us to efficiently simulate the aggregate function product, which is essential in many probabilistic computations but not yet supported by the SQL standard and most commercial database systems. Our paper supports the claim that the incremental evaluation of SQL views provides a suitable approach for analyzing a wide spectrum of data stream applications.
2 Probabilistic Hypothesis Testing
Tracking in general is the problem of deriving a complete track of a moving object from a time-series of measured positions. Because of inaccurate measurements or clutter in sensor data, however, it is not guaranteed that a target of interest is detected at each point in time. When multiple targets are present, the association between measurements and targets is unknown, too. The numerous possible track associations lead to a multitude of different assignment hypotheses. Therefore, solving the assignment problem is the central task of every tracking algorithm pursuing multiple targets. The traditional approaches to multiple hypothesis tracking rely on a complete enumeration of all possible interpretations of a series of measurements and avoid exponential growth of the arising hypothesis trees by various approximations [5,7]. A powerful alternative is the probabilistic multiple hypothesis tracking approach (PMHT) [21,26]. Essentially, PMHT is based on Expectation Maximization (EM) [10,23] for handling assignment conflicts. The PMHT approach employs a sliding data window (also called batch) and exploits the information of previous and following scans in each of its kinematic state estimates. PMHT is a computationally efficient algorithm because its memory usage remains linear in all parameters. As a target-oriented approach, however, it requires the number of targets in the surveillance area to be known a priori, and it expects all tracks to truly exist. This requirement can hardly be satisfied in any realistic scenario. In real-world applications, new targets appear from time to time in the field of view. They have to be recognized as targets, and their initial kinematic states have to be estimated. On the other hand, a target may leave the surveillance area and thus has to be removed from the list of established tracks. In order to determine the actual number of true target tracks, the PMHT framework has been extended by a sequential likelihood-ratio test in [27].

2.1 The PMHT Algorithm
The sensor output $Z_t$ at time $t$ consists of the set of measurements $\mathbf{z}_t$ and of the number of measurements $N_t$. A sensor generates a series of measurements $Z = Z_{1:T} = \{\mathbf{z}_t, N_t\}_{t=1}^{T}$ for a time interval $[1:T]$. Let $S$ be the number of targets in the surveillance area. The task of tracking is to estimate the kinematic states $X = X_{1:T} = \{\{x_t^s\}_{s=1}^{S}\}_{t=1}^{T}$ of the observed targets, where the state $x = (x, \dot{x}, y, \dot{y})$ represents position $x, y$ and velocity $\dot{x}, \dot{y}$ in the two dimensions. Each target moves from $x_t^s$ to $x_{t+1}^s$ according to the following discrete-linear equation

\[ x_{t+1}^s = F x_t^s + v_t^s \qquad (1) \]

where $v_t^s$ is a vector of probabilistic values describing the intrinsic noise in moving. $F$ is the state transition matrix that describes the moving and acceleration behaviour of the objects. The true value of $x_t^s$, however, is unknown; only measured observations $z_t^n$ are provided by the radar system. The following equation represents the assumed connection between these values:

\[ z_t^n = H x_t^s + w_t^s \qquad (2) \]
where the random sequence wst denotes the noise added by the sensor. Both,v st and wst are assumed to be white, zero-mean, Gaussian, and mutually independent, with covariance matrices Q = E{v st , v st T } and R = E{wst , wst T } for all values s and t. H is the observation matrix and describes how the real position xst is transformed to the observation z nt . This transformation is linear since our measurements are Cartesian. In Figure 1, our approach to represent uncertainty in object positions is visualized. Difficulties arise from unknown assignments A = A1:T = {at }Tt=1 of measurements to targets. The assignments are repret sented by random variables at = {ant }N n=1 that map each measurement number n ∈ [1 : Nt ] to one of the target number s ∈ [1 : S] by assigning ant = s. For determining the assignments, we have to solve the optimization problem arg maxX p(X |Z) which means to find the kinematical states that have the highest probability given a series of measurements. An efficient method for solving this problem is Expectation Maximization (EM) which finally leads to the iterative PMHT algorithm as proposed in [21]. Let l be the number of the current iteration. Each iteration consists of two steps: Expectation and Maximization. Expectation: In this phase we calculate posterior assignment weights wtns (l) := p(ant = s|z nt , xst (l)) representing the probability that a measurement z nt refers to the target s: π ns N (z nt ; Hxst (l), R) wtns (l) = S t . (3) ns n s πt N (z t ; Hxt (l), R) s =0
The weights are calculated for all scans of the current data window and for all targets with respect to all measurements of a certain scan. The expression N (y; μ, Σ) denotes the multivariate Gaussian density with random variable y,
290
A. Behrend et al.
expected value μ and covariance Σ. Each weight is governed by the distance between a particular measurement z nt and the current state estimate xst (l). The value πtns := p(ant = s) denotes the prior probability that a measurement belongs to a target. Target s = 0 is fictitious, representing clutter. ¯ st (l) of all Subsequently, the weights are used to form the weighted sum z measurements which leads to one synthetic measurement per target and a corresponding synthetic covariance for each time t: Nt
¯ st (l) z
=
wtns (l) z nt
n=1 Nt
n=1
wtns (l)
¯ s (l) = R t
R Nt n=1
(4)
wtns (l)
We assume the kinematic states satisfying the Markov property; that is, position and velocity at time point t depends only upon position and velocity at time point t − 1: p(xt |xt−1 , xt−2 , . . . , x0 ) = p(xt |xt−1 )
(5)
Combining the two equations in 4 using this assumption and calculating the product over all measurements, we get the probability of an assignment A for a given series of measurements Z and the iterated series of kinematic states X l : P (A|Z, X ) = l
Nt
n n lat n n=0 p(zt |xt )p(at |Nt ) n Nt la n t n t=1 at n=0 p(zt |xt )p(at |Nt )
T
(6)
lan nan Nt T N znt ; Hxt t , Rnt πt t lnan
= =: wt t n n Nt S la na n t n t t=1 t=1 n=0 n=0 s=0 N zt ; Hxt , Rt πt T
Nt
n=0
Maximization: In this phase each target track is updated by means of a ¯ st over Kalman Smoother that smoothes the tracks given by the synthetic values z s a time period [1 : T ] [5]. This leads to new, improved state estimates x1:T (l + 1) for each target s. The expectation and maximization phase are repeated until the state estimates do not considerably change anymore (convergence). 2.2
Sequential Likelihood-Ratio Testing via PMHT
The classical PMHT algorithm is not able to handle an unknown and constantly changing number of targets in the field of view (FoV). Therefore, the incorporation of a sequential likelihood-ratio test (LR) has been proposed by [27]. A sequential LR test is a statistical method that successively updates the ratio between two competing hypotheses. During each scan, the LR is compared with an upper and a lower threshold in order to decide which of the two hypotheses is more plausible: LR1 (t) =
l(H1 ) p(Z1:t |H1 ) p(z t |Z1:t−1 , H1 ) = = · LR(t − 1) l(H0 ) p(Z1:t |H0 ) p(z t |Z1:t−1 , H0 )
(7)
Detecting Moving Objects in Noisy Radar Data Using a Relational Database
291
For the purpose of track extraction we choose H1 as the hypothesis that one target exists in the FoV, and H0 as the opposite hypothesis that all measurements are false. The aim is to decide as quickly as possible between H1 and H0 (target or no target). To this end, the value LR1 (t) is compared with two predefined thresholds A and B for each scan t. – If LR1 (t) ≤ A, hypothesis H0 is accepted to be true. – If LR1 (t) ≥ B, hypothesis H1 is accepted to be true. – Otherwise the algorithm cannot yet decide and has to wait for the measurements z t+1 of the next scan to test LR1 (t + 1). This general scheme was first proposed by Wald [25]. The user has to preset the reliability of the algorithm by determining the thresholds A and B. Therefore, they have to set the related statistical decision errors to P1 := Prob( accept H1 | H1 ) and P0 := Prob( accept H1 | H0 ). P1 is the probability to correctly identify a really existing target. P0 , on the other hand, is the probability to wrongly assume the existence of a target that does not exist. The thresholds A and B depend on errors P1 and P0 as follows: A≈
1 − P1 1 − P0
B≈
P1 P0
(8)
The smaller the permitted error, the longer the user has to wait for a decision. For example, if P1 is chosen close to unity and P0 is chosen close to zero (corresponding to a certainty near 100%), infinite runtime would be observed. For integrating a sequential LR test into the PMHT framework, the following approximative formula has been derived showing how the ratio of scan t − 1 is connected with the ratio of scan at time t: ¯ t ) · |FoV| + π ¬d · p(Nt |H1 ) · LR1 (t − 1) , (9) LR1 (t) ∝ πtd N (¯ z t ; Hxt|t−1 , S t pF (Nt ) where πtd and πt¬d are the prior probabilities of detecting and not detecting ¯ t is the synthetic measurement of the target after the the target, respectively. z ¯ t denotes the synthetic innovation covariance after last PMHT iteration, and S the last PMHT iteration. Hence, the distance between the synthetic measurement and the current state estimate of the (candidate) target is exploited as a measure of target existence. In case of a false track, which is based on clutter measurements only, the synthetic measurement will usually be somewhere in the surroundings, probably not very close to the current estimate. In case of a truly existing and detected target the synthetic measurement will be close to the current estimate. |FoV| is the area of the field of view. pF (Nt ) is the probability of having Nt false measurements, which is assumed to be Poisson distributed. For p(Nt |H1 ) we obtain the following case differentiation
pF (0)P¬D Nt = 0 p(Nt |H1 ) = (10) pF (Nt )P¬D + pF (Nt − 1)PD Nt ≥ 1, with detection probability PD and P¬D := 1 − PD .
3 Implementing PMHT in SQL
In this chapter, we first present a rather straightforward implementation of the extended PMHT approach in SQL, which will be improved later on. The reason for using SQL is twofold: On the one hand, we want to show that probabilistic hypothesis tracking can be performed directly over a moving object database. On the other hand, we want to benefit from managing sensor data in a centralized database system for the following reasons:
– The amount and size of the data sets to be processed is substantial, such that the improved access methods of a DBMS are advantageous.
– In future applications, we intend to process data coming from different sources, e.g. other radar stations, laser scanners, sonar sensors etc. These data streams will have to be synchronized, and access conflicts have to be resolved.
– After system failures, the system ought to "roll back" to a consistent state.
– Results ought to be shared among multiple users.
– Persistent storage of sensor data may be indispensable as, for example, in military scenarios the history of actions must be documented.
As a DBMS is able to handle all these issues, it seems meaningful to perform a large amount of the computation needed directly over a database using SQL. Suppose positions and velocities detected by the sensor are stored in a table measurements containing positions x, y, velocity components vx, vy, time t, and a measurement number n (see Table 1).

Table 1. Table measurements
  t  n  x          vx       y          vy
  1  1  -0.00464   5.51584  49.99179   5.55454
  2  1  165.00612  5.58467  215.00303  5.53110
  3  1  329.97203  5.55485  379.99141  5.50254
  3  2  100.93443  0.34235  200.42435  0.23233
  4  1  495.00646  5.53123  545.00264  5.55647
  ...

The following view measCount calculates the value of Nt, which has been introduced in Subsection 2.1 and is employed, e.g., in Equations 9 and 10:

CREATE VIEW measCount AS
SELECT t, COUNT(*) AS Nt
FROM   measurement
GROUP BY t

The calculation of the prior probabilities πtd = p11 and πt¬d = p10 from Equation 10 is implemented using the view pi:

CREATE VIEW pi AS
SELECT t, IF(NT>1, PD+(PF*(1-PD)/Nt)/(2*PD+PF*(1-PD)/Nt)/FoV,
              PD+(PF*(1-PD))/(PD+PF*(1-PD)))/FoV AS p10,
          IF(NT>1, PD/(2*PD+PF*(1-PD)/NT),
              PD/(PD+PF*(1-PD))) AS p11
FROM   PMHT, measCount
It uses the simplified conditional statement IF, as allowed in many commercial SQL systems. The table settings is a single-tuple relation storing the predefined sensor characteristics PF and PD. The view pi can now be employed to calculate the Gaussian distribution of the estimation batch, which is the first factor of Equation 9:

CREATE VIEW ewGauss AS
SELECT m.t, e.n, log(GAUSS(m.x, e.x, m.y, e.y)*pi.p11 + pi.p10) AS lr
FROM   measurement AS m, estimate AS e, pi
WHERE  m.t = e.t AND pi.t = m.t

The function GAUSS is a user-defined function implementing the multivariate Gaussian distribution N(z̄, Hx, S̄):

CREATE FUNCTION GAUSS(x DOUBLE, mx DOUBLE, y DOUBLE, my DOUBLE)
RETURNS DOUBLE
BEGIN
  RETURN 1/(SQRT(2*PI()))*exp(-1/5000*(sqr(x-mx)+sqr(y-my)));
END;

The table estimate stores the results of the Kalman filter z̄. The application of logarithmic values is a workaround for using pure SQL, as the logarithm transforms the products into sums. This is necessary because SQL does not support product as
[Figure 2 shows a view hierarchy with the base tables measurement and settings at the bottom, the derived views measCount, Kalman, estimate, pi and ewGauss in between, and the view GLR_msrH1 at the top.]
Fig. 2. View hierarchy of the LR-calculation. f denotes a scalar function, Σ the SUM aggregate, # the COUNT aggregate and σ the selection.
an aggregate function. Despite this tricky workaround, note that the calculation of logarithms slows down the entire process and provides no general solution in the case of non-positive arguments. In Section 5, however, we will show how to simulate the aggregate function of the mathematical product using our incremental approach. We can now implement the logarithmic likelihood-ratio from Equation 9 using aggregation and a join:

CREATE VIEW GLR_MsrH1 AS
SELECT ewGauss2.t, SUM(ewGauss.lr) AS lr
FROM   ewGauss, ewGauss AS ewGauss2
WHERE  ewGauss.t > ewGauss2.t-3 AND ewGauss.t <= ewGauss2.t
GROUP BY ewGauss2.t

Figure 2 shows the view hierarchy of this calculation, indicating the employed scalar and aggregate functions. Note that this is only a small extract of the entire solution, but it already indicates the main principles of our approach.
4 Incremental Update Propagation
The idea of update propagation is to identify the changes of the derived view data induced by updates of the underlying base tables. As in many cases only a small portion of the view data is affected by a base table update, it seems reasonable to compute these changes incrementally and avoid the complete recalculation of the affected view. This approach seems to be especially useful in data stream systems, where the complete re-computation of continuous queries is generally infeasible. We will now briefly recall propagation rules for set operations and for aggregate functions used in our approach.

4.1 Propagation Rules
As an example, suppose the view Q is defined by the relational algebra (RA) expression Q = R ⋈ T. We denote the subset of inserted tuples into R by R+ and the subset of deleted tuples by R−. According to [16], the new state Qnew of relation Q resulting from a given insertion R+ can be incrementally calculated using

  Qnew = (R ∪ R+) ⋈ T = (R ⋈ T) ∪ (R+ ⋈ T) = Q ∪ Q+

Given the materialized values in Q, it is sufficient to calculate Q+ = R+ ⋈ T for determining the new state of Q. Assuming that |R+| ≪ |R|, the incremental update of Q could accelerate the computation of Qnew from (|R| + |R+|) · |T| to |R+| · |T|. In a similar way, all other RA operators (∪, ∩, \, ×, σ and π) can be incrementally maintained. An overview of the corresponding UP rules can be found in [16]. SQL supports aggregate functions operating on attributes of a set of tuples, e.g., SUM, AVG, COUNT. Some of these can be calculated incrementally in a similar way. For example, the SUM function can be incrementally maintained by adding the values of the inserted tuples

\[ \sum_{t \in (R \cup R^{+})} t.a = \sum_{t_1 \in R} t_1.a + \sum_{t_2 \in R^{+}} t_2.a \qquad (11) \]
if no deletions are considered. Again, the idea is to decompose the aggregate function into terms which represent the aggregated value of the old state, the aggregated value of the inserted tuples, and the value of the deleted tuples. Functions that can be decomposed into such components are homomorphisms on set union or set difference. Most aggregate functions in SQL are homomorphisms, except for MAX and MIN. For incrementally maintaining the results of an aggregate function with grouping, it has to be checked whether an insertion falls into an existing group or whether it forms a new group. In the first case, the corresponding group has to be updated, whereas in the latter case the aggregate function has to be re-calculated on the inserted values. In case of deletions, it has to be checked whether a deleted tuple leads to the removal of a group. This check can be done by using a helper attribute that counts the number of tuples that formed a group. If the count reaches zero, the group can be removed. Based on that, the expression AVG(R.a) + (AVG(R+.a) − AVG(R.a)) · COUNT(R+.a) / (COUNT(R.a) + COUNT(R+.a)) can be used for incrementally maintaining AVG(R.a).
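A minimal sketch of this incremental maintenance idea for grouped COUNT, SUM and AVG follows; the data structures and names are ours, for illustration only.

from collections import defaultdict

class IncrementalAggregate:
    def __init__(self):
        self.state = defaultdict(lambda: [0, 0.0])   # group -> [count, sum]

    def insert(self, group, value):
        cnt_sum = self.state[group]
        cnt_sum[0] += 1                              # COUNT += 1
        cnt_sum[1] += value                          # SUM  += inserted value

    def delete(self, group, value):
        cnt_sum = self.state[group]
        cnt_sum[0] -= 1
        cnt_sum[1] -= value
        if cnt_sum[0] == 0:                          # the group disappears
            del self.state[group]

    def avg(self, group):
        cnt, total = self.state[group]
        return total / cnt

if __name__ == "__main__":
    agg = IncrementalAggregate()
    for t, x in [(1, 2.0), (1, 4.0), (2, 10.0)]:
        agg.insert(t, x)
    print(agg.avg(1))      # 3.0, updated without rescanning the base table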
4.2 Invoking Update Rules
Since we propose to materialize derived data, the question arises how to apply the derived propagation rules from above. A possible way is the employment of triggers which use corresponding specialized update statements in their action parts, as proposed in [9]. Before updating a materialized view this way, all underlying tables have to be maintained in advance. For example, the view ewGauss in Figure 2 directly depends on the table measurements and the views estimate and pi. In order to be able to apply a trigger for incrementally updating ewGauss, the table and the materialized views have to be updated first. A possible solution to this problem is to stratify a given view hierarchy and to use a control trigger which ensures the successive evaluation of consecutive strata. In our approach, however, the trigger execution strategy of SQL has been used directly to ensure that the various update statements are executed in the correct order.
5 Incremental Hypothesis Testing
Since the materialized view concept is still heavily restricted in commercial database systems and not transparent to the user, we implemented our own approach, which even allowed us to simulate the missing aggregate function PROD. In our scenario, there are frequent insertions (e.g. every second) into table measurement, leading to induced updates in the view hierarchy of Figure 2. According to Equation 11, the view measCount can be incrementally maintained using the specialized update statement of the trigger new_meas:

CREATE TRIGGER new_meas BEFORE INSERT ON measurement
FOR EACH ROW BEGIN
  IF EXISTS(SELECT m.t FROM measurement AS m WHERE m.t=NEW.t) THEN
    UPDATE TBLmeasCount SET Nt=Nt+1 WHERE TBLmeasCount.t=NEW.t;
  ELSE
    INSERT INTO TBLmeasCount VALUES(NEW.t, 1);
  END IF;
  INSERT INTO TBLewGauss... (see below)
END;
The state variable NEW refers to the inserted tuple(s) of measurement. Before updating the dependent materialized view measCount, it is checked whether a group containing the attribute t already exists or not. In the first case, the counter is simply incremented, whereas in the latter case a new group is inserted. The same trigger is employed for updating ewGauss by additionally considering the update statement

INSERT INTO TBLewGauss
SELECT NEW.t, NEW.n, GAUSS2(NEW.x, e.x, NEW.y, e.y)*pi.p11 + pi.p10 AS lr
FROM   estimate AS e, pi
WHERE  e.t = NEW.t AND pi.t = NEW.t;

which is again specialized with respect to the newly inserted tuples given in NEW. Before this update is executed, however, the following trigger new_pi is activated by the modification of measCount for incrementally maintaining the prior probabilities in pi:

CREATE TRIGGER new_pi AFTER INSERT ON TBLmeasCount
FOR EACH ROW BEGIN
  INSERT INTO TBLpi
  SELECT NEW.t, PD+(PF*(1-PD))/(PD+PF*(1-PD))/FoV AS p10,
         PD/(PD+PF*(1-PD)) AS p11
  FROM PMHT;
END;

Since we have to consider both inserts and updates with respect to relation pi, the following trigger upd_pi has to be introduced in addition:

CREATE TRIGGER upd_pi AFTER UPDATE ON TBLmeasCount
FOR EACH ROW BEGIN
  DECLARE NT INT;
  SET NT = (SELECT NT FROM TBLmeasCount AS m WHERE NEW.t=m.t);
  UPDATE TBLpi, PMHT
  SET p10 = IF(NT>1, PD+(PF*(1-PD)/Nt)/(2*PD+PF*(1-PD)/Nt)/FoV,
               PD+(PF*(1-PD))/(PD+PF*(1-PD))/FoV),
      p11 = IF(NT>1, PD/(2*PD+PF*(1-PD)/NT),
               PD/(PD+PF*(1-PD)))
  WHERE TBLpi.t = NEW.t;
END;

In the last step, the aggregated table GLR_msrH1 has to be updated, which means calculating the product of the given probabilistic values. We would like to calculate the product directly instead of using the logarithmic function as a workaround. A product can be decomposed according to
\[ \prod_{r \in (R \cup R^{+} \setminus R^{-})} r.t = \prod_{r \in R} r.t \cdot \prod_{r^{+} \in R^{+}} r^{+}.t \;\div\; \prod_{r^{-} \in R^{-}} r^{-}.t \]

if no zero elements in R− are permitted (assumed in our approach by prohibiting zero probabilistic values). Following these considerations, the following trigger can be used for updating GLR_msrH1:
CREATE TRIGGER new_GLR_msrH1 AFTER INSERT ON ewGauss
FOR EACH ROW BEGIN
  IF EXISTS(SELECT NEW.t FROM GLR_msrH1 WHERE GLR_msrH1.t=NEW.t) THEN
    UPDATE GLR_msrH1, pi
    SET GLR_msrH1.lr = GLR_msrH1.lr*NEW.lr
    WHERE pi.t=NEW.t AND NEW.t>GLR_msrH1.t-3 AND NEW.t<=GLR_msrH1.t;
  ELSE
    INSERT INTO GLR_msrH1
    SELECT NEW.t, GLR_msrH1.lr*NEW.lr/ewGauss2.lr AS lr
    FROM pi, NEW, ewGauss AS ewGauss2, GLR_msrH1
    WHERE NEW.t-3-1 = ewGauss2.t AND ewGauss2.n = NEW.n
      AND GLR_msrH1.t = NEW.t-1 AND pi.t = NEW.t
    GROUP BY NEW.t;
  END IF;
END;
The trigger uses a decomposition of Equation 9 into two parts: the factor LR(t − 1), given in GLR_msrH1.lr, represents the old likelihood-ratio, whereas the new knowledge is given in NEW.lr. The condition NEW.t-3-1=ewGauss2.t within the subsequent insert statement identifies the tuple that "falls off" the sliding window and thus has to be excluded from the calculated value. In our approach, the sliding window at time t contains the three scans at times t − 2, t − 1, and t. Each of them comprises N radar measurements (cf. Equations 4 and 6). In order to accumulate all measurements of the last three scans, the clause GROUP BY NEW.t is applied within the insert statement.
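Stripped of the SQL details, the trigger implements a sliding-window product: the stored likelihood-ratio is multiplied by the factor of the newly inserted tuple and divided by the factor of the tuple that falls off the window. The following sketch is ours (not part of the implementation) and only restates this update rule, assuming non-zero factors as required by the decomposition above.

# Incremental maintenance of a product over a sliding window (Python sketch).
# lr is the stored product, new_factors enter the window (R+),
# old_factors leave it (R-); zero factors are not permitted.
def update_product(lr, new_factors, old_factors):
    for f in new_factors:
        lr *= f
    for f in old_factors:
        if f == 0:
            raise ValueError("zero factors make the product non-invertible")
        lr /= f
    return lr

lr = 0.8 * 1.2 * 0.9                     # product over the current three-scan window
lr = update_product(lr, [1.1], [0.8])    # factor 1.1 enters, factor 0.8 falls off
print(lr)                                # equals 1.2 * 0.9 * 1.1 (up to rounding)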
6 Performance Issues
In order to show how the calculation of the likelihood-ratio can be accelerated using our incremental approach, we measured the time needed for the naive and the incremental recomputation of the likelihood-ratio using different numbers of measurements N = 1, 100, 1000, 2000 per scan in MySQL. The data came from a rotating radar system with a turn rate of around 15 rounds per minute. Each measurement of a given scan was pushed into the MySQL system only after the naive or incremental evaluation of the previous measurement, respectively, had finished. The tests were run under Windows XP and MySQL 5.0 on a computer equipped with an Intel Core 6600 processor and 2 GB of RAM. A comparison of the performance results with respect to the finally determined likelihood-ratio is presented in Figure 3. The relation between the number of tuples and the naive recomputation time can be fitted well with t = 0.00034 · log10(N) · N², giving reason to believe that t = O(N² · log(N)). The run-time of the incremental method, on the other hand, can be fitted by a linear model (t = 0.0024 · N + 0.107). This is much more than sufficient for our practical scenario, where we have to process 400 measurements every 4 seconds due to the physical scan rate.
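As a quick sanity check (ours, using only the fitted models stated above), evaluating both models at the practically relevant N = 400 shows why the incremental approach comfortably meets the 4-second scan period while the naive recomputation does not:

import math

def t_naive(n):        # fitted model for naive recomputation (seconds)
    return 0.00034 * math.log10(n) * n**2

def t_incremental(n):  # fitted linear model for incremental maintenance (seconds)
    return 0.0024 * n + 0.107

n = 400
print(round(t_naive(n), 1))        # about 141.6 s, far above the 4 s scan period
print(round(t_incremental(n), 2))  # about 1.07 s, well below the 4 s scan period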
Fig. 3. Comparison of the two evaluation approaches with respect to the final view GLR_msrH1 (time in seconds over the number of tuples N, for the naive and the incremental approach). While the naive re-computation shows super-linear runtime, our incremental approach has linear runtime. Note the different scales of the time axis.
Fig. 4. Evaluation times for the auxiliary views pi and ewGauss (time in seconds over the number of tuples N, naive vs. incremental; panels: Recomputation of PI, Recomputation of ewGauss).
The difference in asymptotic behaviour is due to the following reason: in both cases, the views have to be maintained N times, once for each of the N updates. The naive algorithm completely recalculates the views each time, leading to a second factor of N. In order to investigate the reasons for the performance differences more deeply, we compared the naive and incremental recomputation of ewGauss and pi for the same numbers of measurements as above. Figure 4 presents the performance differences, showing the dramatic impact of our incremental approach for computing ewGauss. In this view, an expensive three-way join is computed, involving the views estimate and pi as well as the table measurement. In our incremental approach, the Gaussian distribution is computed only for the newly inserted measurement tuple. Thus, the redundant recomputation of the previous Gaussian values is avoided. In addition, the join computation in ewGauss is accelerated by indexing the materialized view values such that the search for join partners can be done in constant time, for example, by using hashing. In contrast, the view pi computes a simple join of the table PMHT and the view measCount, where the former contains just one tuple comprising the underlying sensor characteristics. Consequently, the speed-up of our incremental approach remains quite small for pi.
7 Conclusion
In this paper we presented a solution for seamlessly integrating a tracking algorithm into a moving object database system using SQL queries. We have shown that update propagation is indeed well-suited for efficiently implementing the PMHT approach and additionally allows for simulating the missing aggregate function PROD in SQL. Dealing with uncertainty is an important aspect in the field of moving object databases [10,23,24], but an SQL-based implementation for detecting unknown moving objects using hypothesis testing [5,7,21,26,27] has not been published so far. Incremental update propagation, however, has been investigated for almost thirty years, e.g., [16,18], and an SQL-based approach using triggers has already been proposed in [9]. Methods for incrementally maintaining views with aggregate functions have been proposed in [8,15]. Our own streaming system [6] and results from [11] showed the feasibility of incremental update propagation in a streaming context. The implementation of this paper supports the claim that even conventional SQL queries can be efficiently employed for analyzing a wide spectrum of data streams. This is interesting as it is widely believed by now that conventional database systems are not well-suited for dynamically processing continuous queries [4,12,20] and that dedicated stream processing systems such as STREAM [1] are needed. Although the performance results are very promising, it would be interesting to compare our approach with materialized view maintenance techniques in other systems such as Oracle or DB2. Additionally, object classification is another crucial problem in our radar surveillance system which we intend to solve in a similar way.
References
1. Arasu, A., et al.: STREAM: The Stanford Stream Data Manager (demonstration description). In: SIGMOD, pp. 665–665 (2003)
2. Abadi, D.J., et al.: Aurora: A Data Stream Management System. In: SIGMOD 2003, p. 666 (2003)
3. Abadi, D.J., et al.: An Integration Framework for Sensor Networks and Data Stream Management Systems. In: VLDB 2004, pp. 1361–1364 (2004)
4. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and Issues in Data Stream Systems. In: PODS, pp. 1–16 (2002)
5. Bar-Shalom, Y., Fortmann, T.E.: Tracking and Data Association. Academic Press, New York (1988)
6. Behrend, A., Dorau, C., Manthey, R., Schüller, G.: Incremental View-Based Analysis of Stock Market Data Streams. In: IDEAS, pp. 269–275 (2008)
7. Blackman, S.S., Popoli, R.: Design and Analysis of Modern Tracking Systems. Artech House, Boston (1999)
8. Chan, M., Leong, H.V., Si, A.: Incremental Update to Aggregated Information for Data Warehouses over Internet. In: DOLAP, pp. 57–64 (2000)
9. Ceri, S., Widom, J.: Deriving Production Rules for Incremental View Maintenance. In: VLDB, pp. 577–589 (1991)
10. Dellaert, F.: The Expectation Maximization Algorithm. Technical Report GIT-GVU-02-20, Georgia Institute of Technology, Atlanta, USA (2002)
11. Ghanem, T.M.: Incremental Evaluation of Sliding-Window Queries over Data Streams. IEEE Trans. on Knowl. and Data Eng. 19(1), 57–72 (2007)
12. Golab, L., Özsu, M.T.: Issues in Data Stream Management. SIGMOD Record 32(2), 5–14 (2003)
13. Griffin, T., Libkin, L.: Incremental maintenance of views with duplicates. In: SIGMOD 1995, San Jose, May 23-25, pp. 328–339 (1995)
14. Gupta, A., Mumick, I.S. (eds.): Materialized Views: Techniques, Implementations, and Applications. MIT Press, Cambridge (1999)
15. Gupta, A., Mumick, I.S., Subrahmanian, V.S.: Maintaining Views Incrementally. In: SIGMOD, pp. 157–166 (1993)
16. Manthey, R.: Reflections on Some Fundamental Issues of Rule-based Incremental Update Propagation. In: DAISD, pp. 255–276 (1994)
17. Madden, S., Franklin, M.J.: Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. In: ICDE 2002, pp. 555–566 (2002)
18. Qian, X., Wiederhold, G.: Incremental Recomputation of Active Relational Expressions. Knowledge and Data Engineering 3(3), 337–341 (1991)
19. Seltzer, M.: Beyond Relational Databases. Communications of the ACM 51(7), 52–58 (2008)
20. Stonebraker, M., Cetintemel, U.: "One Size Fits All": An Idea Whose Time Has Come and Gone. In: ICDE, pp. 2–11 (2005)
21. Streit, R., Luginbuhl, T.E.: Probabilistic Multihypothesis Tracking. Tech. Rep. NUWC-NPT/10/428, Naval Undersea Warfare Center, Newport, USA (1995)
22. Subramanian, S., et al.: Continuous Queries in Oracle. In: VLDB, pp. 1173–1184 (2007)
23. Tanner, M.A.: Tools for Statistical Inference. Springer, New York (1996)
24. Trajcevski, G., Wolfson, O., Hinrichs, K., Chamberlain, S.: Managing Uncertainty in Moving Objects Databases. ACM Trans. Database Syst. 29(3), 463–507 (2004)
25. Wald, A.: Sequential Analysis. John Wiley & Sons, New York (1947)
26. Wieneke, M., Koch, W.: The PMHT: Solutions for some of its Problems, pp. 1–12. SPIE (2007)
27. Wieneke, M., Koch, W.: On Sequential Track Extraction within the PMHT Framework. EURASIP Journal on Advances in Signal Processing (2008)
28. Wolfson, O., Xu, B., Chamberlain, S., Jiang, L.: Moving Objects Databases: Issues and Solutions. In: SSDBM, pp. 111–122 (1998)
Study of Dependencies in Executions of E-Contract Activities
K. Vidyasankar1,*, P. Radha Krishna2, and Kamalakar Karlapalem3
1 Department of Computer Science, Memorial University, St. John's, Canada, A1B 3X5 [email protected]
2 SET Labs, Infosys Technologies Limited, Hyderabad, India [email protected]
3 International Institute of Information Technology, Hyderabad, India [email protected]
Abstract. An e-contract is a contract modeled, specified, executed, controlled and monitored by a software system. A contract is a legal agreement involving parties, activities, clauses and payments. The goals of an e-contract include precise specification of the activities of the contract, mapping them into deployable workflows, and providing transactional support in their execution. Activities in a contract are generally complex and interdependent. They may be executed by different parties autonomously and in a loosely coupled fashion. They differ from database transactions in many ways: (i) Different successful executions are possible for an activity; (ii) Unsuccessful executions may be compensated or re-executed to get different results; (iii) Whether an execution is successful or not may not be known until after several subsequent activities are executed, and so it may be compensated and/or re-executed at different times relative to the execution of other activities; (iv) Compensation or re-execution of an activity may require compensation or re-execution of several other activities; etc. In this paper, we study the interdependencies between the executions of e-contract activities. This study will be helpful in monitoring behavioral conditions stated in an e-contract during its execution.
1 Introduction

An electronic contract, or e-contract in short, is a contract modeled, specified, executed, controlled and monitored by a software system. A contract is a legal agreement involving parties, activities, clauses and payments. The activities are to be executed by the parties satisfying the clauses, with the associated terms of payment. Consider, for example, a contract for building a house. The parties of this contract include a customer, a builder and a bank. The customer will get a loan for the construction from the bank. He will apply for a mortgage and work out details of payment to the builder (for example, direct payment from the bank to the builder after inspection of the work at multiple intervals). The builder will construct the house according
* This research is supported in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grant 3182.
to the specifications of the customer. The builder's activities include: (i) scheduling different works involved in the construction and procuring raw material, (ii) building the house as per the agreement, (iii) giving part of the work such as carpentry, plumbing and electrical work to sub-contracts, if any, (iv) receiving the payments from the bank, (v) making payments to his staff and sub-contract parties, if any, and (vi) handing over the constructed house to the customer.

The majority of contracts in the real world are documents that need to be gleaned to come up with specifications that are executed electronically. The execution can also be fairly complex. The goals of the e-contract include precise specification of the activities, mapping them into deployable workflows, and providing transactional support in their execution.

All the properties of database transactions are applicable to e-contract activities also. In database applications, atomicity is strived for in a (simple) transaction execution. That is, a transaction is executed either completely or (effectively) not at all. Given a non-null partial execution, the former is obtained by forward-recovery and the latter by backward-recovery. On successful completion, the transaction is committed. In multidatabase and other advanced database applications, transactions may be committed (locally) and then rolled back logically, by executing compensating transactions. This property is called compensatability. The property of repeatedly executing a transaction until successful completion is also considered; this is called retriability.

We have addressed transactional properties of e-contract activities in [8], with a multi-level composition model for the activities. We start with some basic activities and construct composite activities hierarchically. In the first level, a composite activity consists of basic activities; in the next level, a composite activity consists of basic and/or composite activities of level one; etc. The highest level activity will correspond to the "single" activity for which the contract is made. We call this the contract-activity. (We note that there could be multiple contracts for a single activity. For example, for building a house, there could be separate contracts between (i) the customer and the builder, (ii) the customer and the bank, (iii) the customer, bank and insurance company, etc. These contracts will be related. We consider this set of contracts as a part of a single high level contract whose contract-activity is building the house.) For the activity at each level, we consider successful execution, atomicity, compensatability, retriability, forward- and backward-recovery properties. We then define commitment of the activities (in fact, two notions of commitment, strong and weak) based on these properties. We do this uniformly, in the same way irrespective of the level of the activity. (All these properties are described in detail in Section 3.)

E-contract activities differ from database transactions in many ways: (i) Different successful executions are possible for an activity; (ii) Unsuccessful executions may be compensated or re-executed to get different results; (iii) Whether an execution is successful or not may not be known until after several subsequent activities are executed, and so it may be compensated and/or re-executed at different times relative to the execution of other activities; (iv) Compensation or re-execution of an activity may require compensation or re-execution of several other activities; etc.
These characteristics give rise to sophisticated interdependencies between executions of different activities. The dependencies deeply impact both the recovery and commitment aspects.

Figure 1 shows two components: the specification engine and the execution engine. The e-contract document is the basic input to the entire system. The specification engine extracts activity and clause specifications. These specifications are useful to generate workflow specifications and the multi-level composition model, and to derive the dependencies between activities. The dependencies dictate the recovery strategies. Using the audit trails provided by the log manager, the components of the execution engine ensure the atomicity of the executions of the e-contract activities.

Fig. 1. E-Contract Activity Commitment and Dependency System (components in the figure: e-contract document; specification engine with activity/clause specification, workflow specification, composition model, dependencies specification; execution engine with log manager, database, workflow engine, commitment engine, dependencies & recovery coordinator)

The rest of the paper is organized as follows. Section 2 describes some related work. We present the basic concepts related to our model in Section 3 and the model in Section 4. Section 5 presents the dependencies and Section 6 presents an example to illustrate the dependencies. Section 7 considers the general model and Section 8 summarizes the paper.
2 Related Work

Some of the dependencies identified in this paper are along the lines of those for database transactions given by Chrysanthis and Ramamritham in [2]. A few papers in the literature discuss the transactional properties of e-contract activities. Papazoglou [6] describes a taxonomy of e-business transaction features and presents a business transaction model that relaxes the isolation and atomicity requirements of ACID transactions in a loosely coupled environment consisting of autonomous trading partners. This paper also describes backward and forward recovery for long-running business transactions. Krishna et al. [5] consider activity-party-clause and activity-commit diagrams for modeling and monitoring e-contracts. These constructs are used to express the execution order and execution status of the contract that is being considered. Rouached et al. [7] present an event-based framework associated with a semantic definition of the commitments expressed in the event calculus to model and monitor multi-party contracts. Xu [10] proposes a pro-active e-contract monitoring system that is based on contract constraints and guards of the contract constraints to monitor contract violations. This paper represents constraints using propositional temporal logic in order to provide formal semantics for contract computation at the contract fulfillment stage. However, the formalism in this paper does not provide the
execution level semantics of an e-contract commitment. Wang et al. [9] describe a Business Transaction Framework based on Abstract Transactional Constructs, which provides a specification language for identifying and interpreting clauses in e-contracts. Grefen and Vonk [3] describe the relationship between transaction management systems and workflows for transactional business process support. Jain et al. [4] present a flexible composition of commitments, known as meta-commitments. These commitments are mainly associated with the role of a party and with ensuring whether a particular activity is committed or not. They do not relate the commitments to the execution states of an e-contract activity. To the best of our knowledge, dependencies based on the transactional properties of the execution of e-contract activities have not been studied in the literature.
3 Basic Concepts

In this section, we present the concepts and notations relevant for transactional properties in the context of e-contracts, and in the next section we present our model.

3.1 Basic Activities

Some activities are basic in our model. Typically, these are the activities which cannot be decomposed into smaller activities, or those that we want to consider in their entirety, and not in terms of their constituent activities. In an e-contract environment, whereas some basic activities may be executed 'electronically' (for example, processing a payment), most others will be non-electronic (for example, painting a door). We desire that all basic activities are executed atomically, that is, either not executed at all or executed completely. However, incomplete executions are unavoidable and we consider them in our model.

3.2 Constraints

Each activity is executed under some constraints. Examples of constraints are (i) who can execute the activity, (ii) when it can be executed, (iii) whether it can be executed within a specified time period, (iv) the cost of execution, (v) what properties need to be satisfied for an execution to be acceptable, and (vi) compensatability or other transactional properties. The first four constraints relate to workflow semantics. The last two relate to transactional semantics. In the following, we consider constraints related to transactional semantics only.

An execution of an activity that satisfies all the constraints specified for the execution of that activity at the time of its execution is called a successful termination, abbreviated s-termination, of that activity. The constraints themselves are specified in terms of an s-termination predicate, or simply, st-predicate. An execution which does not satisfy the st-predicate is called a failed termination, abbreviated f-termination.

For many activities, especially non-electronic ones, some acceptability criteria may be highly subjective and depend on the application environment. For example, consider the activity of building a wall. Quantitative aspects such as the dimensions of the wall, its location, etc. can be expressed easily. Smoothness of the finished surface and the extent of the roundedness of the corners will be application dependent. The requirements for a
wall in a children's hospital will be different from those for one in an army barrack. We propose that a predicate, termed property-predicate, be defined for each of the requirements and that the acceptability, that is, the st-predicate, be stated in terms of satisfying a Boolean expression of the property-predicates. Determining whether a property-predicate is satisfied or not in an execution will be left to the application semantics. Thus, the st-predicate for the construction of a wall could be (d AND s AND r) where d is the dimension predicate stating whether the dimensions of the wall are according to specifications, s is the smoothness predicate and r is the roundedness (of the corners) predicate. Then, an execution which does not satisfy one or more of these predicates will be an f-termination. Clearly, several different f-terminations are possible. As another example, the st-predicate for finishing a wall could be ((u AND o) OR (u AND w)) where u refers to an undercoat of painting, o is an overcoat with smooth finish and w is wall-papering. Here, two s-terminations are possible, one yielding a painted surface and the other with wall paper.

The constraints may change, that is, the st-predicate of an activity may change, as the execution of the contract proceeds. In the above example of building a wall, the required thickness of the wall may change from 6 inches to 8 inches, thus changing the dimension predicate. Such changes may invalidate a previous s-termination. When this happens, the execution needs to be adjusted. We note also that, with changes in the st-predicate, an earlier f-terminated execution may become an s-termination. It follows that we may not know whether a termination is an s-termination or an f-termination until some time later.

3.3 Compensatability

One of the ways an execution can be adjusted is by compensation, namely, nullifying the effects of the execution. Absolute compensation may not be possible in several situations. In some cases, the effects of the original execution may be ignored or penalized and the execution itself considered as compensated. Note that we do not attribute the compensatability property to an activity, but only to an execution of that activity. For the same activity, some executions may be compensatable, whereas others may not be. For example, when we book flight tickets we may find that some tickets are non-refundable, some are fully refundable, and some others partially refundable. Purchasing a fully refundable ticket may be considered to be a compensatable execution, whereas purchasing any other type of ticket could be non-compensatable. Thus, compensatability of the execution (purchasing a flight ticket) may be known only during execution, and not at the specification time of the activity. We look at compensation as a logical rollback of the original execution. Then, compensation may also involve execution of some other, compensating, activity.

3.4 Retriability

Another way of adjusting an execution is by retrying. By retriability, we mean the ability to get a complete execution satisfying the (possibly new) st-predicate. It is possible that the original execution is compensated fully and a new execution carried out, or the original execution is complemented, perhaps after a partial compensation, with some additional execution. An example of the latter is: a day after pouring concrete
for the foundation of a house, the thickness of the concrete may be found to be insufficient, and additional concrete is poured for the required thickness. Retriability may also depend on the properties of execution of other preceding, succeeding or parallel activities. Again, in general, some executions of an activity may be retriable, and some others may not be. We note that the retriability property is orthogonal to compensatability. That is, an execution may or may not be retriable, and, independently, may or may not be compensatable.

3.5 Execution States

We consider an execution of an activity with a specified st-predicate. On a termination, if we are not satisfied with the outcome, that is, the st-predicate of that activity is not satisfied, then we may re-execute. In general, several re-executions and hence terminations are possible. We assume the following progression of the states of the (complete or incomplete) terminations.

1. The termination is both compensatable and re-executable.
2. At some stage, the termination becomes non-compensatable, but is still re-executable. Then, perhaps after a few more re-executions, we get a termination which is either
   a. non-re-executable to get a complete s-termination, and we take this as an f-termination, or
   b. re-executable to get eventually a complete s-termination. We identify this state as non-compensatable but retriable.
3. Continuing re-executions in state 2.b, at some stage, we get a complete s-termination which is non-compensatable and non-re-executable.
It is also possible that an (un-compensated) execution remains in state 1 and never goes to state 2, and similarly an execution is in state 2.b, but never goes to state 3. We say that an execution in state 2.b is weakly committed, that is, when it is or has become non-compensatable, but is retriable. An execution in state 3 is strongly committed. We note that both weak and strong commitments can be forced upon externally also. That is, the execution can be deemed as (weakly or strongly) committed, for reasons outside of that execution. An example is payment to a sub-contractor for execution of an activity, and the non-obligation and unwillingness of the subcontractor to compensate (in case of weak commitment) or retry (in case of strong commitment) the execution after receiving the payment. We say also that an activity is weakly (strongly) committed when an execution of that activity is weakly (strongly) committed.
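The progression of execution states and the two commitment notions can be summarized as a small state machine. The following sketch is ours (not from the paper) and only encodes the transitions described in Section 3.5; the state names are shorthand for the numbered states.

from enum import Enum, auto

class ExecState(Enum):
    COMPENSATABLE_REEXECUTABLE = auto()  # state 1
    F_TERMINATED = auto()                # state 2.a: non-re-executable, taken as an f-termination
    WEAKLY_COMMITTED = auto()            # state 2.b: non-compensatable but retriable
    STRONGLY_COMMITTED = auto()          # state 3: complete s-termination, non-re-executable

# Transitions allowed by the progression; an execution may also stay in
# state 1 or 2.b indefinitely, and commitments may be forced externally.
ALLOWED = {
    ExecState.COMPENSATABLE_REEXECUTABLE: {ExecState.F_TERMINATED, ExecState.WEAKLY_COMMITTED},
    ExecState.WEAKLY_COMMITTED: {ExecState.STRONGLY_COMMITTED},
    ExecState.F_TERMINATED: set(),
    ExecState.STRONGLY_COMMITTED: set(),
}

def advance(current, target):
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

s = ExecState.COMPENSATABLE_REEXECUTABLE
s = advance(s, ExecState.WEAKLY_COMMITTED)    # the execution becomes non-compensatable
s = advance(s, ExecState.STRONGLY_COMMITTED)  # complete s-termination, no further retries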
4 Composition Model for Activities

In this section, we briefly describe our execution model for composite activities in e-contracts. Only the bottom-most level, where each composite activity is composed of basic activities, is considered here. We use bold font to denote compositions, and italics to denote their executions, that is, the composite activities.
Composition
- Composition C is a rooted tree. It is for an activity of a higher level composition U.
- An st-predicate is associated with C. This will prescribe the s-terminations of C.
- Nodes in the tree correspond to basic activities. They are denoted as a1, a2, etc.
- With each node in the tree, an st-predicate and a children execution predicate, abbreviated ce-predicate, are associated. The st-predicate specifies s-terminations of that activity. The ce-predicate specifies, for each s-termination of that node, a set of children which have to be executed. It will be null for all leaves of C. At non-leaf nodes: (i) more than one child may be required to be executed; (ii) in general, several sets of children may be specified with the requirement that one of those sets be executed; (iii) these sets may be prioritized in an arbitrary way; and (iv) the execution of children within a set may also be prioritized.
- We assume that the st-predicate and ce-predicate of each node in C are derived from the st-predicate of C.

Execution
- An execution of activity ai is denoted ai. An execution E of C yields a composite activity C, which is a sub-tree of C, called an execution-tree, such that:
  o It includes the root and some descendents;
  o Some nodes are (fully compensated) f-terminations; if a node is an f-termination, then all descendents of that node in the execution tree are also f-terminations; and
  o The execution of each s-terminated node satisfies the st-predicate prescribed for that node, and the non-f-terminated children of each non-leaf node of the sub-tree satisfy (fully or partially) the ce-predicate specified in C for that node.
- An s-termination of C is an execution of C such that the non-f-terminated nodes yield a sub-tree of C that contains (i) the root, (ii) some leaves of C and (iii) all nodes and edges in the paths from the root to those leaves.

Transactional Properties
- The execution of the entire composition C is intended to be atomic relative to U. That is, an execution of C should yield a complete s-termination or the null termination. Therefore, if an s-termination of an activity ai is not possible in some execution, then (that execution of ai is compensated and) execution of a different set of children satisfying the ce-predicate of its parent is tried. If unsuccessful, then a different s-termination of the parent is tried. If not, then similar adjustments at the grand-parent level are tried, and so on. Thus, either a complete backward recovery yielding the null termination or a partial backward recovery followed by forward execution to get an s-termination of C is carried out.
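To make the composition model concrete, the following sketch (ours; class and predicate names are illustrative and not taken from the paper) represents a composition node together with its st-predicate and ce-predicate, reusing the wall example of Section 3.2 for the st-predicates.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class CompositionNode:
    name: str
    # st-predicate: maps the property-predicate values observed for an
    # execution to True (s-termination) or False (f-termination).
    st_predicate: Callable[[Dict[str, bool]], bool]
    # ce-predicate: for an s-termination, the alternative sets of children
    # that have to be executed, in priority order (empty for leaves).
    ce_predicate: List[List["CompositionNode"]] = field(default_factory=list)

# st-predicate of "build a wall": (d AND s AND r), as in Section 3.2.
build_wall = CompositionNode("build wall",
                             st_predicate=lambda p: p["d"] and p["s"] and p["r"])

# st-predicate of "finish a wall": ((u AND o) OR (u AND w)).
finish_wall = CompositionNode("finish wall",
                              st_predicate=lambda p: (p["u"] and p["o"]) or (p["u"] and p["w"]))

# A hypothetical composite activity whose ce-predicate offers one child set.
wall_work = CompositionNode("wall work",
                            st_predicate=lambda p: True,
                            ce_predicate=[[build_wall, finish_wall]])

# An execution of "build wall" whose roundedness predicate is not satisfied:
print(build_wall.st_predicate({"d": True, "s": True, "r": False}))  # False, i.e. an f-termination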
5 Dependencies

We start with a detailed discussion of the backward recovery process and then consider dependencies. For simplicity, we cover the essential points with the special case
where the execution-tree is a path. Additional points that occur in the general case are discussed later.

Typically, the recovery will start with the notion of adjustment of the execution of aj, for some j ≤ i, where ai is the latest activity that has been or is being executed. If aj has to be compensated, then all activities in the execution following aj are also compensated, and a different child of aj-1 is chosen with possibly an updated ce-predicate at aj-1. If aj is re-executed, then aj+1 may need to be either compensated or re-executed. Continuing this way, we will find that for some k, k ≥ j, the activities in the sequence (aj, …, ak) are re-executed and those in (ak+1, …, ai) are compensated. This is illustrated in the bottom half of Figure 2. (The top half is explained later.)

In the above argument, the first activity ak+1 that needs to be compensated is determined after re-executing its preceding activity ak. It is quite possible, in some cases, that ak+1 is determined even before re-executing its predecessors. It is also possible that for some of the activities in (aj+1, …, ak), their previous executions are still valid, that is, no re-executions are necessary. We simply take this as requiring "trivial" re-executions.

In Figure 2, we note that if m is the largest index such that am is strongly committed, then j > m, and if n is the largest index such that an is weakly committed, then k+1 > n. This follows since, by the definitions of strong and weak commitments, executions of activities up to am cannot be retried and those up to an cannot be compensated. In the figure, an is not shown. It will be between am and ak+1.

Fig. 2. Partial backward-recovery in the Path model (labels in the figure: a1; am, last strong commitment point; strongly committed part; al; weakly committed part; aj, re-execution point; re-executed part; adjusted part; ak+1; compensated part; ai)

Table 1. Dependency-Table 1
                      aj: Compensate    aj: Weak Commit    aj: Strong Commit
ai: Compensate             √                  √                   √
ai: Weak Commit            ×                  √                   √
ai: Strong Commit          ×                  ×                   √
Several dependencies are possible between execution states of different activities.

I. In general, any of the compensation, weak commit and strong commit actions on one activity may require any of these three actions for another activity. Such dependencies are similar to the abort and commit dependencies in [2]. They are given in Dependency-Table 1. The '√' entries indicate the possibilities of the corresponding dependencies, and the '×' entries indicate impossibility. The relative positions of the nodes ai and aj are as in Figure 2, that is, ai is a descendent of aj. The entries in the table describe the dependencies of the type: "if a specified action is done on the execution of ai then a specified action has to be done on the execution of aj", and also the dependencies where the roles of ai and aj are reversed. Recall that the s- or f-termination status of an execution may be known only at a later time. Hence, with respect to Figure 2, it is possible that the f-termination of aj is known only after ai is executed. Thus, it makes sense to talk about how the actions on a node affect the executions of its descendents. Note also the following.
- We assume that both weak and strong commitments are in top-down order. Therefore, if ai is weakly committed, then aj must be weakly committed too if it has not been done already. The same applies to strong commitment.
- If aj is compensated, then ai must be compensated too.
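A small sketch (ours, not part of the paper) can encode the table so that the '√'/'×' entries become checkable; the first argument is the action on the descendant ai, the second the action on its ancestor aj, following the reading of Dependency-Table 1 reconstructed above.

COMPENSATE, WEAK_COMMIT, STRONG_COMMIT = "compensate", "weak commit", "strong commit"

# Possible ('√') and impossible ('×') combinations from Dependency-Table 1,
# keyed by (action on descendant ai, action on ancestor aj).
POSSIBLE = {
    (COMPENSATE, COMPENSATE): True,     (COMPENSATE, WEAK_COMMIT): True,     (COMPENSATE, STRONG_COMMIT): True,
    (WEAK_COMMIT, COMPENSATE): False,   (WEAK_COMMIT, WEAK_COMMIT): True,    (WEAK_COMMIT, STRONG_COMMIT): True,
    (STRONG_COMMIT, COMPENSATE): False, (STRONG_COMMIT, WEAK_COMMIT): False, (STRONG_COMMIT, STRONG_COMMIT): True,
}

def possible(ai_action, aj_action):
    # Commits are top-down and a compensated ancestor forces compensation of
    # its descendants, which is why the lower-left entries are impossible.
    return POSSIBLE[(ai_action, aj_action)]

print(possible(WEAK_COMMIT, COMPENSATE))    # False: ai weakly committed requires aj weakly committed
print(possible(COMPENSATE, STRONG_COMMIT))  # True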
II. Several dependencies which involve re-execution are also possible. We arrive at a general form in several steps.

1. In our formalism, a change in the st-predicate of an activity may change the status of its earlier execution from s- to f-termination and hence warrant either a re-execution to get a new s-termination or compensation. That is, a change in the st-predicate value can account for both retrying and compensation. Therefore, we can define dependencies of the form:
• An f-termination of an activity changes the st-predicate of another activity and, in fact, of several activities.
2. Secondly, recall that the st-predicate is a Boolean expression of property-predicates. Then an f-termination means that some of these predicates are not satisfied. Depending on the property-predicates that are not satisfied, several f-terminations are possible. We allow for each of these f-terminations to change the st-predicates of other activities, possibly differently. Therefore, we can expand the dependencies as follows:
• Each different type of f-termination of an activity changes the st-predicates of a set of activities in a specific way.
3. Dependencies involving s-terminations are also possible. We have seen that different s-terminations are possible. Each can affect other activities differently. Therefore, a general form of dependencies is: a specific (s- or f-) termination of an execution changes the st-predicates of a set of activities in a specific way. Note that this takes care of another case also: an execution of an activity ak may be an f-termination (with respect to the st-predicate prescribed for that activity) but, for some
reasons, we need to keep that execution. Then, the only way could be changing the st-predicates of some other activities, which in turn change the st-predicate of ak and make the current execution an s-termination.

III. We can also state dependencies of the following type: a specific (s- or f-) termination of an activity triggers compensation, weak commit or strong commit of executions of some other activities. (The top half of Figure 2 shows the possibility that compensation or re-execution of the activities in (aj, …, ai) may trigger weak and strong commits of some earlier executions.) The (compensate, re-execute, weak commit and strong commit) actions on ai change the st-predicates of some other activities.

The execution of an activity ai can be weakly committed any time, and then, after an s-termination, can be strongly committed any time. Weak commitment immediately after the s-termination gives the pivotal property in the traditional sense. Waiting until the end of the execution of the entire composite activity will give the compensatability and/or re-executability options until the very end. The longer the commitment is delayed, the more flexibility we have for adjustment on execution of the subsequent activities. However, as we have seen above, executions and commitments of some subsequent activities may also force the commitment of ai.

IV. Dependencies constraining the beginning of an execution of an activity can also be defined. For example, for activities aj and descendent ai, possible dependencies are: ai cannot begin execution until aj (i) s-terminates, (ii) weak-commits, or (iii) strong-commits. Note that our composition model assumes that the execution of ai cannot begin until the execution of aj begins.
6 Procurement Example

We illustrate some dependencies with an example, drawn from the contract for building a house explained in Section 1, that concerns the procurement of a set of windows for the house under construction. The order will contain a detailed list of the number of windows, the size and type of each of them and the delivery date. The type description may consist of whether part of the window can be opened and, if so, how it can be opened, insulation and draft protection details, whether it is made up of single glass or double glass, etc. The activities are described in the following. The execution-tree is simply a directed path containing nodes for each of the activities in the given order.

P1. Buyer: Order Preparation – Prepare an order and send it to a seller.
P2. Seller: Order Acceptance – Check the availability of raw materials and the feasibility of meeting the due date, and, if both are satisfactory, then accept the order.
P3. Seller: Arrange Manufacturing – Forward the order to a manufacturing plant.
P4. Plant: Manufacturing – Manufacture the goods in the order.
P5. Plant: Arrange Shipping – Choose a shipping agent (SA) for shipment of the goods to the buyer.
P6. SA: Shipping – Pack and ship goods.
P7. Buyer: Check Goods – Verify that the goods satisfy the prescribed requirements.
P8. Buyer: Make Payment – Pay the seller.

We describe several scenarios giving rise to different transactional properties.

1) Suppose that once the seller decides to accept the order, the order cannot be cancelled by the buyer or the seller, but modifications to the order are allowed, for example, the delivery date changed, the quantity increased, etc. If only the modifications that do not result in the non-fulfillment and hence cancellation of the order are allowed, then when the seller accepts the order, both P1 and P2 can be weakly committed. (On the other hand, if there is a possibility of the order getting cancelled, weak commitment has to be postponed. We do not consider this case any further in the following.)
2) There may be a dependency stating that the order can be sent to the manufacturing plant only after its acceptance by the seller, that is, the execution of P3 can begin only after P2 is weakly committed.
3) The plant may find that the goods cannot be manufactured according to the specifications, that is, P4 fails. Then the buyer may be requested to modify the order. For example, if the failure is due to inability to produce the required quantity by the due date, then the modification could be an extension of the due date or a reduction of the quantity or both. (A similar situation arises when the buyer wants to update the order by increasing the quantity.) This will result in a re-execution of P1 followed by a re-execution of P2. Then the past execution of P4 may be successful or a re-execution may be done. Weak commitments of P1 and P2 allow for such adjustments.
4) If the buyer finds that the goods do not meet the type specifications, that is, P7 fails, then P4 has to be re-executed. In addition, P5 and P6 have to be re-executed. (This situation may arise also when the plant realizes some defects in the goods and "recalls" them after they were shipped.) Here, the re-executions may consist of the buyer shipping back the already received goods to the plant and the plant shipping the new goods to the buyer. An example is: two of the windows have broken glasses and a wrong knob was sent for a third window. (The knob has to be fixed after mounting the window.) Then, replacements for the two windows have to be made (in P4), the damaged windows and the wrong knob have to be picked up and the new ones delivered, perhaps by the same shipping agent (in which case the re-execution of P5 is trivial).
5) The shipping agent is unable to pack and ship the goods at the designated time, that is, P6 fails. Then either the delivery date is postponed (an adjustment in the st-predicate of P1) or the plant may find another shipping agent, that is, P5 is re-executed. In the latter case, it follows that P6 will also be re-executed.
7 General Case

In the general case where the execution-tree is not a path, the dependencies and the partial rollback are similar to the path case. The difference is only in the complexity of the details. Partial backward-recovery of E will again consist of retrying the executions of some of the activities of the execution-tree and compensating some others. This is illustrated in Figure 3.
Fig. 3. Partial backward-recovery (the execution-tree is divided into a re-executed part and a compensated part)
All the dependencies discussed so far are applicable in the general case also, both for vertically (that is, ancestrally) and horizontally related activities. In addition, for horizontally related activities ai and aj, all combinations in Dependency-Table 1 are possible, that is, all entries will be '√'. Dependencies that involve ce-predicates are also possible. A general statement would be: a specific (s- or f-) termination, compensate, weak commit or strong commit action of an activity changes the ce-predicates of some other activities.

Procurement example revisited. In the example illustrated in the last section, suppose the seller splits the order into two parts and assigns them to two plants, Plant-A and Plant-B. Then the node P3 will have two children and its ce-predicate will contain the details of the individual orders. Corresponding to P4, P5 and P6, we will have P4-A, P5-A and P6-A for Plant-A, and P4-B, P5-B and P6-B for Plant-B. We describe a few scenarios and the resulting dependencies.

1) The seller may decide that shipping should not start until all the goods in the order have been manufactured. This gives rise to the dependencies: begin P5-A and P5-B only after both P4-A and P4-B s-terminate.
2) P5-A fails, that is, Plant-A is unable to find a shipping agent. Then, the shipping agent of Plant-B may be asked to ship the goods of Plant-A also. This may involve changing the st-predicate of P6-B if its execution has not been done, or re-execution of P6-B otherwise.
3) The buyer is not satisfied with the goods manufactured in Plant-A, that is, P7 fails. Then, the seller might settle for the buyer returning those goods, and for Plant-B manufacturing those goods and sending them to the buyer. This involves changing the ce-predicate at P3, compensation of P4-A, P5-A and P6-A, and re-execution of P4-B, P5-B and P6-B.

In the general multi-level model, the definitions are extended across multiple levels. The activities that are re-executed or rolled back would, in general, be composite activities, moreover ones executed by different parties autonomously. Therefore, the choices for re-execution and rollback may be limited, and considerable pre-planning may be required in the design phase of the contract. Due to paucity of space, we omit the details of the multi-level composite activities.
8 Summary

An e-contract system must ensure the progress of activities and their termination. Since e-contracts consist of multiple activities executed with several inter-dependencies, any failure could have cascading effects on other executed or executing activities. In this paper, we have brought out these dependencies explicitly and facilitated solutions that can be incorporated within an e-contract system. This study will be helpful in monitoring behavioral conditions stated in an e-contract during its execution.
References 1. Chiu, D.K.W., Karlapalem, K., Li, Q., Kafeza, E.: Workflow View Based E-Contracts in a Cross-Organizational E-Services Environment. Distributed and Parallel Databases 12(2/3), 193–216 (2002) 2. Chrysanthis, P.K., Ramamritham, K.: A Formalism for Extended Transaction Models. In: Proc. of the 17th Int. Conf. on Very Large Data Bases, pp. 103–112 (1991) 3. Grefen, P., Vonk, J.: A Taxonomy of Transactional Workflow Support. International Journal of Cooperative Information Systems 15(1), 87–118 (2006) 4. Jain, A.K., Aparicio IV, M., Singh, M.P.: Agents for Process Coherence in Virtual Enterprises. Communications of the ACM 42(3), 62–69 (1999) 5. Krishna, P.R., Karlapalem, K., Dani, A.R.: From Contracts to E-contracts: Modeling and Enactment. Information Technology and Management 6, 363–387 (2005) 6. Papazoglou, M.P.: Web Services and Business Transactions. World Wide Web: Internet and Web Information Systems 6, 49–91 (2003) 7. Rouached, M., Perrin, O., Godart, C.: A contract-based approach for monitoring collaborative web services using commitments in the event calculus. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, J.-Y., Sheng, Q.Z. (eds.) WISE 2005. LNCS, vol. 3806, pp. 426–434. Springer, Heidelberg (2005) 8. Vidyasankar, K., Radha Krishna, P., Karlapalem, K.: A Multi-Level Model for Activity Commitments in E-contracts. In: Meersman, R., Tari, Z. (eds.) OTM 2007, Part I. LNCS, vol. 4803, pp. 300–317. Springer, Heidelberg (2007) 9. Wang, T., Grefen, P., Vonk, J.: Abstract Transaction Construct: Building a Transaction Framework for Contract-driven, Service-oriented Business Processes. In: Dan, A., Lamersdorf, W. (eds.) ICSOC 2006. LNCS, vol. 4294, pp. 434–439. Springer, Heidelberg (2006) 10. Xu, L.: A Multi-party Contract Model. ACM SIGecom Exchanges 5(1), 13–23 (2004)
Object Tag Architecture for Innovative Intelligent Transportation Systems
Krishan Sabaragamu Koralalage and Noriaki Yoshiura
Department of Information and Computer Sciences, Saitama University, Saitama, 338-8570, Japan
[email protected], [email protected]
Abstract. Safety is the paramount reason for Intelligent Transportation Systems (ITS). There are three main actors in an ITS: users, vehicles and infrastructure. Although communication among these three actors is vital, there is still no common platform that enables extensive communication among them. That is one of the main reasons why fatalities occur. We therefore consider Radio Frequency (RF) identification as a candidate technology and develop a novel tag architecture called OTag (Object Tag) to enable communication among these actors, including vehicle to vehicle. In this paper we explain the OTag architecture and its protocol, which enable a common communication platform. Furthermore, access control mechanisms and the ability to be interoperable, stand-alone, self-describing, and plug-and-play are also described. Thus, we concentrate on how the OTag architecture will advance existing ITSs and create novel applications to support a safe, secure, comfortable and productive social life in an eco-friendly manner.

Keywords: Intelligent Transportation System, RFID.
1 Introduction

All over the world, millions of injuries in crashes and fatalities occur year by year. Not only the victims but also their families have to suffer from such fatalities. There is therefore a clear necessity to reduce such fatalities and injuries caused by the shortcomings of existing transportation systems. For the same reason, most governments are putting great effort into developing safe, secure, comfortable and productive ITSs in an eco-friendly manner.

ITS is all about improving the infrastructure for travelers or goods using information and telecommunication technologies. Therefore, ITS has become one of the most attractive and innovative research areas in the world. GPS-based navigation systems, electronic toll collection systems, speed surveillance camera systems, etc. are some of the latest ITSs in use.

ITSs focus on improving navigation systems, automated enforcement of speed limits and traffic signals, optimization of traffic management, increasing efficiency in road management, infrastructure-based collision warning systems, vehicle-based measures for crash avoidance, identification and prioritization of emergency vehicles, assistance for safe driving, electronic fee collection systems, support for
public transportation, increasing efficiency in commercial vehicle operations, and support for pedestrians. Since society requires the above mentioned applications, most governments welcome innovative and proprietary ITSs even without interoperability and standardization. However, our study found that most of the existing ITSs are proprietary and not interoperable, and also need to refer to databases almost all the time. This is adequate for several applications, but some need quick information and several abilities such as being stand-alone, self-describing, plug-and-play and interoperable, with role-based access control. Therefore, using ITSs to enhance a secure, safe, comfortable, and productive social life is a challenging task. In reality, it is desirable to have a safe, secure, comfortable, productive, standard, reliable, and interoperable infrastructure to support innovative ITS applications that ultimately could help to achieve autonomous transportation. For that it is necessary to manage many-to-many communications. In other words, communication among the three main actors, namely user (passengers/pedestrians), vehicle and infrastructure, should be facilitated, including vehicle to vehicle.

1.1 RFID and ITS

Among GPS, infrared, vision-assist and other distinctive technologies, Radio Frequency Identification (RFID) has become one of the best candidate technologies for ITSs because it allows non-contact, non-line-of-sight, long-distance object identification. Currently, there are three types of RFID tags: active, passive and semi-passive. In ITSs, this technology was first used for automated toll collection systems. In ITS applications, we believe that a well designed active tag can play a big role in achieving safety, security, productivity, comfort, and interoperability. Currently, active tags are being used to control access to restricted areas, toll collection, parking management and tracking of vehicles and goods.

1.2 Motivation, Goal and Objectives

The first motivation to develop a novel architecture for ITSs arose because there is no proper system to support emergency vehicles such as ambulances. The main reason is the lack of proper communication among the three actors: users, vehicles, and infrastructure, including vehicle to vehicle. Here, the passengers and the pedestrians are referred to as users. Therefore it is necessary to create a common platform for communication among the three actors. Such a common platform enables prioritizing emergency vehicles, changing traffic lights adaptively, prior notification of phase changes of traffic lights, understanding vehicle movements, detecting collisions before entering intersections or merging points, vehicle registration and ownership transfer, etc.

Though RFID is one of the best candidate technologies, the existing tags are not suitable to develop such a common platform because they are used to store only a unique ID. Additionally, they are unable to provide interoperability and the abilities of being stand-alone, self-describing, and plug-and-play with role-based access mechanisms. Unlike existing automated toll collection systems, if such a platform is developed with a novel tag, there will be many readers in different systems reading the tag information at a
given time. Such a tag should allow itself to be read only by authorized readers, depending on their rights. This cannot be achieved using the existing or previously proposed tag architectures. Instead, it is necessary to have several tags to allow being read by different systems. Additionally, attaching several tags to a vehicle for different systems to work is a burden for users. Therefore, existing tags cannot be used to develop a common communication platform. On the other hand, such a platform can be built with a single tag only if access control to the stored data can be provided depending on the roles or access privileges. Then it will be possible to subscribe to many services or to allow different readers to read the authorized portions of the tag, thereby reducing cost while enhancing manageability. Therefore, using one tag with the above features and secure communication protocols will lead to a novel era of ITSs, enabling ultimate infrastructure support for autonomous driving in the future. Nevertheless, there are no major weaknesses in current systems.

The goal of our research is to develop a novel RF tag architecture that supports and enhances as many ITS applications as possible. To achieve this goal we set the following three main objectives. The first is to design a novel radio frequency tag architecture with the ability to be intelligent, stand-alone, self-describing, and plug-and-playable while enabling role-based access control mechanisms. The second is to define security- and privacy-enhanced application layer protocols to communicate among users, vehicles and infrastructure, including vehicle to vehicle. The third objective is to derive an electronic vehicle registration and ownership transfer mechanism to improve the flow of tagged vehicles throughout their lifecycle.

The remainder of this paper is organized as follows. Section 2 describes related work on RFID and ITS. The OTag architecture is described in Section 3, whereas Section 4 explains how OTag helps to improve innovative ITSs. Finally, Section 5 concludes the paper with future work and remarks.
2 Related Works

Existing applications use conventional RFID tags and store only a unique number; the data related to the tag is taken from proprietary databases. Furthermore, interoperability and plug-and-play features are not supported.

2.1 Automated Toll Collection Systems

ETC, an adaptation of military "identification friend or foe" technology, aims to eliminate the delay on toll roads by collecting tolls electronically. It determines whether the cars passing are enrolled in the program, alerts enforcers for those that are not, and electronically debits the accounts of registered car owners without requiring them to stop [1,2,3,5]. Enforcement is accomplished by a combination of a camera which takes a picture of the car and a radio frequency keyed computer which searches for a driver's window/bumper mounted transponder to verify and collect payment. The system sends a notice and a fine to cars that pass through without having an active account or paying a toll [1, 5].

Norway has been the world's pioneer in the widespread implementation of this technology. ETC was first introduced in Bergen, in 1986, operating together with
traditional tollbooths. In 1991, Trondheim introduced the world's first completely unaided full-speed electronic tolling. Norway now has 25 toll roads operating with electronic fee collection (EFC), as the Norwegian technology is called (see AutoPASS). In 1995, Portugal became the first country to apply a single, universal system, the Via Verde, to all tolls in the country; it can also be used in parking lots and gas stations. The United States also uses ETC widely in several states, though many U.S. toll roads retain the option of manual collection [1,5]. Electronic toll collection systems rely on four major components: automated vehicle identification, automated vehicle classification, transaction processing, and violation enforcement [1,5].
2.2 Automated Parking Management Systems
Automated parking management systems offer great benefits to owners, operators, and patrons. The main benefits include reduced cash handling and improved back-office operations, high scalability, automatic data capture and detailed reporting, improved traffic flow at peak hours, improved customer service, cash-free convenience, and provisions for special privileges for VIP customers such as volume discounts, coupons, and other discounts for daily parkers [1,4,7]. RFID-enabled automated parking access control eliminates the customers' need to fumble for change, swipe cards, or punch numbers into a keypad. Vehicles move smoothly through controlled entrances and more parkers can be accommodated, thereby increasing revenue. Since there are no cards or tickets to read and sort, the whole system is a convenient, hands-free way to make vehicle parking easy [1,4,7]. TransCore is one of the pioneers in RFID-based parking access control; it uses its proven eGo® and Amtech®-brand RFID technology to deliver reliable, automated parking access control solutions. TransCore's systems are used for airport security parking and for parking at universities, corporations, hospitals, gated communities, and downtown facilities [1,4,7]. ActiveWave is another company providing such systems with RFID technology; in its system, surveillance cameras or video recorders can be triggered whenever a vehicle enters or exits the controlled area [1,4,7]. Although a number of companies provide RFID-based parking management systems, almost all of these systems are proprietary and offer no interoperability, and some rely on passive RFID. As described in Section 1, no stand-alone, self-describing, plug-and-play, interoperable, or access-control mechanisms are built into those tags. Hence, neither the existing nor the previously proposed RFID tags are suitable for the real requirements demanded by ITSs.
3 OTag Architecture
The OTag architecture involves two main parts: the tag design and the common communication protocols. OTag is designed to represent real-world objects in radio frequency (RF) tags. To the best of our knowledge, this is the first time the concepts of object-
oriented architecture have been used to develop an RF tag. An OTag has its own attributes and methods, so it can stand alone. OTags differ according to the expected characteristics and behaviors of the real-world objects they represent; for instance, the implementation of a vehicle OTag is different from that of a road symbol OTag. In other words, the attributes and methods are defined according to the meta class of the object: the attributes of a vehicle OTag differ from those of a road symbol OTag, and the implementations of the get and set methods may also differ. An OTag can be considered an instance of an object class; if vehicle is the meta class, the tags attached to all vehicles are instances of the vehicle object class. Furthermore, each attribute of an OTag carries an access modifier that manages access based on roles. The modifiers are categorized as public, friendly, protected, and private, and the attributes are designed by considering the characteristics and behaviors of the meta class. Note that the access modifiers used here are not exactly the same as in object-oriented programming: they can be regarded as roles, and one of the four modifiers is assigned to each attribute. With the help of these modifiers, an OTag serves four roles (public, friendly, protected, and private), which are used at different stages of the lifecycle of the object class, such as the lifecycle of the vehicle class. Public means no security: any reader can read any attribute value in the public area. Private, protected, and friendly attributes require secure access. Writing is also possible in these areas, but it is controlled both by keys and by memory types; for example, write-once memory is used for some attributes in the friendly area to prevent fixed attribute values from being updated, whereas rewritable memory is used for attribute values that change dynamically with the behavior of the object class. The roles are defined in detail in the following paragraphs.
Fig. 1. Instantiation process of a vehicle OTag from its meta class
Figure 1 represents the instantiation process. Any real-world object has its own characteristics and behaviors; in other words, each object has attributes and methods to access those attributes. A tag object has such attributes and methods, together with access modifiers. When a class is defined from an object, the class name is given; the OTag thus receives its class name (for example, vehicle) as a class-level attribute, plus the implementations of the get and set methods, both of which are defined when a fresh OTag class is created. In other words, when a vehicle-class OTag is created, it contains the vehicle attributes and the vehicle method implementations inside the tag.
Such a tag becomes an instance once the attribute values of a particular vehicle are filled in: attaching the OTag to a vehicle and populating its attribute values produces a proper instance of the vehicle OTag. Other objects, classes, and instances can be defined in the same way. Once an OTag instance is created, it becomes self-describing and stand-alone, using the features of object-oriented architecture and radio frequency technology. Role-based access is implemented using access modifiers, memory types, and encryption algorithms. The memory types used here are ROM, EPROM, and EEPROM, and the OTag controls write permissions along the process flow by using these three types of memory. ROM is used for read-only data, which is written when a fresh tag is manufactured. EPROM is used for write-once data, which concerns product information. EEPROM is used for all rewritable data, which may be changed at any stage of the product lifecycle by the relevant authorities; for example, an extension of the inspection validity period can be written into the rewritable attribute called "Inspection", as shown in Figure 2. Several object classes can exist in an ITS, such as Vehicle OTag, Informer OTag, Signal OTag, Lane OTag, Lot OTag, Road Symbol OTag, etc.
3.1 Logical Structure and Roles of Vehicle OTag
Hereafter, OTag is explained using the vehicle OTag instance. Figure 2 represents the logical structure of the vehicle OTag. For other types of OTags, such as the road symbol OTag, the attribute names and the get and set method implementations differ from those of the vehicle OTag; additionally, the oName attribute value is the class name, so the oName of a road symbol is "SYMBOL" instead of "VEHICLE". Other OTags can therefore be developed using the same architecture simply by changing the attribute name-value pairs and the implementations of the get and set methods according to the characteristics and behaviors of those classes. Figure 2 also illustrates the memory types, access modifiers, and respective keys of each role. The encryption algorithm used here is an AES-128 stream cipher. The OTag contains its methods, the data item "Initial", the AES-128 algorithm, and a processing module.
Fig. 2. Logical structure of an OTag instance in a vehicle
The OTag generates the nonce NT, whereas the reader generates the nonce NI and the two identifiers IDI and IDT in order to carry out proper mutual authentication (see the protocol notation in Section 3.2). In addition, the three role keys and the PIN are stored to ensure the security of the data. The four areas of the four roles are marked A, B, C, and D in Figure 2, and each area is used by different agents throughout the lifecycle of the vehicle. The four areas of the vehicle OTag are used as follows. Information stored in the public area can be read by any reader, whereas the friendly, protected, and private areas are secured for both reading and writing. Only the vehicle owner, who is granted the private role, can write in the public area. The private role is also allowed to read any attribute value of the owner's own vehicle OTag, although writing to protected attributes remains restricted. The protected area is used by a government authority, such as the department of motor vehicle registration. Attributes in the friendly area can be read or written only with the friendly key. For the vehicle class, the owner has to manage the friendly area in addition to the private area, and therefore keeps the private key, the friendly key, and the PIN; the private key and the PIN secure the ownership of the vehicle. If the owner needs to change an attribute value in the friendly area, such as the WalletID, the friendly key must be used. The public area stores the object class name, the type, an anonymous ID, the intention, and a customizable attribute called PblAttrib01. The information in this area is used to understand the object and its movements; identifying this public information enables applications such as collision avoidance, traffic signal management, vehicle-to-vehicle communication, road congestion management, road rule enforcement, and driving assistance systems. The protected area stores the color, model, manufactured year, frame number and type, engine, capacity, weight, size, license plate number, date of first registration, inspection validity, and tax payment status, along with two customizable attributes named PrtcdAttrib01 and PrtcdAttrib02. The information in this area is devoted to the vehicle governing authority; after the first registration, the data stored here can be manipulated only by the government authority, and protected information can be read only by the owner or the police. Inspection validity, insurance, tax, and similar attributes help to identify the current status of the vehicle. Recognizing illegal, fake, cloned, stolen, or altered vehicles, issuing fines for vehicles with expired inspections, verifying tax payments, and managing garage, temporary, and brand-new vehicles are some of the main applications of this area. The friendly modifier allows several services to be offered to the user in an effective manner. This area stores the pay ID or wallet ID, the rider mode, and two customizable attributes named FrndAttrib01 and FrndAttrib02. The information in this area can be used to subscribe to a variety of services provided by companies: any registered service provider can use it to provide convenient services to the vehicle user. Electronic fee collection systems, such as toll collection and parking, can use the WalletID to collect fees after prior registration with the relevant authorities, and emergency vehicles such as ambulances can use the RiderMode attribute value to obtain priority at traffic signals. The private area stores the owner's name and address and one customizable attribute named PvtAttrib01.
Information stored in this area proves the ownership of the vehicle. No one can read or write data in this area without the owner's permission. When the vehicle is sold, the ownership information is changed.
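To make the four areas concrete, the following Python sketch models the vehicle OTag layout described above and a role-based read check. The attribute groupings and role names follow the description of Figure 2; the rights table and the checking function are illustrative assumptions made for this example, not the OTag firmware.

# Illustrative sketch of a vehicle OTag's logical structure and of
# role-based read access over its four areas (see the text above).
MEMORY = {"ROM": "read-only", "EPROM": "write-once", "EEPROM": "rewritable"}

VEHICLE_OTAG_LAYOUT = {
    "public":    ["oName", "Type", "AnonymousID", "Intention", "PblAttrib01"],
    "friendly":  ["WalletID", "RiderMode", "FrndAttrib01", "FrndAttrib02"],
    "protected": ["Color", "Model", "ManufacturedYear", "FrameNumber",
                  "Engine", "Capacity", "Weight", "Size", "LicensePlate",
                  "FirstRegistration", "Inspection", "Tax",
                  "PrtcdAttrib01", "PrtcdAttrib02"],
    "private":   ["OwnerName", "OwnerAddress", "PvtAttrib01"],
}

# Roles allowed to read each area; the private role (the owner) reads all.
READ_RIGHTS = {
    "public":    {"public", "friendly", "protected", "private"},
    "friendly":  {"friendly", "private"},
    "protected": {"protected", "private"},
    "private":   {"private"},
}

def can_read(role: str, attribute: str) -> bool:
    """Return True if a reader holding `role` may read `attribute`."""
    for area, attributes in VEHICLE_OTAG_LAYOUT.items():
        if attribute in attributes:
            return role in READ_RIGHTS[area]
    raise KeyError(f"unknown attribute: {attribute}")

if __name__ == "__main__":
    print(can_read("public", "Type"))         # True  - anyone reads public data
    print(can_read("friendly", "WalletID"))   # True  - toll/parking services
    print(can_read("friendly", "Color"))      # False - protected area
    print(can_read("private", "Inspection"))  # True  - owner may read, not write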
Unlike a conventional RFID tag, an OTag can manage access according to four roles (public, friendly, protected, and private), and thus provides a role-based access control mechanism. In addition, the OTag eliminates the need to access a database by keeping its own data with it, which guarantees the stand-alone capability, and it assures the ability of self-description using the characteristics and behaviors of RF technology. Furthermore, interoperability is guaranteed by providing a common communication interface. Plug-and-playability is supported through these three properties, allowing any actor in an ITS to communicate with an OTag.
3.2 Communication Protocols
OTag has five main protocols: non-secure reading, secure reading, secure writing, key updating, and ownership transferring. Transferring the ownership of an OTag is a combination of the other protocols; it is used only in authorized centers, since specific readers carry out this task. The protocol notation is as follows, with the interrogator and the tag represented as I and T respectively.
Kprv - Private key (128 bits)
Sprv - Shared key or PIN (48 bits)
Kpro - Protected key (128 bits)
Kfrn - Friendly key (128 bits)
PIN - Personal identification number (48 bits)
NI - Nonce generated by the interrogator (40 bits)
NT - Nonce generated by the tag (40 bits)
IDI - Interrogator-generated ID (16 bits)
IDT - Interrogator-generated ID (16 bits)
Initial - Publicly defined initial message (16 bits)
R - Response value: attribute value, or successful/failed [1/0]
{M}K - Message "M" encrypted with key "K" using the AES-128 stream cipher algorithm
Non-secured (public) reading protocol. The non-secure reading protocol is used for public reading. Any reader can query the public attribute values by passing the attribute name to the OTag; for this, the attribute names of the given object class must be available to the reader. For instance, the attribute name list of the vehicle class should be available to interested readers, which can then query the "Type" attribute value from the OTag; the OTag answers with the value "CAR".
I → T: oName=?
T → I: oName="VEHICLE"
I → T: Type=?
T → I: Type="CAR"
Secured (private, protected and friendly) reading and writing protocol. Each role's attribute values must be managed securely, and reading is allowed only after successful mutual authentication. Here KIT denotes the encryption key used by both parties: KIT is the private key in private reading and the protected key in protected reading. Reading is granted only to authorized parties, and
they can read only the data that belongs to them. To ensure the security of each reading, the messages are encrypted with the corresponding role key.
I → T: oName=?
T → I: oName="VEHICLE"
I → T: Initial {Initial, NI, IDI, IDT}KIT
T → I: IDI {IDT, NI, NT}KIT
I → T: IDT {IDT, NT, 0, AttribName="color"}KIT
T → I: IDI {NI, IDT, R="WHITE"}KIT
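The following Python sketch simulates the message flow of this secured reading handshake. The encryption is abstracted behind placeholder encrypt()/decrypt() functions (a real tag would use the AES-128 stream cipher named above, whose mode and framing the paper does not specify), and the field widths follow the notation; the class and function names are assumptions made for the illustration.

import os

def encrypt(key: bytes, fields: tuple) -> tuple:
    # Placeholder for the AES-128 stream cipher: a fielded message
    # marked as encrypted under `key`.
    return ("enc", key, fields)

def decrypt(key: bytes, blob: tuple) -> tuple:
    marker, used_key, fields = blob
    assert marker == "enc" and used_key == key, "wrong role key"
    return fields

class VehicleOTag:
    def __init__(self, role_keys, data):
        self.role_keys = role_keys        # e.g. {"protected": b"..."}
        self.data = data                  # role -> {attribute: value}
        self.nt = None

    def challenge(self, role, blob):
        # Pass 2: decrypt the Initial message, echo NI, answer with NT.
        key = self.role_keys[role]
        initial, ni, idi, idt = decrypt(key, blob)
        assert initial == b"Initial"
        self.nt = os.urandom(5)                        # 40-bit nonce NT
        return idi, encrypt(key, (idt, ni, self.nt))

    def read(self, role, blob):
        # Pass 3: release the requested attribute only if NT matches.
        key = self.role_keys[role]
        idt, nt, flag, attrib = decrypt(key, blob)
        assert nt == self.nt and flag == 0
        return encrypt(key, (nt, idt, self.data[role][attrib]))

def secured_read(tag, role, key, attrib):
    # Interrogator side: generate NI, IDI, IDT and run the three passes.
    ni, idi, idt = os.urandom(5), os.urandom(2), os.urandom(2)
    _, reply = tag.challenge(role, encrypt(key, (b"Initial", ni, idi, idt)))
    _, ni_echo, nt = decrypt(key, reply)
    assert ni_echo == ni                               # tag authenticated
    answer = tag.read(role, encrypt(key, (idt, nt, 0, attrib)))
    return decrypt(key, answer)[2]

if __name__ == "__main__":
    kpro = os.urandom(16)
    tag = VehicleOTag({"protected": kpro}, {"protected": {"color": "WHITE"}})
    print(secured_read(tag, "protected", kpro, "color"))   # -> WHITE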
Secure writing protocol is described as follows.
I → T: oName=?
T → I: oName="VEHICLE"
I → T: Initial {Initial, NI, IDI, IDT}KIT
T → I: IDI {IDT, NI, NT}KIT
I → T: IDT {IDT, NT, 0, AttribName="walletID", R="123drt412G3"}KIT
T → I: IDI {NI, IDT, R="0/1"}KIT
Secure Key Updating. Unlike secure writing, the key to be updated must be confirmed before the actual update, because communication becomes impossible if a wrong key is set. The protocol therefore has two more passes than secure writing. Here KN denotes the new key value.
I → T: oName=?
T → I: oName="VEHICLE"
I → T: Initial {Initial, NI, IDI, IDT}KIT
T → I: IDI {IDT, NI, NT}KIT
I → T: IDT {IDT, NT, 1, AttribName="prvKey", R="KN"}KIT
T → I: IDI {IDI, AttribName="prvKey", R="KN"}KIT
I → T: IDT {IDT, Confirm="0/1"}KIT
T → I: IDI {NI, IDI, R="0/1"}KIT
Transferring the Ownership. This process is carried out by specific readers: all personal information of the predecessor is deleted and the ownership is updated to the successor without creating any security or privacy issues [8].
3.3 Lifecycle of OTag
As shown in Figure 3, step 1 creates fresh OTags according to the object class; each class has different attributes, matching the real-world object. In this step the tags contain only the attribute names "OName" and "AnonymousID". The OTag instance for the vehicle class is then created in step 2; it contains the attribute names relevant to the vehicle class, but only the values of "OName" and "AnonymousID" are filled in before the tag is attached to a vehicle. The value of "OName" is "VEHICLE" and that of "AnonymousID" is a random unique number. Next, the OTag is passed to the vehicle manufacturer.
Fig. 3. Flow of the OTag-embedded vehicle lifecycle
In step 3, vehicle manufacturers receive the vehicle OTag instances with empty attribute values except "OName" and "AnonymousID". The three role keys and the PIN are set to the manufacturer's secrets, and all the relevant instance attributes are fed into the OTag. For example, Toyota may request 1000 vehicle OTags to attach to brand-new vehicles and then feed in the instance attribute values such as vehicle type, frame number, engine, etc., as shown in Figure 4. The vehicles are then passed to the dealers after the ownership is transferred to them. In step 4, the vehicles in the dealers' possession contain all the information fed by the manufacturer, but the ownership information, role keys, and PIN are changed to the dealers' secrets. When a customer buys a brand-new vehicle, as in step 5, the first registration is carried out; at that time the protected role key is handed over to the vehicle governing authority.
Fig. 4. Sample flow of the vehicle OTag throughout its lifecycle
The friendly and private role keys, together with the PIN, are changed to the customer's secrets, as shown in Figure 4, and the ownership information is changed to the customer's own details. Since the customer holds the friendly key, which allows secure communication between a chosen service provider and the customer's own vehicle, services such as electronic toll collection, parking payments, and gasoline payments can be subscribed to easily. Steps 6 and 8 represent the usage of the vehicle: during the usage period the customer may extend the inspection period, pay taxes, and so on, and the relevant OTag attribute values are updated accordingly, as shown in Figure 4. When the customer wants to sell the vehicle, the ownership information is transferred using the readers of an authorized center. In step 9, if the vehicle is no longer needed, the registration can be cancelled in the same center and the vehicle is passed on for recycling, as shown in step 10. At de-registration time, only the minimum required information is kept and the other information, including personal data, is deleted to protect the security and privacy of the users, as shown in the de-registration tag information in Figure 4. The recycling company can then also use RF communication to improve its process of gathering information on recycling units. Note that each user's ability to write is further controlled by the three memory types: ROM, EPROM, and EEPROM.
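As an illustration of the key and data handovers just described, the following Python sketch models the lifecycle steps of Figures 3 and 4 (manufacturer, dealer, customer, de-registration). The attribute names, the "secrets" bundle, and the method structure are assumptions made for the example; the paper specifies only which party holds which keys at each stage.

import os

def new_secrets():
    """Fresh friendly/protected/private keys plus a PIN for one holder."""
    return {"Kfrn": os.urandom(16), "Kpro": os.urandom(16),
            "Kprv": os.urandom(16), "PIN": os.urandom(6)}

class VehicleOTagLifecycle:
    def __init__(self):
        # Steps 1-2: fresh vehicle-class instance, only oName/AnonymousID set.
        self.attributes = {"oName": "VEHICLE",
                           "AnonymousID": os.urandom(8).hex()}
        self.secrets = None

    def manufacture(self, instance_attributes):
        # Step 3: manufacturer fills instance data and sets its own secrets.
        self.attributes.update(instance_attributes)
        self.secrets = new_secrets()

    def transfer_ownership(self, new_owner_info, authority_protected_key=None):
        # Steps 4-5 and resale: the predecessor's personal data is removed,
        # then keys and PIN change to the successor's secrets. From first
        # registration onward the protected key belongs to the authority.
        for personal in ("OwnerName", "OwnerAddress", "WalletID"):
            self.attributes.pop(personal, None)
        self.attributes.update(new_owner_info)
        self.secrets = new_secrets()
        if authority_protected_key is not None:
            self.secrets["Kpro"] = authority_protected_key

    def deregister(self):
        # Steps 9-10: keep only the minimum information needed for recycling.
        keep = ("oName", "AnonymousID", "FrameNumber")
        self.attributes = {k: v for k, v in self.attributes.items() if k in keep}
        self.secrets = None

if __name__ == "__main__":
    authority_key = os.urandom(16)
    tag = VehicleOTagLifecycle()
    tag.manufacture({"Type": "CAR", "FrameNumber": "FR-0001", "Engine": "1.8L"})
    tag.transfer_ownership({"OwnerName": "Dealer Ltd."})
    tag.transfer_ownership({"OwnerName": "A. Customer"},
                           authority_protected_key=authority_key)
    tag.deregister()
    print(tag.attributes)   # only oName, AnonymousID, FrameNumber remain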
4 ITS Applications with OTags
Suppose that each vehicle carries an RF reader and a vehicle OTag. The OTag then acts as a brain while the interrogator acts as ears and mouth, making this arrangement an intelligent interface of the vehicle. OTag thereby enables a large array of applications that improve ITSs. As explained in Section 3, OTag inherits all the characteristics and behaviors of RF communication; in addition, it is intelligent, stand-alone, self-describing, plug-and-playable, and interoperable, with a role-based access system that allows it to communicate with different actors. OTag can be used to represent most of the agents in an ITS: each object that interacts with the ITS can be modeled as an OTag.
Fig. 5. Readers and tags in an ITS
Figure 5 illustrates the readers and tags in an ITS that are interested in communicating with the vehicle. External readers can read the OTags in vehicles, and the readers in a vehicle can read infrastructure OTags and passenger or pedestrian OTags. Equipping the infrastructure and the vehicles with OTags and interrogators opens up a whole range of possibilities. This paper concentrates on vehicle-to-vehicle and infrastructure-to-vehicle communication, explaining two examples for each case.
4.1 Identifying and Expressing the Intention of the Vehicle
Identifying the current position and the intended action, including the driving direction, is very important for avoiding fatal accidents. Each vehicle should therefore express its intended movements at least within a given block or time slot. The following section explains how a vehicle can express its intention so that collisions can be avoided. There is an OTag class called Informer. Informer instances are positioned on the center line of the road in a way that allows them to be read by vehicles running in both directions. These tags contain the next target, the current position, the current route, the next crossing lane, etc. As shown in Figure 6, once a vehicle passes the immediately preceding Informer OTag position 16.1, the intended action of that vehicle, "16.1GF", is sent to the 16.2 Informer OTag via the vehicle's reader, which requests the next target information; the Informer OTag then sends the next target information to the vehicle. If there is no previous Informer, the reader in the vehicle requests the yourPosition attribute of the first Informer OTag in order to determine the current position.
Fig. 6. Layout for identifying the intended movements of vehicles
When a vehicle passes the very first Informer, the interrogator in the vehicle learns the next immediate target. The interrogator then writes this information, together with the intended action, the running speed, the expected time of reaching the current immediate target, and the lane number if one exists, into the public area of the vehicle OTag. If there is no lane number, or no previous position information before the Informer, predefined "not applicable" dummy values are used to compose the intention attribute. Once the vehicle OTag is filled with the intention, it starts expressing the intended movements, as shown in Figure 6. This process continues until no further Informer tag is found.
Whenever any of the parameters used to compose this message changes, the message is recalculated and the vehicle OTag starts retransmitting the new message; for instance, if the speed changes, recalculation is performed and retransmission then restarts. The interpretation of a target message is represented in Figure 7. The intended action is divided into eight categories: Turn Left (TL), Turn Right (TR), Go Forward (GF), Go Backward (GB), U-Turn (UT), Hazard Stop (HS), Parked (PK), and Emergency Accident (EA).
Fig. 7. Target message expressed by a moving vehicle
The message in Figure 7 is interpreted as follows: a vehicle is moving along route number 16, has passed Informer position 16.0, and is heading forward (GF) to 16.1; its current estimated time of reaching 16.1 is 12:02:23:122 and it is running at a speed of 60 km/h in lane 0. Lane 0 means that road 16 has only one lane per direction.
4.2 Vehicle to Vehicle and Vehicle to Infrastructure Communication
As explained above, one vehicle can understand the intended movements of the surrounding vehicles from this message. Vehicle-to-vehicle communication is therefore possible, and collisions can be avoided when crossing an intersection or merging lanes. Although early detection is very important, guaranteeing avoidance is not an easy task; nevertheless, the consequences of collisions can be reduced.
How to Detect Collisions. Because of page limitations, we explain only collision detection at an intersection, although the same method can be used to avoid collisions when merging. Consider the vertical route to be route 17 and the horizontal route to be route 16, as shown in Figure 8. Suppose that vehicle V1 is moving towards the intersection and passes position 17.4 on route 17. V1 announces its intended movements in its message, as shown in the same figure: V1 is heading forward (GF) to its next immediate target 16.17, has passed 17.4 on route 17, and is expected to reach the target by 12:02:24:122. Similarly, the driver-assistance screen shows an image similar to Figure 8, explaining how the other vehicles are approaching and leaving the intersection, obtained by interpreting their intended-movement messages. Since the calculation is done in real time and is repeated whenever any of the parameters used to create the message changes, the reach time of each vehicle can be estimated accurately. If two or more vehicles are due to reach the same target at the same time, the probability of a collision is very high. Once the reader in a vehicle recognizes such a situation, the driver can be warned or asked to take precautionary steps to avoid the predicted collision.
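The following Python sketch shows one way a vehicle reader could represent these intention messages and flag a potential conflict of the kind just described. The on-air encoding is not given in the paper, so the message is modelled as a simple structured record (passed position, action code, next target, reach time, speed, lane); only the semantics follow the text, and the two-second window is an assumed threshold.

from dataclasses import dataclass
from datetime import datetime, timedelta

ACTIONS = {"TL", "TR", "GF", "GB", "UT", "HS", "PK", "EA"}

@dataclass
class Intention:
    vehicle: str
    passed: str        # Informer position just passed, e.g. "17.4"
    action: str        # one of ACTIONS
    target: str        # next immediate target, e.g. "16.17"
    reach_time: datetime
    speed_kmph: float
    lane: int

def conflicting(a: Intention, b: Intention,
                window: timedelta = timedelta(seconds=2)) -> bool:
    """Two vehicles heading for the same target at nearly the same time."""
    return (a.target == b.target
            and abs(a.reach_time - b.reach_time) <= window
            and a.action in ACTIONS and b.action in ACTIONS)

if __name__ == "__main__":
    t = datetime(2009, 9, 7, 12, 2, 24, 122000)
    v1 = Intention("V1", "17.4", "GF", "16.17", t, 60, 0)
    v2 = Intention("V2", "16.16", "GF", "16.17", t + timedelta(seconds=1), 45, 0)
    if conflicting(v1, v2):
        print("warn drivers: predicted collision at target 16.17")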
Fig. 8. Detecting collisions at intersections
In the case of turning vehicles, the message expresses their intentions by setting the composed value with TL or TR. Suppose that vehicle V1 is going straight and vehicle V2 is going to turn across V1's left: V1 must recognize the possible collision and take precautionary steps. If V2 is going to turn to V1's right side, no collision will occur. Quick decisions can thus be taken, and smooth, safe, and fast crossing of the intersection can be realized with OTag.
How to Prioritize Emergency Vehicles. Suppose there is a reader at the intersection traffic signal post and that it knows the friendly keys of the emergency vehicles. In an emergency situation, the OTag in the emergency vehicle can change its RiderMode attribute value to "URGENT" to express its urgency. When an emergency vehicle approaches the intersection, the public attribute values Intention and Type describe its intention and type to the surrounding readers, so the reader at the signal post can identify the vehicle type and intention of the emergency vehicle. If the vehicle type corresponds to one of the four emergency vehicle types (ambulance, rescue, fire brigade, or police), the signal post reader checks the urgency by reading the emergency vehicle's RiderMode attribute after carrying out proper authentication using the registered friendly keys. If the verification succeeds, the signal post reader understands the situation and switches the light to green, according to the intended moving direction of the emergency vehicle, so that it can pass the intersection as fast as possible. At the same time, the drivers of normal vehicles are informed that an emergency vehicle is approaching, using the vehicle-to-vehicle communication described above, so they can cooperatively help the emergency vehicle pass without tension, delay, or accident. No vehicle can use a replayed message to impersonate an emergency vehicle, since the protocols use nonces; additionally, since the registered friendly keys cannot easily be discovered, impersonation is very difficult. Similarly, forward secrecy can be achieved by changing the friendly keys and re-registering in case of failure.
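The following Python sketch outlines the signal-post decision logic just described: read the public Type and Intention attributes, and only for recognised emergency types authenticate with the registered friendly key and check RiderMode. The stub classes, the read_public()/read_friendly() calls (standing in for the Section 3.2 protocols), and the phase-switching call are assumptions made for the illustration.

EMERGENCY_TYPES = {"AMBULANCE", "RESCUE", "FIRE BRIGADE", "POLICE"}

def direction_of(intention: str) -> str:
    # The intention string carries the intended action (e.g. "...GF...");
    # only a coarse direction is extracted for this example.
    for code in ("TL", "TR", "GF", "GB", "UT"):
        if code in intention:
            return code
    return "GF"

def prioritise(signal, vehicle, registered_friendly_keys):
    vtype = vehicle.read_public("Type")
    intention = vehicle.read_public("Intention")
    if vtype not in EMERGENCY_TYPES:
        return False
    key = registered_friendly_keys.get(vtype)
    if key is None:
        return False
    # Mutual authentication with the registered friendly key, then read
    # RiderMode from the friendly area (secured reading protocol).
    if vehicle.read_friendly("RiderMode", key) != "URGENT":
        return False
    signal.set_green(direction_of(intention))           # let the vehicle pass
    signal.broadcast("emergency vehicle approaching")    # warn other drivers
    return True

class StubVehicle:
    def __init__(self, public, friendly, friendly_key):
        self._public, self._friendly, self._key = public, friendly, friendly_key
    def read_public(self, attrib):
        return self._public[attrib]
    def read_friendly(self, attrib, key):
        if key != self._key:
            raise PermissionError("authentication failed")
        return self._friendly[attrib]

class StubSignal:
    def set_green(self, direction): print(f"green for {direction}")
    def broadcast(self, msg): print(msg)

if __name__ == "__main__":
    key = b"friendly-key"
    ambulance = StubVehicle({"Type": "AMBULANCE", "Intention": "16.1GF..."},
                            {"RiderMode": "URGENT"}, key)
    prioritise(StubSignal(), ambulance, {"AMBULANCE": key})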
How to Enhance the Movements at an Intersection. Traffic volumes that vary between peak hours and midday make it very difficult to tune traffic signals, and areas that experience heavy congestion need signal-timing improvements to achieve effective traffic flow as well as better air quality and fuel consumption. Several methods are currently used to detect and count the vehicles approaching an intersection, and several systems are capable of monitoring traffic arrivals and adjusting timings based on the detected inputs. Traffic detectors range from metal detectors and infrared readers to image detectors; metal detectors are the most widely used, although they provide minimal information, while image detection devices exhibit numerous problems, including degradation in bad weather and poor lighting. Now consider a reader with a reading radius of 100 meters installed on the central traffic signal post of the intersection, with all vehicles carrying OTags able to express their intended movements. By reading these messages, the reader in the traffic signal post learns the number of vehicles in its range, their intended directions, and the numbers of each vehicle type, including the presence of emergency vehicles or public transportation units such as buses. Depending on the policy of the country, a traffic-optimization algorithm can be implemented on top of this knowledge. Unlike conventional intelligent traffic systems, extra knowledge can be mined and used to optimize the traffic signal adaptively, because OTag provides detailed information for taking better decisions.
Additionally, if a tag is installed in the signal itself and set to describe the starting and ending phases of a transition to incoming vehicles, stopping can be made smooth and misinterpretations can be minimized.
Other Possibilities. As traffic signs provide the driver with various information for safe and efficient navigation, one important direction is representing traffic signs on OTags. Automatic recognition of traffic signs can then support automated driving or driver assistance systems: the reader in the vehicle can understand a road sign and explain it to the driver or control the vehicle according to its instructions. OTag can also be used to control the speed of a vehicle according to the road conditions or the weather. Similarly, there are scenarios where a vehicle tries to enter a single lane without seeing an oncoming vehicle because of buildings and the like; in such situations, unnecessary vehicle movements that create traffic congestion can be mitigated with minimal infrastructure. In addition, electronic vehicle registration, ownership transfer, and the enforcement of rules and regulations can be automated without human intervention, and many more innovative applications can be developed, as OTag also has a user memory area.
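Returning to the adaptive signal-timing idea above, the following Python sketch shows one possible policy built on the information a signal-post reader can collect: counts of approaching vehicles per intended direction, with extra weight for buses and emergency vehicles. The weights and the proportional split are illustrative policy choices, not part of the paper.

from collections import Counter

WEIGHTS = {"CAR": 1.0, "BUS": 2.5, "AMBULANCE": 10.0, "TRUCK": 1.5}

def green_time_split(readings, cycle_seconds=90, minimum=10):
    """readings: list of (intended_direction, vehicle_type) within range."""
    demand = Counter()
    for direction, vtype in readings:
        demand[direction] += WEIGHTS.get(vtype, 1.0)
    total = sum(demand.values()) or 1.0
    spare = cycle_seconds - minimum * len(demand)
    return {d: round(minimum + spare * w / total) for d, w in demand.items()}

if __name__ == "__main__":
    readings = [("N-S", "CAR")] * 14 + [("N-S", "BUS")] * 2 + \
               [("E-W", "CAR")] * 5 + [("E-W", "AMBULANCE")]
    print(green_time_split(readings))   # more green time to the busier/priority side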
5 Concluding Remarks
OTag has been introduced to improve existing intelligent transportation systems by enabling a large array of novel applications. The architecture of OTag, its communication protocols, its capabilities, and its usage have been explained. Collision detection at intersections and merging points, traffic signal prioritization for emergency vehicles, and intersection traffic signal improvements have been discussed under vehicle-to-vehicle and vehicle-to-infrastructure communication.
In future work we will release the designs of the other infrastructure tags, which comply with OTag and use the same protocols, thereby demonstrating interoperability, self-description, the ability to stand alone, and plug-and-playability with role-based access control mechanisms.
References
1. ITS, http://www.esafetysupport.org/, http://www.ewh.ieee.org/tc/its/, http://www.ertico.com/ (accessed February 2009)
2. Active RFID Tag Architecture, http://www.rfidjournal.com/article/view/1243/1/1 (accessed February 28, 2009)
3. Toll collection systems, http://en.wikipedia.org/wiki/List_of_electronic_toll_collection_systems (accessed February 28, 2009)
4. RFID Parking Access, http://www.transcore.com/wdparkingaccess.html (accessed March 02, 2009)
5. ITS Japan, http://www.mlit.go.jp/road/ITS/2006HBook/appendix.pdf (accessed February 20, 2009)
6. e-Plate and RFID enabled license plates, http://www.e-plate.com (accessed July 2008)
7. Safety Applications of ITS in Europe and Japan. International Technology Scanning Program report, American Association of State Highway and Transportation Officials, National Cooperative Highway Research Program
8. Sabaragamu Koralalage, K.H.S., Selim, M.R., Miura, J., Goto, Y., Cheng, J.: POP Method: An Approach to Enhance the Security and Privacy of RFID Systems Used in Product Lifecycle with an Anonymous Ownership Transferring Mechanism. In: Proc. SAC, pp. 270–275. ACM Press, New York (2007)
Conceptual Universal Database Language: Moving Up the Database Design Levels
Nikitas N. Karanikolas¹ and Michael Gr. Vassilakopoulos²
¹ Department of Informatics, Technological Educational Institute of Athens, Greece, [email protected]
² Department of Computer Science and Biomedical Informatics, University of Central Greece, Lamia, Greece, [email protected]
Abstract. Today, the simplicity of the relational model types affects Information Systems design. We favor another approach where the Information System designers would be able to portray directly the real world in a database model that provides more powerful and composite data types, as those of the real world. However, more powerful models, like the Frame Database Model (FDB) model, need query and manipulation languages that can handle the features of the new composite data types. We demonstrate that the adoption of such a language, the Conceptual Universal Database Language (CUDL), leads to higher database design levels: a database modeled by Entity-Relationship (ER) diagrams can be first transformed to the CUDL Abstraction Level (CAL), which can be then transformed to the FDB model. Since, the latter transformation has been previously studied, to complete the design process, we present a set of rules for the transformation from ER diagrams to CAL.
1 Introduction and Motivation
A common methodology still used today in the design of relational databases is to specify a set of attributes in (usually) one universal relation together with a set of functional dependencies, and then to decompose this set of attributes into smaller relations, each consisting of a subset of the original attributes, in order to eliminate update anomalies and reduce data redundancy. An underlying assumption of data analysis methodologies is that we do not know the structure of the information that we are called to capture in an Information System. For this reason we begin with interviews of the persons involved in the operation of a non-computerized system, and from this process we arrive at a series of fundamental information elements (attributes); the correct correlation and grouping of these fundamental attributes is a subsequent stage of data analysis. This assumption and technique of data analysis has also influenced the design of relational databases. Normalization is the process that aims to produce the best possible result from a (by nature) weak data model: "the relational model is limited with respect to semantic content (i.e., expressive power) and there are many design problems which are not naturally expressible in terms of relations" [18]; "The relational model is weak when showing many-to-one relationships" [16].
The real world that we are called to impress with an Information System (often with the use of a database) seldom incorporates repetitions of data (data redundancy). As an example, consider a non-computerized managed library, where we do not incorporate multiple series of books (copies) and a corresponding number of bookshelves, in order to place the first set of book copies sorted according to the title, the second set of book copies sorted according to the first writer, the third set of book copies sorted according to the theme category, etc. On the contrary, we do not use any ordering (more precisely we use the ordering of books according to the date of import into the library) or use some of the orderings that interest us (usually thematic) and in addition we create indexes (with cards) for every one of the ordering that interest us. Each card contains the key of the classification and a reference to the natural ordering (the location of book in the bookshelves). Thus, we claim that the Information System design should not decompose the real world (that we were called to impress in an Information System) in its fundamental characteristics and afterwards to proceed with simple compositions of characteristics that relational model allows. We choose another approach, where the Information System designers are able to portray directly the real world in a model that provides more powerful structures, as those of the real world. Section 2 provides a review of such models. Some of them could be used instead of the relational model. However, having more powerful database models, but still using data manipulation languages designed only for fundamental data attributes, is a waste of time and resources. Therefore, in Section 3, we argue that there is a necessity for a database query and manipulation language able to manipulate directly the composite (real world) data types. Section 4 gives details of the Conceptual Universal Database Language (CUDL), a language presented in previous work [9,10,11,21], which satisfies the mentioned necessity. In Section 5, we argue that the use of CUDL leads to higher database design levels: a database modeled by Entity-Relationship (ER) diagrams can be first transformed to the CUDL Abstraction Level (CAL), which can be then transformed to the Frame Database Model (FDB) model [19,20], before being implemented at the physical level. The transformation from CAL to FDB has been studied in previous work [9,11]. In order to complete the design process, in Section 6, we present a set of rules, dealing with the most common cases that appear in real life applications, for the transformation from ER diagrams to CAL.
2 Background
2.1 Generic Data Modeling
The Generic Data Modeling approach [7,4] is an outcome mainly of research in the Medical Informatics domain. Medical Informatics researchers were concerned by the fact that, in the case of health-care data maintenance, the amount and complexity of the information lead to a huge, labyrinthine conceptual schema. Moreover, the fact that directly producing a logical schema for a relational DBMS from such a huge conceptual schema obviously preserves this labyrinthine character gave rise to research into alternative data modeling approaches. Another inherent limitation of relational logical schemata is the difficulty of supporting data evolution
(changing information needs). Research on both problems (the labyrinthine conceptual and logical schema, and the difficulty of data evolution in relational data) led to the generic data modeling approach. This approach defines two generic transformations, namely "flattening" and "relation merging"; the latter transformation ("relation merging") is also the basis for supporting data evolution. When applied to the original conceptual schema, these transformations produce a generic logical schema consisting of a reduced number of tables. However, this process is not a strict procedure: it depends on the personal perception and the choices of the person who guides the process and applies the generic transformations. The final number of tables resulting from the transformation is therefore dependent on the choices made and is not a concrete (predefined) set of tables (as happens in other cases, e.g. in FDB data modeling). The disadvantage of Generic Data Modeling is that querying the resulting generic logical schema with standard SQL requires multiple statements and considerable intellectual effort, especially when the queries are intended to retrieve data for data analysis tasks (e.g. feeding data mining applications). To overcome such difficulties, researchers have defined the Extended Multi-Feature (EMF) SQL extension [8], which provides simpler, more compact, and more efficient query constructions.
2.2 EAV Data Modeling
Entity Attribute Value (EAV) data modeling [14,15] is also an outcome of research in the Medical Informatics domain. One motivation for the research that produced EAV data modeling was that, in the medical domain, the number of parameters (facts) that potentially apply to any clinical study is vastly larger than the number of parameters that actually apply to an individual clinical study; for example, the set of laboratory examinations that a patient could potentially undergo is a huge superset of the examinations actually performed in a specific medical case (e.g. a patient suffering from a gallstone). Another motivation was that clinical studies evolve as a result of medical research; consequently, the set of clinical parameters related to a clinical study keeps changing (and, in most cases, growing). The data model should therefore be able to host new clinical parameters for any clinical study without requiring a reorganization of the data structure. The research motivated by these reasons produced EAV data modeling. In an EAV design, metadata and data tables compose the logical database schema. The facts that actually apply to a clinical study are recorded in the data tables as triplets: the Entity, the Attribute, and the Value. The Attribute is the recorded fact (clinical parameter) and the Entity is a composition of the relevant patient's identifier and a timestamp. The metadata tables define the data composition (which clinical parameter, i.e. Attribute, pertains to which clinical study). There are three main versions of EAV data modeling [3], but all of them share the same basic principle (the triplet Entity/Object, Attribute, Value). Another interesting feature of EAV models is that they permit a mixture of EAV-stored and conventionally stored data.
However, this heterogeneity significantly complicates the task of data querying. We should also mention that EAV data modeling supports fact evolution (equivalent to adding columns to a relational table without any reorganization) but does not support table (entity) evolution.
2.3 FDB Data Modeling In previous works (by Yannakoudakis et al.) [19,20] there has been an investigation of dynamically evolving database environments and corresponding schemata, allowing storage and manipulation of a variable number of fields per record, variable length of fields, composite fields (fields having subfields), multiple values per field (either atomic or composite), etc. The ultimate goal of the research work of Yannakoudakis et al. was to make the design and maintenance of a database a simpler task for database designers, so as that they will not have to put in a lot of effort to design the database and later they will not have to pay special attention and work for database changes. Their research proposed a new framework for the definition of a universal logical database schema that eliminates completely the need for reorganisation at both logical and internal levels, even when major modifications of the database requirements have occurred. This new framework was called Frame Database Model (FDB) [19]. This Universal logical database schema is based on well and strictly defined set of Metadata and data tables and it does not permit any mixture with conventionally stored data. All the entities that are available to the user, along with their attributes, are documented exclusively in the metadata tables and the facts concerning the instances of the entities are recorded exclusively in the data tables of the FDB Universal schema. Another noteworthy feature of FDB is that it supports Schema evolution, both for facts and entities. Moreover the FDB model allocates ways of imprinting strictly connected (Hardly related) information with innate (inherent/native) mechanisms, in contrast to the relational model that compels the creation of artifact structures (tables) to represent strictly connected (Hardly related) data. As an example, the relational model requires the creation of new table to store data that relate of the form one-to-many (the addresses or the telephones of customer). In contradiction to the relational model, the FDB model can maintain the same information with a field that is accommodated in the side of the “one” and accepts multiple values. Even more complex forms of strictly connected (Hardly related) information, are impressed, in the FDB model, with innate (inherent/native) mechanisms. For example a correlation of information with a form many-to-many (as are the DVDs that have been rented to a member of a Video Club) is maintained in one of the two connected sides without requiring the creation of a new table to correlate the information. That is to say, in the many-tomany correlations we follow a mechanism that emanates from the real world (in the example of the Video Club we maintain inside the card of a customer a table with his/her renting of DVDs). The most important fact in the FDB model is that it organizes information without any repetition of values. In order to be more precise, not only it does not proceed in repetitions but it ensures that these cannot be created. In the example of the addresses (or alternatively the telephones) of a customer the basic data of a customer (let us say name, surname and code) are stored once and all the different addresses that the customer may have are stored in the field addresses. That is to say, the use of a single big (universal) table to store/repeat as many times the basic attributes of a customer (name, surname and code) as the number of his/hers addresses is avoided naturally (without any effort). 
Thus, the FDB model provides the no-redundancy property as an inherent feature.
2.4 Not First Normal Form (NF2) or Nested Relational Data Modeling The motivation for developing the Nested Relational data model was the fact that the Relational model has difficulties of modeling the real world; It is also inconvenient for handling even simple data structures commonly used in Information Retrieval. To overcome these problems, Researchers have proposed a relational model where Non First Normal Form (NF2) relations are allowed [17]. This extension encompasses the classical 1st Normal Form (1NF) model and adds, to the relational algebra, two basic operators (namely “nest” and “unnest”). Based on the “nest” operator, this proposal allows sets (as the result of “one-attribute” nest operation) and sets of sets (as the result of “multi-attribute” nest operation) as attribute values. NF2 sets are equivalent to simple FDB fields with repetitions and NF2 sets of sets are equivalent to composite FDB fields with repetitions. The researchers have also proposed a query language extension for NF2 table definition and manipulation. However, the NF2 presents some weak points: – It does not support Entity or Fact (Attribute) evolution – It does not have Universal logical schema – The proposed query language extension only undertakes (be engaged in) Retrieval statements – This Retrieval statements are rather suggestions or hypothetical statements and are not parts of a mature language that handles relational tables with non-atomic (sets and sets of sets as) attribute values – The notion that governs the whole idea, which has passed through and is reflected by the proposed query language extension, is that the subfields (of composite fields) are not directly accessed by the user. – Related to the previous point is that the proposed query language uses Nested Select statements, whenever a restriction over a subfield should be applied Possibly, these weak points have the consequence that, after 26 years, Non First Normal Form does not seem to be implemented as a DBMS. However, some researchers are still interested [12] and define languages supporting the eXtended NF2 (XNF2) model. 2.5 Object Oriented and Object Relational Databases The weakness of the relational model to manage complex, highly interrelated information motivated the research for Object Databases (ODB) and Object Relational Databases (ORDB). Both models are also described in textbooks (for example Elmasri and Navathe, 2000 [5]). The portability and interoperability of ODBs is ensured by the Object Model suggested by the Object Database Management Group (ODMG). The ODMG Object Model provides also the definition for an Object Definition Language (ODL) and an Object Query Language (OQL). The ODL statements seem to (or are influenced by) the Java language statements used for class definitions, while the OQL statements seems to (or are influenced by) the SQL statements used for data retrieval. At mid of the 1990’s decade there were a notable number of ODB solutions, but today only half of those remain active. Our personal opinion is that the data management professionals do not like
to bother themselves with strange programming constructions of classes, inheritance, etc, and consequently they do not decide easily to use an ODB. The other direction, the Object Relational Databases, aim to provide solutions for complex and highly interrelated information management, without imposing complicated programming constructions. For this reason, they provide Black Box Complex data types for various purposes (management of time series, geographic point manipulations, face recognition, content-based retrieval of digital audio, image watermarking, image search, full-text search), Opaque types for extending the repertoire of Black Box Complex data types and User Defined Complex data types. Black Box Complex data types are named as Data Blades in Informix Universal Server and are named as Cartridges in Oracle. The User Defined Complex data types have similar characteristics to the ODMG Objects. The composition of User Defined Complex data types is based on simpler structures (namely: the Collection types and the Row types). The SQL3 standard provides an extension to the previous SQL standards, for handling the most of the characteristics added with the ORDBs. 2.6 Approaches Based on Semantic Web Technology The purpose of Semantic Web is to share and reuse knowledge. The associated concepts, Web standards and query languages allow storing, querying and reasoning. Reasoning is a common-sense query processing and it requires a small number of deductions, in order to answer some query. To support the common-sense query processing, any Semantic Web system must be provided with a set of rules of deduction. However, Semantic Web is not designed with the intention of handling information efficiently and effectively, as Databases are. There exist approaches that combine the Semantic Web information technology and the Database information technology. One of the goals of these combinations is the interoperability of Semantic Web representations (Ontologies) and Databases [1]. Underneath of this interoperability is the translation of schemas from one information technology to the other [2]. The Relational data model and also the Functional data model have been used as the representatives of the Database technology, in these approaches of information technology combinations [1,13]. The Resource Description Framework (RDF) and its associated Query Language (SPARQL) are the most used representatives of the Semantic Web, in these approaches of information technology combinations. One of the purposes of this paper is to support a richer, than the Relational, data model, while the application developer, that uses this model, does not face complications in analysis, design and data manipulations. Our belief is that, the freedom of more complex data, that the Semantic Web information technology (through Ontologies) offers, operates against the efficiency, maintainability and simplicity of an application development oriented system.
3 The Missing Puzzle Item
It is obvious from the plethora of models presented in the Background section that there is a need for a DBMS able to provide more powerful data types, matching those of
real-world complex data. The first four models discussed (namely Generic, EAV, FDB, and NF2) provide composite data types using meta-models on top of relational databases. The problem with these approaches is that handling the composite data requires very good knowledge of the underlying meta-model structures and of the internal organisation of both metadata and data. The user (programmer) must combine the business-logic requirements with three steps: retrieving the metadata that explain the composition of the requested composite data types, retrieving the underlying simple data, and re-composing the requested composite data. For some of the first four models, an SQL query language extension is provided that uses nested Select statements whenever a restriction over a subfield must be applied; however, besides not being mature, this approach complicates the expression of data maintenance statements for real applications. The ultimate requirement should therefore be a data manipulation language able to manipulate composite data types directly (without burdening the programmer with the three steps mentioned above), while permitting the direct expression of restrictions over subfields. By analogy with the ideas of the Object Relational model, the new data manipulation language should provide Black Box Complex data types and Opaque types for extending the repertoire of Black Box Complex data types. However, the Object data types of the Object Relational model can be excluded from the new data manipulation language, since the provided composite data types cover the needs for complex data types satisfactorily.
4 The Solution – Conceptual Universal Database Language
In our approach we have adopted FDB [19,20] as the underlying model for implementing our goal of a data manipulation language able to manipulate composite data types directly. We preferred the FDB model since it is more compact and better defined than the other models and also supports schema evolution. The logical schema of FDB (in its most refined version [10]) is based on the following tables:

Name                 Structure                                                                                   Use
Languages            (language_id, lang_name)                                                                    AM
Datatypes            (datatype_id, datatype_name)                                                                AM
Messages             (message_id, language, message)                                                             AM
Entities             (frame_entity_id, title)                                                                    PM
Tag_attributes       (entity, tag, title, occurrence, repetition, authority, language, datatype, length)         PM
Subfield_attributes  (entity, tag, subfield, title, occurrence, repetition, language, datatype, length)          PM
Catalogue            (entity, frame_object_number, frame_object_label, temp_stamp)                               D
Tag_data             (entity, frame_object, tag, repetition, chunk, tdata)                                       D
Authority_links      (from_entity, from_tag, from_subfield, to_entity, to_tag, to_subfield, relationship_type)   R
Subfield_data        (entity, frame_object, tag, tag_repetition, subfield, subfld_repetition, chunk, sdata)      D

Notes: AM: Auxiliary Metadata, PM: Primary Metadata, D: Data, R: Relationships. Primary keys are underlined.
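Against this schema, the following Python sketch illustrates the three-step burden described in Section 3 (read metadata, read raw data, re-compose) that a programmer would face without CUDL. The FDB sets are modelled as in-memory lists of dictionaries using the column names above; the exact population rules and key handling are assumptions made for the illustration, not a specification of the FDB implementation.

def compose_frame_object(entity, frame_object,
                         tag_attributes, subfield_attributes,
                         tag_data, subfield_data):
    result = {}
    # Step 1: metadata - which tags (and subfields) make up this entity.
    for meta in (m for m in tag_attributes if m["entity"] == entity):
        tag = meta["tag"]
        sub_titles = {s["subfield"]: s["title"] for s in subfield_attributes
                      if s["entity"] == entity and s["tag"] == tag}
        # Step 2: raw rows from the data sets for this frame object and tag.
        if not sub_titles:                 # simple tag, possibly repeating
            values = [r["tdata"] for r in tag_data
                      if (r["entity"], r["frame_object"], r["tag"]) ==
                         (entity, frame_object, tag)]
        else:                              # composite tag: rebuild repetitions
            reps = {}
            for r in subfield_data:
                if (r["entity"], r["frame_object"], r["tag"]) != \
                        (entity, frame_object, tag):
                    continue
                row = reps.setdefault(r["tag_repetition"], {})
                row[sub_titles[r["subfield"]]] = r["sdata"]
            values = [reps[k] for k in sorted(reps)]
        # Step 3: re-composition into the structure the application expects.
        result[meta["title"]] = values
    return result

if __name__ == "__main__":
    tag_meta = [{"entity": 1, "tag": 10, "title": "Title"},
                {"entity": 1, "tag": 20, "title": "Actions"}]
    sub_meta = [{"entity": 1, "tag": 20, "subfield": 1, "title": "Action"},
                {"entity": 1, "tag": 20, "subfield": 2, "title": "Deadline"}]
    tdata = [{"entity": 1, "frame_object": 7, "tag": 10, "tdata": "Hermes"}]
    sdata = [{"entity": 1, "frame_object": 7, "tag": 20, "tag_repetition": 1,
              "subfield": 1, "sdata": "program code"},
             {"entity": 1, "frame_object": 7, "tag": 20, "tag_repetition": 1,
              "subfield": 2, "sdata": "23/04/2008"}]
    print(compose_frame_object(1, 7, tag_meta, sub_meta, tdata, sdata))
    # -> {'Title': ['Hermes'],
    #     'Actions': [{'Action': 'program code', 'Deadline': '23/04/2008'}]}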
Let us explain this universal schema. Only three tables (sets, in FDB terminology) of the schema host data, namely Catalogue, Tag_data, and Subfield_data; the remaining sets host metadata. Three of the metadata sets, Entities, Tag_attributes, and Subfield_attributes, define every abstract entity and its constituents, and one metadata set, Authority_links, defines data relationships; the other three are auxiliary metadata sets. We must stress that this universal schema can realize 1:M and M:N relationships without the need for intermediate entities: 1:M relationships are realized very easily with tags that accept repetitions, and M:N relationships are also realized with repetitions. A tag that accepts repetitions represents the possibility of placing a list of values where a single field would be; a tag can also have subfields. Combining the two yields the ability to place an entire table where a single field would be, and, in addition, each of the cells of such a table can itself accept repetitions (lists of values). Karanikolas et al. [9] introduced the syntax and semantics of the Conceptual Universal Database Language (CUDL). CUDL is a database language (both DDL and DML) that supports composite data types, i.e. attributes (tags in CUDL terminology) that combine more than one sub-attribute (subfields in CUDL terminology); it also supports multiple values (repetitions) for both tags and subfields. The key difference of CUDL from other approaches is that not only tags but also subfields and repetitions can be addressed in CUDL statements: users can express retrieval restrictions over a specific subfield, or express the modification of a specific subfield in a specific repetition of some tag, and so on. In [9], the authors focused mainly on presenting and analyzing the statement for value retrieval (in the schema and the data). More recently [11], they focused mainly on the CUDL statement for value modifications in the schema (schema changes) and in the data. They have also addressed the need for relationship declarations [10]. This need is more significant in the FDB-CUDL model because relationships between entities are, in most cases, implemented without introducing new tables. Without a way to declare relationships, the user would face a contradictory situation in which the model is self-explaining (the user can consult only Tag_attributes and Subfield_attributes and derive the data model) but the data relationships are entirely undocumented. To cope with this need, the FDB model introduced the Authority_links set. The Authority_links set is also used to declare authority controls and to reduce the variability of the expressions used for the same instance of an identity. All of these (relationship declarations and authority control declarations) are provided through CUDL statements. To give an indicative example of the CUDL language, suppose that an application administers the projects implemented by a company. In such an application there is an entity, named Projects, that contains all the projects the company runs. Figure 1 provides two instances of the Projects entity.
The following is a data retrieval statement, expressed in CUDL:

# Find data when entity = ‘Projects’ and
  tag = ‘title’ restr data = ‘Hermes’ and
  subfield = ‘Action’ restr data = ‘program code’ and
  subfield = ‘Employee’
With this CUDL statement we declare that the tag ‘title’ will be projected and will concurrently function as a restriction on the selection of instances, that the subfield ‘Action’ will be projected and will concurrently function as a restriction on the selection of instances, and finally that the subfield ‘Employee’ will only be projected.
Project_code: Proj066;  Title: Hermes;  Budget: 455,000
  Actions: Employee = {Yannis, Vangelis, Dimitris, Panos};
           (Action, Deadline) = {(Software analysis, 17/10/2007),
                                 (Software requirements, 22/01/2008),
                                 (Program code, 23/04/2008)}

Project_code: Proj055;  Title: Athena;  Budget: 250,000
  Actions: Employee = {Yannis, Giorgos, Maria, Nikos};
           (Action, Deadline) = {(grubbing, 20/3/2009),
                                 (pruning, 25/3/2009),
                                 (watering, 30/4/2009)}

Fig. 1. Instances of the Projects entity
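To make the selection semantics of the CUDL retrieval statement above concrete, the following sketch mimics its behaviour over the two instances of Figure 1. It is written in Python purely for illustration: the nested-dictionary encoding, the helper name cudl_find, and the placement of Employee inside the Actions tag are our own assumptions, not part of CUDL or FDB.

# Illustrative only: the two Projects instances of Fig. 1 as nested
# dictionaries; tags/subfields with repetitions become lists.
projects = [
    {"Project_code": "Proj066", "Title": "Hermes", "Budget": "455,000",
     "Actions": {"Employee": ["Yannis", "Vangelis", "Dimitris", "Panos"],
                 "Action": ["Software analysis", "Software requirements", "Program code"],
                 "Deadline": ["17/10/2007", "22/01/2008", "23/04/2008"]}},
    {"Project_code": "Proj055", "Title": "Athena", "Budget": "250,000",
     "Actions": {"Employee": ["Yannis", "Giorgos", "Maria", "Nikos"],
                 "Action": ["grubbing", "pruning", "watering"],
                 "Deadline": ["20/3/2009", "25/3/2009", "30/4/2009"]}},
]

def cudl_find(instances):
    # Restrict on Title = 'Hermes' and on the Action subfield = 'program code';
    # project Title, the matching Action repetitions, and Employee.
    for inst in instances:
        if inst["Title"].lower() != "hermes":
            continue
        hits = [a for a in inst["Actions"]["Action"] if a.lower() == "program code"]
        if hits:
            yield inst["Title"], hits, inst["Actions"]["Employee"]

print(list(cudl_find(projects)))
# [('Hermes', ['Program code'], ['Yannis', 'Vangelis', 'Dimitris', 'Panos'])]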
5 Moving Up the Database Design Levels

As we have mentioned earlier, the ultimate goal of our previous research was to define a data manipulation language able to manipulate composite data types directly, while permitting the direct expression of restrictions over subfields. CUDL is the outcome of this effort. In CUDL, the set of user-defined (composite) data types depends on the user's inspiration and needs. The only restriction is that composite data types can be composed of only simple (predefined) data types (i.e., char, varchar, number, date, etc.). There is no restriction on permitting repetitions (i.e., both tags and subfields can have repetitions). However, there is another interesting outcome of the introduction of CUDL. With CUDL, the application programmer / designer can model the structures of the application with composite data types, closer to the ER diagrams and sometimes without any decomposition of the ER entities into simpler ones. For example, the ER diagram of Figure 2 is directly supported by the CUDL composite data types (see Figure 1). On the other hand, the logical database level of any CUDL-based application is the underlying FDB model. Thus, instead of transforming from ER to simple relational tables to provide a logical model for manipulation through SQL, we are able to transform from ER to CUDL Abstraction Level (CAL) entities (namely, CUDL data sets with composite data types). In other words, the classic database design triplet (ER, logical and physical design) is replaced by the quadruple ER, CAL, logical and physical design, with a fixed logical design (the FDB model).
Fig. 2. ER diagram for a company’s projects
CAL is the data model that the user conceives. It is a model that supports collections of uniform (congenerous) instances (frame objects in CAL terminology). The structure of each instance (of a concrete collection) is a set of attributes (tags in CAL terminology). In contrast to the relational model, attributes (tags) can either have simple (predefined) data types or be composed of a set of sub-attributes (subfields in CAL terminology). The CAL model also supports the existence of multiple values (repetitions) for both tags and subfields. Each CAL collection of uniform (congenerous) instances is called an entity (also called an "abstract entity" for reasons explained in [10]). The CUDL language is a database language for defining (DDL) and handling (DML) CAL entities. It is obvious, from the above example, that the transformation from ER entity types to CAL entities is a direct mapping. However, the transformation from ER relationship types to CAL structures is a more complicated process. In the next section, we examine a complete transformation example, i.e., the transformation of ER entity types, ER relationship types and ER attributes (of entity types and of relationship types) to CAL structures.
6 Transforming from ER to CAL Structures/Entities

In order to make the need for designing at higher levels more evident, we provide an example of a fairly complicated ER diagram (corresponding to a real situation) and its transformation to CAL entities (CUDL data sets with composite data types). We present an Electronic Patient Record (EPR) that could be maintained in a real Hospital Information System (HIS). A brief description of our HIS follows:
– One patient can have one or more incidents treated in the hospital.
– During an incident the patient can undergo a series of laboratorial examinations and also a series of radiological examinations.
– During an incident the patient can undergo zero or more operations (surgeries).
– There are a number of doctors (servant physicians) for each incident.
– Moreover, there are a number of doctors (surgeons, anaesthesiologists, etc.) that participate in each operation.
– Each incident is characterized by a Social Security Institute (which undertakes the hospital fees, i.e., covers the expenses) and an ICD-9 diagnosis (the final diagnosis of the incident).
Fig. 3. The ER diagram of the HIS
The ER diagram of Figure 3 is a more detailed conceptual depiction of the discussed HIS – EPR database. According to the ER diagram of Figure 3, there is a binary relationship between the Incident and the “Laboratorial Examination” entity types with cardinality ratio M:N. A similar binary relationship between the Incident and the “Radiological Examination” entity types, with cardinality ratio M:N, is also depicted. A third binary relationship with cardinality ratio M:N exists between the Incident and the Doctor entity types. This relationship is responsible for the servant physicians of the incident. The unique ternary relationship of the ER diagram is an identifying relationship of the weak entity type “Incident Operation”. There are two owner entity types that identify the weak entity type “Incident Operation”: the Incident and the Operation entity types. There is a binary relationship, with cardinality ratio M:N, between the weak entity type “Incident Operation” and the (strong) entity type Doctor. The latter relationship expresses the set of doctors (surgeons, anaesthesiologists, etc.) that participate in each “Incident Operation”. Next, we provide the CAL entities into which the ER diagram of Figure 3 is transformed. Some rules (general guidelines) for transforming from ER diagrams to CAL entities will be presented later. The CAL entities are (for composite and multivalued attributes, () and {} are used, respectively; see Subsection 3.3.1 of [5]):

Incident: Incident_code, Date_started, Date_ended, Patient_code, SSI_code, {Lab_examinations (LE_code, LE_datetime, LE_result)}, {Rad_examinations (RE_code, RE_datetime, RE_FilePath)}, {Incident_operations (Op_code, IO_datetime_started, IO_datetime_ended, {IO_doctors})}, {Incident_doctors}, ICD9_code.
Doctor: Doctor_code, name, surname.
Operation: Op_code, name, cost.
Rad_Examination: RE_code, description, type, cost.
Lab_Examination: LE_code, name, normal_values, cost.
Patient: Patient_code, name, surname, father_name, tax_registration_number, date_of_birth.
Social_Security_Institute: SSI_code, name, immediate_insured_rate, intermediate_insured_rate.
Diagnosis: ICD9_code, ICD9_description.

Since the most complicated CAL entity (structure) is the Incident, we present an instance of Incident in Figure 4.
Incident_code: S001
Date_started: 13/5/2007      Date_ended: 20/5/2007
Patient_code: A001           SSI_code: T001           ICD9_code: 574
Incident_doctors: {I001, I002, I079}
Incident_operations (Op_code, IO_datetime_started, IO_datetime_ended, {IO_doctors}):
  (E002, 14/5/2007 13:35, 14/5/2007 15:05, {...})
  (E015, 16/5/2007 12:00, 16/5/2007 13:00, {...})
  IO_doctors listed across the two operations: I001, I005, I100, I065, I012, I100, I032
Lab_examinations (LE_code, LE_datetime, LE_result):
  (UREA, 15/5/2007 10:00, 32,4 mg/dl)
  (UREA, 15/5/2007 14:30, 32,5 mg/dl)
  (UREA, 16/5/2007 08:00, 31,6 mg/dl)
  (CREA, 15/5/2007 10:00, 1,17 mg/dl)
  (CREA, 16/5/2007 08:00, 1,08 mg/dl)
  (PROT, 15/5/2007 10:00, 7,19 g/dl)
  (PROT, 16/5/2007 08:00, 6,95 g/dl)
Rad_examinations (RE_code, RE_datetime, RE_FilePath):
  (U/S Kidney, 16/5/2007 12:00, \\FS1\RIS\Uaz34.tif)

Fig. 4. An Incident frame object
The rules for transforming from (relationship types of) ER diagrams to CAL entities (at least for the cases that appear most often in real applications, like those needed for transforming from the ER diagram of Figure 3 to the CAL entity Incident) are the following:

Rule 1. Binary relationships with cardinality ratio M:N can be hosted in one of the two related entities as a tag (field) with repetitions. In case the relationship has no attributes, a simple (non-composite) tag with repetitions is capable of storing the primary key values of the related (hosted) entity. Whenever the relationship has attributes, a composite tag (composed of the primary key values of the hosted entity and the relationship’s attributes) with repetitions should be used. Obviously, the key of the hosting participant is not needed. This rule applies to the relationship between the Incident and the “Laboratorial Examination” entity types and to the relationship between the Incident and the Doctor entity types. Since we have selected to host the
relationships in the Incident entity type, the former relationship is implemented with the following composite tag with repetitions: {Lab_examinations (LE_code, LE_datetime, LE_result)}. For the same reason, the latter relationship is implemented with the following simple tag with repetitions: {Incident_doctors}.

Rule 2. Binary identifying relationships can be transformed to a composite tag with repetitions, hosted in the owner entity type. In this case, the composite tag should be composed of the partial key (of the weak entity) and the remaining attributes of the relationship. Obviously, the key of the hosting owner entity is not needed. There is no application of this rule in the studied ER diagram.

Rule 3. The previous rule can be extended to ternary identifying relationships. Ternary identifying relationships can be transformed to a composite tag with repetitions, hosted in one of the (two) owner entity types. In this case, the composite tag should be composed of the partial key (of the weak entity type), the key of the hosted owner entity type and the remaining attributes of the relationship. Obviously, the key of the hosting owner entity type is not needed. This rule applies to the weak entity type “Incident Operation”. Since we have selected to host the relationships in the Incident owner entity type, the relationship is implemented with the Incident_operations composite tag with repetitions: {Incident_operations (Op_code, IO_datetime_started, IO_datetime_ended, {IO_doctors})}. The partial key of the weak entity type participates in Incident_operations as the IO_datetime_started subfield. The key of the owner entity type Operation participates in Incident_operations as the Op_code subfield. The subfield IO_datetime_ended is an attribute of the relationship. The role of the IO_doctors subfield is explained by the next rule.

Rule 4. Binary relationships with cardinality ratio M:N, relating a weak entity type to some other (strong) entity type, without having relationship attributes, can be transformed to an extra subfield with repetitions of the composite tag implementing the weak entity. This rule applies to the relationship between the weak entity type “Incident Operation” and the entity type Doctor. This rule explains the last constituent of the Incident_operations tag presented above. (The domain of {IO_doctors} is the power set of the domain of Doctor_code.)

Rule 5. Binary relationships with cardinality ratio 1:N can be hosted in the N-side entity type as a tag (field) without repetitions. In case the relationship has no attributes, a simple (non-composite) tag is enough for storing the primary key values of the opposite-side entity type. Otherwise, whenever the relationship has attributes, a composite tag should be used; the primary key of the opposite-side entity type and the relationship’s attributes compose this composite tag. Obviously, the key of the hosting (N-side) participant is not needed. This rule applies to the relationships of the Incident with the entity types Patient, Social_Security_Institute and Diagnosis, respectively. The tags
Patient_code, SSI_code and ICD9_code of the CAL entity Incident implement these relationships. Actually, the rules presented above are also responsible for updating the foreign-key constraints. These constraints were discussed in a previous work [10] and are not analyzed further here.
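As a purely illustrative recap of Rules 1–5, the following Python sketch pictures the Incident frame object of Figure 4 with a nested-dictionary encoding of our own; the assignment of the IO_doctors repetitions to the two operations is assumed, since Figure 4 lists them in a single column.

# Sketch only: how the transformation rules shape the Incident frame object.
incident = {
    "Incident_code": "S001",
    "Date_started": "13/5/2007", "Date_ended": "20/5/2007",
    "Patient_code": "A001",   # Rule 5: 1:N relationship -> simple tag without repetitions
    "SSI_code": "T001",       # Rule 5
    "ICD9_code": "574",       # Rule 5
    "Incident_doctors": ["I001", "I002", "I079"],   # Rule 1: M:N without attributes
    "Lab_examinations": [                           # Rule 1: M:N with attributes
        {"LE_code": "UREA", "LE_datetime": "15/5/2007 10:00", "LE_result": "32,4 mg/dl"},
        {"LE_code": "CREA", "LE_datetime": "15/5/2007 10:00", "LE_result": "1,17 mg/dl"},
        # ... remaining repetitions as in Fig. 4
    ],
    "Rad_examinations": [
        {"RE_code": "U/S Kidney", "RE_datetime": "16/5/2007 12:00",
         "RE_FilePath": r"\\FS1\RIS\Uaz34.tif"},
    ],
    "Incident_operations": [                        # Rule 3: ternary identifying relationship
        {"Op_code": "E002", "IO_datetime_started": "14/5/2007 13:35",
         "IO_datetime_ended": "14/5/2007 15:05",
         "IO_doctors": ["I001", "I005", "I100"]},   # Rule 4: nested repetitions (split assumed)
        {"Op_code": "E015", "IO_datetime_started": "16/5/2007 12:00",
         "IO_datetime_ended": "16/5/2007 13:00",
         "IO_doctors": ["I065", "I012", "I100", "I032"]},
    ],
}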
7 Conclusions

So far, the design of an application having a relational data repository has mainly required the decomposition of the real-world structures into very simple attributes, the composition of a logical schema with naive relational structures, and the formulation of query and manipulation (SQL) statements based on the logical schema. The contribution of this paper consists in
– arguing for the need for a database query and manipulation language, like CUDL, able to directly handle composite entities of more abstract data models that offer composite / complex data types expressing real-world entities (without transforming them to a relational logical schema),
– arguing that the types used in CUDL allow a database designer / developer to express the structures of an application with types that are very close to the ER entity types, while ER entity types can be directly transformed to CAL entity types, and
– providing a set of rules for transforming (the relationship types of) an ER diagram to CAL, so that CUDL can be utilized to manipulate the resulting high-level entities.
Thus, designers and developers can concentrate on the business logic of their applications, instead of wasting time on the expression of statements that manipulate naive database structures. Possible future research directions include
– completing the analysis and design process that consists in the quadruple “ER, CAL, FDB, physical level” by providing a physical model and a transformation of the FDB to this model,
– implementing a database machine storing data based on this physical model and able to process CUDL queries,
– designing algorithms for optimized processing of CUDL queries,
– evaluating this machine in terms of performance and suitability for the development of complex applications, and
– examining further the relation between Semantic Web technology and our techniques.
Part of the literature review has been presented, by the first author, at a previous conference.
References 1. Atzeni, P., Del Nostro, P., Paolozzi, S.: Ontologies And Databases: Going Back And Forth. In: 4th International VLDB Workshop on Ontology-based Techniques for DataBases in Information Systems and Knowledge Systems (ODBIS 2008), Auckland, New Zealand (2008)
2. Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P.A., Gianforme, G.: Model-independent schema translation. The VLDB Journal 17(6), 1347–1370 (2008) 3. Anhøj, J.: Generic Design of Web-Based Clinical Databases. Journal Medical Internet Research 4 (2003), http://www.jmir.org/2003/4/e27/ 4. Cai, J., Johnson, S., Hripcsak, G.: Generic Data Modeling for Home Telemonitoring of Chronically Ill Patients. In: American Medical Informatics Association - Annual Symposium 2000 (AMIA 2000), Los Angeles, CA, pp. 116–120 (2000) 5. Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 3rd edn. Addison Wesley Publishing Company, Reading (2000) 6. Fischer, P.C., Van Gucht, D.: Determining when a Structure is a Nested Relation. In: 11th Int. Conf. on Very Large DataBases (VLDB 1985), pp. 171–180. Morgan Kaufmann, Stockholm (1985) 7. Johnson, S.B.: Generic data modeling for clinical repositories. Journal of American Medical Informatics Association 3(5), 328–339 (1996) 8. Johnson, S.B., Chatziantoniou, D.: Extended SQL for manipulating clinical warehouse data. In: American Medical Informatics Association Symposium (AMIA 1999), pp. 819– 823. AMIA (1999) 9. Karanikolas, N.N., Nitsiou, M., Yannakoudakis, E.J., Skourlas, C.: CUDL language semantics, liven up the FDB data model. In: 11th East-European Conf. on Advances in Databases and Information Systems (ADBIS 2007), local proceedings, pp. 1–16. Technical Univ. of Varna, Varna (2007) 10. Karanikolas, N.N., Nitsiou, M., Yannakoudakis, E.J.: CUDL Language Semantics: Authority Links. In: 12th East-European Conf. on Advances in Databases and Information Systems (ADBIS 2008), pp. 123–139. Tampere Univ. of Technology, Pori (2008) 11. Karanikolas, N.N., Nitsiou, M., Yannakoudakis, E.J., Skourlas, C.: CUDL Language Semantics: Updating Data. Journal of Systems and Software 82(6), 947–962 (2009) 12. van Keulen, M., Vonk, J., de Vries, A.P., Flokstra, J., Blok, H.E.: Moa: extensibility and efficiency in querying nested data. Technical Report TR-CTIT-02-19. Centre for Telematics and Information Technology, Univ. of Twente, The Netherlands (2002) 13. Martins, J., Nunes, R., Karjalainen, M., Kemp, G.J.L.: A Functional Data Model Approach to Querying RDF/RDFS Data. In: Gray, A., Jeffery, K., Shao, J. (eds.) BNCOD 2008. LNCS, vol. 5071, pp. 153–164. Springer, Heidelberg (2008) 14. Nadkarni, P.M.: Clinical Patient Record Systems Architecture: An Overview. Journal of Postgraduate Medicine 46(3), 199–204 (2000) 15. Nadkarni, P.: An introduction to entity-attribute-value design for generic clinical study data management systems. Presentation in: National GCRC Meeting, Baltimore, MD (2002) 16. Pavković, N., Štorga, M., Pavlić, D.: Two Examples of Database Structures in Management of Engineering Data. In: 12th Int. Conf. on Design Tools and Methods in Industrial Engineering, pp. 89–90. ADM-Associazione Nazionale Disegno di Macchine, Bologna (2001) 17. Schek, H.J., Pistor, P.: Data Structures for an Integrated Data Base Management and Information Retrieval System. In: 8th Int. Conf. on Very Large DataBases (VLDB 1982), pp. 197–207. Morgan Kaufmann, Mexico City (1982) 18. Worboys, M.F., Hearnshaw, H.M., Maguire, D.J.: Object-Oriented Data Modelling for Spatial Databases. Int. Journal of Geographical Information Systems 4(4), 369–383 (1990)
19. Yannakoudakis, E.J., Tsionos, C.X., Kapetis, C.A.: A new framework for dynamically evolving database environments. Journal of Documentation 55(2), 144–158 (1999) 20. Yannakoudakis, E.J., Diamantis, I.K.: Further improvements of the Framework for Dynamic Evolving of Database environments. In: 5th Hellenic – European Conf. on Computer Mathematics and its Applications (HERCMA 2001), Athens, Greece (2001) 21. Yannakoudakis, E.J., Nitsiou, M., Skourlas, C., Karanikolas, N.N.: Tarski algebraic operations on the frame database model (FDB). In: 11th Panhellenic Conf. in Informatics (PCI 2007), pp. 207–216. New Technologies Publications, Patras (2007)
Temporal Data Classification Using Linear Classifiers Peter Revesz and Thomas Triplet University of Nebraska - Lincoln, Lincoln NE 68588, USA [email protected], [email protected]
Abstract. Data classification is usually based on measurements recorded at the same time. This paper considers temporal data classification where the input is a temporal database that describes measurements over a period of time in history while the predicted class is expected to occur in the future. We describe a new temporal classification method that improves the accuracy of standard classification methods. The benefits of the method are tested on weather forecasting using the meteorological database from the Texas Commission on Environmental Quality.
1
Introduction
Data classifiers, such as support vector machines or SVMs [24], decision trees [15], or other machine learning algorithms, are widely used. However, they are used to classify data that occur in the same time period. For example, a set of cars can be classified according to their fuel efficiency. That is acceptable because the fuel efficiency of cars is not expected to change much over time. Similarly, we can classify a set of people according to their current heart condition. However, people’s heart condition can change over time. Therefore, it would be more interesting to classify people, using the current information, according to whether they are likely to develop a serious heart condition some time in the future. Consider a patient who transfers from one doctor to another. The new doctor may give the patient a set of tests and use the new results to predict the patient’s prospects. The question arises whether this prediction could be enhanced if the new doctor were to get the older test results of the patient. Intuitively, there are cases where the old test results could be useful for the doctor. For example, the blood pressure of a patient may be 130/80, which may be considered within normal. However, if it was 120/80 last year and 110/80 the year before, then the doctor may still be concerned about the steady rise of the patient’s blood pressure. On the other hand, if the patient’s blood pressure in the past was always around 130/80, then the doctor may be more confident in predicting the patient to be in good health. Therefore, the history of the patient is important in distinguishing between these two cases. Nevertheless, the temporal history of data is usually overlooked in the machine learning area.

Fig. 1. Comparison of the standard and the temporal classification methods

There are only a few previous works that combine some kind of spatio-temporal data and classification algorithms. Qin and Obradovic [14] are
interested in incrementally maintaining an SVM classifier when new data is added to a database. Therefore, [14] is not useful for predicting the future health of a patient or other classes that one may want to predict for the future. Tseng and Lee [22] classify temporal data using probabilistic induction. Our earlier work [19] considered data integration and reclassification by classifiers when all the data was measured at the same time. In this paper, we propose a new temporal classification method that, instead of probabilistic induction [22], extends existing linear classifiers to deal with temporal data. Figure 1 compares the standard classifiers and the new temporal classifier method. The standard classifiers take as input the current (at time t) values of the features in the feature space and the class label some n time units ahead (at time t + n). The temporal classifiers take as input, in addition to the current features and the class, the history, that is, the old values of the features up to some i time units back in time (that is, from time t − i to t − 1). Weather forecasting is a challenging task. It is also natural to study because the major interest is in the prediction of the weather ahead of time instead of describing the current conditions. We tested our temporal classifier on a meteorological database of the Texas Commission on Environmental Quality. At first glance it would seem useless to look at the weather history more than a couple of days back. Surprisingly, we discovered that the history matters more than expected and the classification can be improved if one looks 15 days back in time.
We were also surprised that the history of some features was considerably more useful than the history of others. Moreover, the features that are the most important when looking only at time t are not the same as the features that are important when one looks at the weather history. That happens because different features have different permanency. For example, wind direction may change greatly from one hour to another. On the other hand, ozone levels are fairly constant. The rest of the paper is organized as follows. Section 2 presents a review of classifiers and constraint databases. Section 3 describes our database representation and querying of linear classifiers. These representations are used in our implementations. Section 4 presents the new temporal classification method and a corresponding data mapping. Section 5 describes computer experiments and discusses the results. Finally, Section 6 gives some concluding remarks and open problems.
2
Review of Classifiers and Constraint Databases
In many problems, we need to classify items, that is, we need to predict some characteristic of an item based on several parameters of the item. Each parameter is represented by a variable which can take a numerical value. Each variable is called a feature and the set of variables is called a feature space. The number of features is the dimension of the feature space. The actual characteristic of the item we want to predict is called the label or class of the item. To make the predictions, we use classifiers. Each classifier maps a feature space X to a set of labels Y. The classifiers are found by various methods using a set of training examples, which are items where both the set of features and the set of labels are known. A linear classifier maps a feature space X to a set of labels Y by a linear function. In general, a linear classifier f(x) can be expressed as follows:

    f(x) = w · x + b = Σ_i w_i x_i + b                    (1)
where w_i ∈ R are the weights of the classifier and b ∈ R is a constant. The value of f(x) for any item x directly determines the predicted label, usually by a simple rule. For example, in binary classification, if f(x) ≥ 0 then the label is +1, else the label is −1.

Example 1. Suppose that a disease is conditioned by two antibodies A and B. The feature space is X = {Antibody A, Antibody B} and the set of labels is Y = {Disease, No Disease}, where Disease corresponds to +1 and No Disease corresponds to −1. Then, a linear classifier is:

    f({Antibody A, Antibody B}) = w1 · Antibody A + w2 · Antibody B + b

where w1, w2 ∈ R are constant weights and b ∈ R is a constant. We can use the value of f({Antibody A, Antibody B}) as follows:
Fig. 2. A set of training examples with labels +1 (♦) and −1 (•). This set is linearly separable because a linear decision function in the form of a hyperplane can be found that classifies all examples without error. Two possible hyperplanes that both classify the training set without error are shown (solid and dashed lines). The solid line is expected to be a better classifier than the dashed line because it has a wider margin, which is the distance between the closest points and the hyperplane.
– If f({Antibody A, Antibody B}) ≥ 0, then the patient has Disease.
– If f({Antibody A, Antibody B}) < 0, then the patient has No Disease.
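A minimal Python sketch of Equation (1) and the decision rule of Example 1; the weight, bias and antibody values are hypothetical and chosen only to illustrate the thresholding.

def linear_classify(x, w, b):
    # f(x) = w . x + b, followed by the sign-based decision rule
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "Disease" if f >= 0 else "No Disease"

# x = (Antibody A, Antibody B); w1, w2 and b are made-up constants
print(linear_classify((1.2, 0.4), w=(0.8, -1.5), b=-0.1))   # -> "Disease"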
2.1 Support Vector Machines
Suppose that numerical values can be assigned to each of the n features in the feature space. Let x_i ∈ R^n with i ∈ [1..l] be a set of l training examples. Each training example x_i can be represented as a point in the n-dimensional feature space. Support Vector Machines (SVMs) [24] are increasingly popular classification tools. SVMs classify the items by constructing a hyperplane of dimension n − 1 that splits all items into the two classes +1 and −1. As shown in Figure 2, several separating hyperplanes may be suitable to split a set of training examples correctly. In this case, an SVM will construct the maximum-margin hyperplane, that is, the hyperplane which maximizes the distance to the closest training examples.
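As a hedged illustration of the maximum-margin construction, the following sketch fits a linear SVM to a small, made-up two-dimensional training set like the one in Figure 2. It uses scikit-learn for brevity; the experiments in this paper rely on the LIBSVM library [5] instead.

from sklearn.svm import SVC

# Hypothetical, linearly separable training examples with labels -1 / +1
X = [[1.0, 2.0], [2.0, 3.0], [2.5, 1.0],
     [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]]
y = [-1, -1, -1, +1, +1, +1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.coef_, clf.intercept_)          # w and b of the maximum-margin hyperplane
print(clf.predict([[3.0, 3.0]]))          # side of the hyperplane for a new item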
2.2 ID3 Decision Trees
Decision trees were frequently used in the nineties by artificial intelligence experts because they can be easily implemented and they provide an explanation of the result. A decision tree is a tree with the following properties:
– Each internal node tests an attribute.
– Each branch corresponds to the value of the attribute.
– Each leaf assigns a classification.
ID3 [15] is a greedy algorithm that builds decision trees. The ID3 decision tree and SVMs are both linear classifiers because their effects can be represented mathematically in the form of Equation (1).
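The greedy criterion behind ID3 is the selection, at each node, of the attribute with the highest information gain. The Python sketch below is written from the standard description of the algorithm [15], not from the authors' implementation.

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Split the labels by the value of attribute `attr` and compare entropies.
    buckets = {}
    for row, lab in zip(rows, labels):
        buckets.setdefault(row[attr], []).append(lab)
    remainder = sum(len(ls) / len(labels) * entropy(ls) for ls in buckets.values())
    return entropy(labels) - remainder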
2.3 Constraint Databases
Constraint databases [13,17] form an extension of relational databases [7] where the database can contain variables that are usually constrained by linear or polynomial equations.

Example 2. Figure 3 shows a moving square, which at time t = 0 starts at the first square of the first quadrant of the plane and moves to the northeast with a speed of one unit per second to the north and one unit per second to the east. It can be represented by the constraint relation:

    Moving_Square(x, y, t):  x ≥ t, x ≤ t + 1, y ≥ t, y ≤ t + 1, t ≥ 0

When t = 0, the constraints are x ≥ 0, x ≤ 1, y ≥ 0, y ≤ 1, which describe the unit square in the first quadrant. We can similarly calculate the position of the square at any time t > 0 seconds. For example, when t = 5 seconds, the constraints become x ≥ 5, x ≤ 6, y ≥ 5, y ≤ 6, which is another square with lower left corner (5, 5) and upper right corner (6, 6). Constraint databases can be queried by both Datalog and SQL queries [1,16,23]. Constraint database systems include CCUBE [4], DEDALE [9], IRIS [3], and MLPQ [18].
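A small Python sketch (our own, for illustration only) of how the constraints of the Moving_Square relation act as a membership test for a point (x, y) at time t:

def in_moving_square(x, y, t):
    # The tuple (x, y, t) belongs to the relation iff it satisfies all constraints.
    return t >= 0 and t <= x <= t + 1 and t <= y <= t + 1

print(in_moving_square(0.5, 0.5, 0))   # True: inside the unit square at t = 0
print(in_moving_square(5.5, 5.2, 5))   # True: inside the square [5,6] x [5,6] at t = 5
print(in_moving_square(2.0, 0.5, 5))   # False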
Fig. 3. A moving square
Constraint databases, which were initiated by Kanellakis et al. [12], have many applications ranging from spatial databases [21,6] through moving objects [10,2] to epidemiology [20]. However, only Geist [8] and Johnson et al. [11] applied them to classification problems. In particular, both Geist [8] and Johnson et al. [11] discussed the representation of decision trees by constraint databases.
3
Representation and Querying of Linear Classifiers
This section describes the representation of linear classifiers in constraint databases [13,17], which were reviewed in Section 2.3. In each case, the constraint database representation can be queried using any linear constraint database system. We also describe a few typical queries that are useful for classifying new data.
3.1 Representation and Querying of SVMs
The Texas Commission on Environmental Quality (TCEQ) database (see Section 5.1 for details) contains weather data for over 7 years. For simplicity, consider the following smaller version with only six consecutive days, where for each day D the features are Precipitation P, Solar Radiation R, and Wind Speed (north-south component) W, and the label is Temperature T, which is "High" or "Low":

Texas_Weather
  D    P     R     W      T
  1    1.73  2.47  -1.3   Low
  2    0.95  3.13  9.32   High
  3    3.57  3.56  4.29   Low
  4    0.24  1.84  1.51   Low
  5    0.0   1.19  3.77   High
  6    0.31  4.72  -0.06  High

To classify the above data, we can use an SVM linear classifier. First, we need to assign numerical values to symbolic features because SVMs are unable to handle non-numerical values. For instance, we assign the value t = −1 whenever t = Low and t = +1 whenever t = High. Then, we use the LIBSVM [5] library to build a linear classification using an SVM. This results in a linear classifier, which can be represented by the following linear constraint relation:

Texas_SVM(p, r, w, t):  −0.442838 p + 0.476746 r + 2.608779 w − 0.355809 = t

Given the Texas_Weather(d, p, r, w) and the Texas_SVM(p, r, w, t) relations, the following Datalog query finds for each day the distance t to the hyperplane separating the two temperature classes.
Temp_SVM(d, t) :- Texas_Weather(d, p, r, w), Texas_SVM(p, r, w, t).
Finally, we can use the SVM relation to do the predictions, based on whether we are above or below the hyperplane.

Predict(d, y) :- Temp_SVM(d, t), 'high' = y, t >= 0.
Predict(d, y) :- Temp_SVM(d, t), 'low' = y, t < 0.

Instead of the above Datalog queries, one can use the logically equivalent SQL query:

CREATE VIEW Predict AS
SELECT D.d, "High"
FROM Texas_Weather as D, Texas_SVM as T
WHERE D.p = T.p AND D.r = T.r AND D.w = T.w AND T.t >= 0
UNION
SELECT D.d, "Low"
FROM Texas_Weather as D, Texas_SVM as T
WHERE D.p = T.p AND D.r = T.r AND D.w = T.w AND T.t < 0

3.2 Representation and Querying of ID3 Decision Trees
Figure 4 shows the ID3 decision tree for the Texas Weather data in Section 3.1. Note that in this ID3 decision tree only the Precipitation feature is used. That is because the value of Precipitation is enough to classify the data for each day in the small database. For a larger database some precipitation values are repeated and other features need to be looked at to make a classification. A straightforward translation from the ID3 decision tree in Figure 4 to a linear constraint database yields the following:

Texas_ID3(p, r, w, t):
  p = 1.73, t = Low
  p = 0.95, t = High
  p = 3.57, t = Low
  p = 0.24, t = Low
  p = 0.0,  t = High
  p = 0.31, t = High
Fig. 4. Decision Tree for the prediction of the temperature using the weather dataset
Given the Texas_Weather(d, p, r, w) and the Texas_ID3(p, r, w, t) relations, the following Datalog query can be used to predict the temperature for each day:

Predict(d, t) :- Texas_Weather(d, p, r, w), Texas_ID3(p, r, w, t).

Instead of Datalog queries, one can use the logically equivalent SQL query:

CREATE VIEW Predict AS
SELECT D.d, T.t
FROM Texas_Weather as D, Texas_ID3 as T
WHERE D.p = T.p AND D.r = T.r AND D.w = T.w

3.3 Representation and Querying of ID3-Interval Decision Trees
A straightforward translation from the original decision tree to a linear constraint database does not yield a good result for problems where the attributes can have real-number values instead of only discrete values. Real-number values are often used when we measure some attribute like the wind speed in miles per hour or the temperature in degrees Celsius. Hence we improve the naive translation by introducing the comparison constraints >, <, ≥, ≤ to allow continuous values for some attributes. That is, we translate each node of the decision tree by analyzing all of its children. First, the children of each node are sorted based on the possible values of the attribute. Then, we define an interval around each discrete value based on the values of the previous and the following children. The lower bound of the interval is defined as the median value between the value of the current child and the value of the previous child. Similarly, the upper bound of the interval is defined as the median value of the current and the following children. For instance, assume we have the values {10, 14, 20} for an attribute among the children. This leads to the intervals {(−∞, 12], (12, 17], (17, +∞)}. Figure 5 shows a modified decision tree based on the above heuristic.

Fig. 5. Decision Tree for the prediction of the temperature using the weather dataset

Translating that modified decision tree yields the following constraint relation:

Texas_ID3-Interval(p, r, w, t):
  r < 2, w < 2.64, t = Low
  r < 2, w ≥ 2.64, t = High
  r ≥ 2, r < 4.3, p < 2.51, w < 8.63, t = Low
  r ≥ 2, r < 4.3, p < 2.51, w ≥ 8.63, t = High
  r ≥ 2, r < 4.3, p ≥ 2.51, t = Low
  r ≥ 4.3, t = High
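The interval construction described above can be sketched as follows (Python, illustrative only; the open/closed ends of the intervals are handled as in the {10, 14, 20} example):

def to_intervals(values):
    # Boundaries are the medians (midpoints) between consecutive child values.
    vs = sorted(values)
    cuts = [(a + b) / 2 for a, b in zip(vs, vs[1:])]
    bounds = [float("-inf")] + cuts + [float("inf")]
    return list(zip(bounds, bounds[1:]))

print(to_intervals([10, 14, 20]))
# [(-inf, 12.0), (12.0, 17.0), (17.0, inf)]  -- i.e. (-inf,12], (12,17], (17,+inf)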
The querying of ID3-Interval decision tree representations can be done like the querying of ID3 decision tree representations after replacing Texas_ID3 with Texas_ID3-Interval.
4
A Temporal Classification Method
The Texas_Weather database in Section 3.1 is atypical data for linear classifiers because it involves a temporal dimension. Although one may consider each day as an independent instance and simply ignore the temporal dimension, as we did earlier, this is probably not the best solution. Instead, we propose below a temporal classification method for dealing with temporal data. The temporal classification method is based on an alternative representation of the database. As an example, the Texas_Weather(d, p, r, w, t) relation can be rewritten into the temporal relation

Texas_Weather_History(d, p_{d-2}, r_{d-2}, w_{d-2}, p_{d-1}, r_{d-1}, w_{d-1}, p_d, r_d, w_d, t)

where for any feature f ∈ {p, r, w} the subscript of f_i indicates the day i when the measurement is taken. Note that even though we did not use any subscript in Texas_Weather, the implicit subscript for the features was always d. Now the subscripts go back in time, in this particular representation two days back, to d − 1 and d − 2. The Texas_Weather_History relation is the following:

Texas_Weather_History
  D   P_{d-2} R_{d-2} W_{d-2}   P_{d-1} R_{d-1} W_{d-1}   P_d   R_d   W_d     T
  3   1.73    2.47    -1.3      0.95    3.13    9.32      3.57  3.56  4.29    Low
  4   0.95    3.13    9.32      3.57    3.56    4.29      0.24  1.84  1.51    Low
  5   3.57    3.56    4.29      0.24    1.84    1.51      0.0   1.19  3.77    High
  6   0.24    1.84    1.51      0.0     1.19    3.77      0.31  4.72  -0.06   High
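The transformation from the basic to the alternative representation can be pictured as a simple shift-and-concatenate step. The Python below is our own illustration (the relation is encoded as a list of (day, features, label) triples); applied with i = 2 to the Texas_Weather rows, it reproduces the history relation above.

def add_history(rows, i):
    # rows: list of (day, [features], label); the label is assumed to be
    # already shifted n days into the future, as in Texas_Weather^{0,n}.
    out = []
    for k in range(i, len(rows)):
        day, _, label = rows[k]
        feats = []
        for j in range(k - i, k + 1):      # days d-i, ..., d-1, d
            feats.extend(rows[j][1])
        out.append((day, feats, label))
    return out

texas_weather = [(1, [1.73, 2.47, -1.3], "Low"),  (2, [0.95, 3.13, 9.32], "High"),
                 (3, [3.57, 3.56, 4.29], "Low"),  (4, [0.24, 1.84, 1.51], "Low"),
                 (5, [0.0, 1.19, 3.77], "High"),  (6, [0.31, 4.72, -0.06], "High")]

for row in add_history(texas_weather, 2):
    print(row)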
The Texas_Weather_History relation uses the same set of feature measurements as the Texas_Weather relation because the data in the P_{d-2}, R_{d-2}, W_{d-2} and P_{d-1}, R_{d-1}, W_{d-1} columns are just shifted values of the P_d, R_d, W_d columns. However, when the Texas_Weather_History relation is used instead of the Texas_Weather relation to generate one of the linear classifiers, which is then represented and queried as in Section 3, there is a potential for improvement because each training example includes a more complete set of features. For example, if today's precipitation is a relevant feature in predicting the temperature a week ahead, then it is likely that yesterday's and the day before yesterday's precipitation are also relevant features in predicting the temperature a week ahead. That seems to be the case because the precipitation from any particular day tends to stay in the ground and affect the temperature for many more days. Moreover, since the average precipitation of three consecutive days varies less than the precipitation on a single day, the former may be more reliable than the latter for the prediction of the temperature a week ahead. These intuitions lead us to believe that the alternative representation is advantageous for classifying temporal data. Although this seems a simple idea, it has not yet been tried for decision trees or SVMs. In general, the alternative representation allows one to go back i number of days and look ahead n days, as outlined in Figure 1. The original representation is a representation that looks back 0 days and looks ahead the same number n of days. Therefore, the transformation from the basic to the alternative representation, which we denote by ⟹, can be described as:

Texas_Weather^{0,n} ⟹ Texas_Weather_History^{i,n}

where for any relation the first superscript is the number of days of historical data and the second superscript is the number of days predicted in the future.
5 Experimental Results and Discussion

5.1 Experiments with TCEQ Data
We experimentally compared the regular classification and the temporal classification methods. In some experiments both the regular and the temporal classification methods used SVMs, and in some other experiments both methods used decision trees. In particular, we used the SVM implementation from the LIBSVM [5] library and our implementation of the ID3-Interval algorithm described in Section 3.3. The experiments used the Texas Commission on Environmental Quality (TCEQ) database (available from http://archive.ics.uci.edu/ml), which recorded meteorological data between 1998 and 2004. From the TCEQ database, we used only the data for Houston, Texas and the following forty features and the class to predict.
1-24.  sr: hourly solar radiation measurements
25.    asr: average solar radiation
26.    ozone: ozone pollution (0 = no, 1 = yes)
27.    tb: base temperature where net ozone production begins
28-30. dew: dew point (at 850, 700 and 500 hPa)
31-33. ht: geopotential height (at 850, 700 and 500 hPa)
34-36. wind-SN: south-north wind speed component (at 850, 700 and 500 hPa)
37-39. wind-EW: east-west wind speed component (at 850, 700 and 500 hPa)
40.    precp: precipitation
41.    T: temperature class to predict

For sr, dew, ht, wind-SN, and wind-EW we use a subscript to indicate the hour or the hPa level. We also use the following procedure to predict the temperature T, where n is a training set size control parameter:

1. Normalize the dataset.
2. Randomly select 60 records from the dataset as a testing set.
3. Randomly select n percent of the remaining records as a training set.
4. Build an SVM, ID3, or ID3-Interval classification using the training data.
5. Test the accuracy of the classification on the testing set.
In step (1), the data was normalized by mapping, for each feature, the lowest value to −1 and the highest value to +1, and proportionally mapping all the other values into the interval [−1, +1]. This normalization was a precaution against any bias in the classifications. The normalization also allowed a clearer comparison of the SVM weights of the features. For testing the regular classifiers, we used the above procedure with TCEQ^{0,2}, which we obtained from the original TCEQ^{0,0} database by shifting the T column values backwards by two days. For testing the temporal classifiers, we made the transformation TCEQ^{0,2} ⟹ TCEQ^{15,2} as described in Section 4. Figure 6 reports the average results of repeating the above procedure twelve times for n equal to 5, 15, 25, ..., 95 using the original ID3 algorithm. Similarly, Figure 7 reports the average results using SVMs. The experiments show that adding the historical data significantly improves the temperature predictions using both the ID3 and the SVM algorithms. Moreover, the SVM algorithm performed better than the original ID3 algorithm, although the ID3-Interval algorithm (not shown) gave some improvements.
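A minimal sketch of the per-feature normalization used in step (1), mapping each feature linearly onto [−1, +1] (our own Python code, shown only to make the mapping explicit):

def normalize_column(values):
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

print(normalize_column([1.73, 0.95, 3.57, 0.24, 0.0, 0.31]))
# the lowest value (0.0) maps to -1, the highest value (3.57) maps to +1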
5.2 Experiments with Reduced TCEQ Data
Databases with a large number of features often include many noisy variables that do not contribute to the classification. The TCEQ database also appears to include many noisy variables because the SVM placed small weights on them. Since we normalized the data, the relative magnitudes of the SVM weights correspond to the relative importance of the features. In particular, the following numerical features had the highest weights:
Fig. 6. Comparison of regular and temporal classification using 40 features and ID3
Fig. 7. Comparison of regular and temporal classification using 40 features and SVMs
Fig. 8. Comparison of regular and temporal classification using 3 features and ID3
Fig. 9. Comparison of regular and temporal classification using 3 features and SVMs
25. asr: average solar radiation
35. wind-SN_700: south-north wind speed component at 700 hPa
40. precp: precipitation

How accurate a classification can be obtained using only these three selected features? These features have some interesting characteristics that make them better than other features. For example, wind-SN_700, the south-north wind speed component, is intuitively more important than wind-EW_700, the east-west wind speed component, in determining the temperature in Houston, Texas. In addition, the precipitation can stay in the ground for some time and affect the temperature over a longer period than most of the other features. Hence our hypothesis was that these three features can already give an accurate classification. To test this hypothesis, we conducted another set of experiments by applying the experimental procedure described in Section 5.1 to the reduced three-feature TCEQ database. The results of these experiments are shown in Figures 8 and 9. The accuracies of the classifiers based on only three features were surprisingly similar to the accuracies of the classifiers based on all forty features. In this experiment the temporal classification was again more accurate than the traditional classification.
6
Conclusions
There are some other remaining questions. For example, would non-linear temporal classifiers also be better than regular non-linear classifiers? In the future, we plan to experiment with other data sets and use non-linear classifiers in addition to SVMs and decision trees.

Acknowledgement. The first author was supported in part by a J. William Fulbright senior U.S. scholarship. The second author was supported in part by a Milton E. Mohr fellowship and Concordia University.
References 1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading (1995) 2. Anderson, S., Revesz, P.: Efficient maxcount and threshold operators of moving objects. Geoinformatica 13 (2009) 3. Bishop, B., Fischer, F., Keller, U., Steinmetz, N., Fuchs, C., Pressnig, M.: Integrated Rule Inference System (2008), www.iris-reasoner.org 4. Brodsky, A., Segal, V., Chen, J., Exarkhopoulo, P.: The CCUBE constraint objectoriented database system. Constraints 2(3-4), 245–277 (1997) 5. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines (2001), www.csie.ntu.edu.tw/~ cjlin/libsvm 6. Chomicki, J., Haesevoets, S., Kuijpers, B., Revesz, P.: Classes of spatiotemporal objects and their closure properties. Annals of Mathematics and Artificial Intelligence 39(4), 431–461 (2003)
7. Codd, E.F.: A relational model for large shared data banks. Communications of the ACM 13(6), 377–387 (1970) 8. Geist, I.: A framework for data mining and KDD. In: Proc. ACM Symposium on Applied Computing, pp. 508–513. ACM Press, New York (2002) 9. Grumbach, S., Rigaux, P., Segoufin, L.: The DEDALE system for complex spatial queries. In: Proc. ACM SIGMOD International Conference on Management of Data, pp. 213–224 (1998) 10. G¨ uting, R., Schneider, M.: Moving Objects Databases. Morgan Kaufmann, San Francisco (2005) 11. Johnson, T., Lakshmanan, L.V., Ng, R.T.: The 3W model and algebra for unified data mining, pp. 21–32 (2000) 12. Kanellakis, P.C., Kuper, G.M., Revesz, P.: Constraint query languages. Journal of Computer and System Sciences 51(1), 26–52 (1995) 13. Kuper, G.M., Libkin, L., Paredaens, J. (eds.): Constraint Databases. Springer, Heidelberg (2000) 14. Qin, Y., Obradovic, Z.: Efficient learning from massive spatial-temporal data through selective support vector propagation. In: 17th European Conference on Artificial Intelligence, pp. 526–530 (2006) 15. Quinlan, J.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986) 16. Ramakrishnan, R.: Database Management Systems. McGraw-Hill, New York (1998) 17. Revesz, P.: Introduction to Constraint Databases. Springer, Heidelberg (2002) 18. Revesz, P., Chen, R., Kanjamala, P., Li, Y., Liu, Y., Wang, Y.: The MLPQ/GIS constraint database system. In: Proc. ACM SIGMOD International Conference on Management of Data (2000) 19. Revesz, P., Triplet, T.: Reclassification of linearly classified data using constraint databases. In: 12th East European Conference on Advances of Databases and Information Systems, pp. 231–245 (2008) 20. Revesz, P., Wu, S.: Spatiotemporal reasoning about epidemiological data. Artificial Intelligence in Medicine 38(2), 157–170 (2006) 21. Rigaux, P., Scholl, M., Voisard, A.: Introduction to Spatial Databases: Applications to GIS. Morgan Kaufmann, San Francisco (2000) 22. Tseng, V.S., Lee, C.-H.: Effective temporal data classification by integrating sequential pattern mining and probabilistic induction. Expert Systems with Applications 36(5), 9524–9532 (2009) 23. Ullman, J.D.: Principles of Database and Knowledge-Base Systems. Computer Science Press (1989) 24. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
SPAX – PAX with Super-Pages Daniel Bößwetter Freie Universität Berlin Institute of Computer Science Database and Information Systems Group Takustraße 9, 14195 Berlin, Germany [email protected] http://www.inf.fu-berlin.de/en/groups/ag-db/index.html
Abstract. Much has been written about the pros and cons of column-orientation as a means to speed up read-mostly analytic workloads in relational databases. In this paper we try to dissect the primitive mechanisms of a database that help express the coherence of tuples and present a novel way of organizing relational data in order to exploit the advantages of both the row-oriented and the column-oriented world. As we go, we break with yet another bad habit of databases, namely the equal granularity of reads and writes, which leads us to the introduction of consecutive clusters of disk pages called super-pages. Keywords: physical database design, column-oriented databases, compression in databases, read-optimized databases.
1
Introduction
Following a long tradition, most relational database management systems, commercial as well as open source, store tuples as one unit in the pages of the underlying storage [7]. Besides being the most obvious approach, this is advantageous when tuples are written, because only a single page has to be modified. On the other hand, when reading many tuples only partly (i.e., only some of the attributes), which is common in data warehouse applications, these row-stores are forced to read whole tuples from disk to RAM and from RAM to the CPU cache, even if only parts of them are needed. This results in bad cache exploitation and longer query execution times. This insight led to the invention of column-oriented databases (a.k.a. column-stores) that split tables into one relation (or file) per attribute, thus preventing the transfer of unneeded data. Furthermore, values of a single domain can be compressed better than tuples with several differently typed attributes, which is known to have an even stronger impact on query performance than the simple omission of read operations [2,12]. The downside of column-stores, however, is longer write delays and the increased complexity of reassembling tuples. The reason for the former is obvious: when different attributes are spread among different disk pages, in the worst case all of these pages must be forced to disk when a single tuple is updated
or inserted. Compression further complicates random updates since an update might change the compression ratio, thus necessitating a reorganization of multiple pages. In this paper we propose a novel way of organizing data on persistent storage in order to maximize read performance similarly to column-stores while minimizing these disadvantages. Section 2 discusses related work, Section 3 gives a theoretical introduction to how tuples are held together in databases, and Section 4 introduces SPAX as a new way of data placement, which is evaluated experimentally in Section 5. We give an outlook on further research in Section 6 and conclude with Section 7.
2
Related Work
Although the idea of column-orientation can be traced back to the 1970s (see [6] for a list of papers), the academic and commercial interest in it has increased only in the recent decade. One reason is probably that only recently the idea of optimizing database systems for special purposes (like transaction processing, data analysis, ...) became popular, whereas previously a database system was supposed to be able to perform diverse tasks at once (see [17,18,13]). Since column-orientation is disadvantageous for writing, it is not an option for databases with a potentially high write workload. The major representatives of relational column-oriented databases in academia currently are MonetDB [5,4] from the CWI in Amsterdam, whose development started in the mid-1990s, and C-Store from MIT, which was developed as part of Daniel Abadi's dissertation until 2008 [17,2,1,10]. Commercial column-stores include the Vertica Analytic Database¹, which is based on C-Store, and Sybase IQ from Sybase Inc. [13]. We describe these systems in some detail in this chapter in order to build a taxonomy of physical data models in the next chapter. LucidDB² is an open source column-store which we would like to mention without further discussing it. There are a number of possible physical data models which might be described as "column-oriented". One of the first formally described data models was the decomposition storage model (DSM) [6], which splits tables into binary relations each consisting of a surrogate and a value, with access paths for both of these. Thus each column of a conceptual relation is inherently indexed, which makes the DSM especially useful for ad-hoc queries. The original paper made no assumptions about how these structures are brought to stable media. As described in [14], ordinary B+-trees can be used for storing these binary relations, but this leads to suboptimal space utilization and decreased tuple assembly performance, because now a B+-tree has to be traversed for every attribute value. [14] and [9] propose lightweight B+-trees especially for this purpose. As pointed out by [6], compression is hard to accomplish with the DSM, because values from different domains still have to be stored consecutively.

1 http://www.vertica.com
2 http://www.luciddb.org
MonetDB [4], which was originally designed as a main-memory database [5], uses the DSM with hashing to speed up searching its so-called binary association tables, thus preventing the problems of B+-trees explained above. Compression is generally not employed, although Boncz et al. included compression in MonetDB as part of the X100 project [11]. Other characteristics of MonetDB include its non-standard architecture: it is comprised of a database kernel which is optimized to handle binary associations and several frontends that map diverse logical data models and query languages onto columns (e.g., relations and SQL, XML and XQuery, objects and OQL). Since several non-relational data models are supported, relational operators are not the lowest level of execution. Instead, an assembly language that describes simple operations on columns is provided as the interface to the kernel, and frontends are supposed to compile and optimize their respective query language into this assembly language. C-Store [2] uses a rather different storage model. It decomposes tables into overlapping subsets of columns called "projections", with each projection possibly sorted on different attributes. Even attributes from different relations can be stored in a projection as long as there is a 1:1 relationship between both relations. Projections are split into individual columns which are then compressed and stored to disk. C-Store uses "stitching-by-position" (see Section 3.2) when reassembling tuples from a projection with a covering set of columns, but it employs join indexes in the form of B+-trees when other projections need to be accessed³. The process of transforming the logical data model (relations) into the physical model (projections and compression types) is data- and workload-dependent, i.e., samples of the data to be stored and the queries to be executed are required in order to find a good data layout (either manually or automated). This makes C-Store less suitable for ad-hoc queries or changing data or workload characteristics, but it seems to be adequate for real-world workloads, since the Vertica Analytic Database (C-Store's commercial descendant) proceeds like this. Both C-Store and MonetDB's SQL frontend buffer write operations in a different place and merge them into the original data on a regular basis. C-Store employs an uncompressed write-store and MonetDB uses differential files to store differences. No read-optimized data layout is known which updates data directly. Another important achievement relating to read-optimized databases is PAX [3,12], which might be described as a sub-page column-store, because it stores the values of the same attribute of different tuples together, but all attributes of the same tuple are still in the same disk page. We further discuss PAX in Section 4.1. There has been a proposal similar to SPAX called the Multi-Resolution Block Storage Model [19], but its focus was on sequential scanning on mechanical hard drives while ours is a mixture of random accesses and (in the future) random updates. Furthermore, MBSM is optimized towards uncompressed fixed-width data, while we aim at compression, which might result in variable-width fields. [16] uses a similar scheme for Flash memory.

3 Although [2] says that this is prevented by maintaining a projection for each table that completely covers it.
3 A Taxonomy of Data Coherence
One of the major tasks implicitly performed by every relational database management system is to determine which attributes belong together. This is trivial in traditional, row-based systems, but column-oriented databases offer a variety of mechanisms. In the following subsections we dissect these techniques into orthogonal basic steps and enumerate their respective pros and cons.

3.1 Collocation
The most obvious possibility to express coherence among multiple attributes is to collocate them, i.e., to put them into consecutive areas of the storage. In traditional row-stores, the attributes of a tuple are placed consecutively into pages; thus assembling tuples is not necessary and projecting a subset of attributes is trivial. Variable-length attributes and NULL values are not a problem, since there are several ways to store them (see, e.g., [7]). Column-stores can use Collocation as a helper technique, e.g., when relations are decomposed into surrogate/value pairs as in the DSM. The distance between the attributes of a tuple is merely determined by the schema of the relation. The advantage of Collocation is the aforementioned ease of use; the disadvantages are a lack of compressibility and the fact that all collocated columns usually have to cross boundaries of the storage hierarchy at once. These were the primary reasons for introducing column-stores, as noted above.
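As a toy illustration of Collocation (our own sketch, not tied to any of the systems above), the following snippet lays the attributes of one tuple out consecutively in a byte buffer, the way a row-store page slot would; the field names and widths are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Collocation: all attributes of one tuple are stored consecutively,
// so reading the tuple back needs no reassembly across storage units.
public class CollocationSketch {
    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(64);

        // Write one tuple: INTEGER key, INTEGER quantity, CHAR(10) shipmode.
        page.putInt(4711);
        page.putInt(17);
        page.put(padTo("AIR", 10));

        // Reading the same tuple back is a single sequential scan of the slot.
        page.flip();
        int key = page.getInt();
        int quantity = page.getInt();
        byte[] mode = new byte[10];
        page.get(mode);
        System.out.println(key + " " + quantity + " "
                + new String(mode, StandardCharsets.US_ASCII).trim());
    }

    // Pad a string to a fixed width, as a CHAR(n) column would require.
    private static byte[] padTo(String s, int width) {
        byte[] out = new byte[width];
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, width));
        return out;
    }
}
```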
3.2 Position
When multiple columns are sorted by a common key and stored separately, it is possible to reconstruct tuples from the attributes at equal positions in their respective columns (the first value in column A belongs to the first value in column B, and so on). The distance between the attributes of a tuple is usually larger than with Collocation, because attributes are separated by a boundary between the units of the underlying storage (either between CPU cache lines as in PAX or between disk pages as in C-Store). There are two cases to consider:
1. Each attribute value requires an equal number of bits, even if compression is applied.
2. Values have a variable width, either by their nature (like strings) or as a result of compression (e.g., RLE or Huffman encoding, see Section 6).
In the first case, finding the n-th entry of a column amounts to adding (n − 1) · l to the column's base address, where l is the length of an attribute, which must be known and constant (this process is sometimes called array projection; a small sketch of it and of stitching-by-position follows the list below). One advantage of storing values by themselves, without organizational data in between, is the potential for compression, because runs of multiple values can be compressed. The disadvantage is that projection as described above does not work for values of variable length or run-length-encoded data. Again, there are two alternatives to solve the problem:
– Data is "stitched" together by scanning several columns in turn and forming tuples from the next value of each (as done in C-Store).
– Variable-length attributes are turned into fixed-length attributes:
  • Padding each value to the maximum possible length for this column. This is usually not an option, because it contradicts compression.
  • Creating a dictionary that translates fixed-length keys to variable-length values. Note that for high-cardinality columns this translation step amounts to Association (see Section 3.3), because a dictionary usually involves some kind of search structure, e.g., B+-trees with integer keys as described in [9].
  • Introducing a slot table with fixed-width offsets to the actual data. This is a special form of a dictionary which requires only a constant number of page references (one for the dictionary and one for the actual data).
Furthermore, storing "naked" values in files makes it hard to insert into or update the middle of a column without complete reorganization. A data structure which allows insertions (or updates that modify the space requirement) would again result in B+-trees and thus in the performance characteristics described in Section 3.3. C-Store, which relies heavily on stitching-by-position [2], does not update the original data but keeps an uncompressed write-store which is transformed into the read-optimized store on a regular basis.
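The array-projection arithmetic and stitching-by-position referred to above can be illustrated in a few lines; this is our own sketch with made-up column contents, not code from C-Store.

```java
// Position: columns sorted by a common key are stored separately; tuples are
// reconstructed from values at equal positions ("stitching by position").
public class PositionSketch {
    // Array projection for fixed-width values: the n-th value (1-based) of a
    // column starts at base + (n - 1) * l, where l is the fixed value length.
    static long offsetOfNthValue(long base, int n, int l) {
        return base + (long) (n - 1) * l;
    }

    public static void main(String[] args) {
        // Two columns of the same toy relation, aligned by position.
        int[] quantity = {17, 36, 8, 28};
        String[] shipmode = {"AIR", "MAIL", "SHIP", "RAIL"};

        // Stitch tuples back together by scanning both columns in lockstep.
        for (int pos = 0; pos < quantity.length; pos++) {
            System.out.println("tuple " + pos + ": (" + quantity[pos] + ", " + shipmode[pos] + ")");
        }

        // Projection arithmetic: the 3rd value of a 4-byte-wide column stored
        // at base address 0 begins at byte offset 8.
        System.out.println("offset of 3rd value: " + offsetOfNthValue(0, 3, 4));
    }
}
```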
3.3 Association
When each attribute of a tuple is associated with a unique identifier (be it a line number or a primary key), association techniques can be used to express coherence. Technically, this can be realized, e.g., through B+-trees or by hashing. MonetDB uses hashing to implement the decomposed storage model (DSM), and C-Store uses B+-trees as join indexes between projections. The distance between attributes is similar to Position (see above), i.e., the attributes of a tuple fall into different disk blocks, but Association is usually not employed at the higher (faster and smaller) levels of the memory hierarchy. The advantage is that variable-length attributes can be used without restrictions (in contrast to Position, see Section 3.2), but at the cost of speed: the disadvantage is that usually more page references are required, either for traversing a B+-tree or for following bucket chains when hashing is used. B+-tree access performance is logarithmic in the size of the data, leading to 3–4 hops from root to leaf in realistic scenarios, and hashing ranges between constant and linear complexity.
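The following toy sketch (ours, with hypothetical data) expresses the same coherence through Association: each column is an ordered map from a surrogate to a value, standing in for a B+-tree, so reconstructing a tuple costs one lookup per attribute.

```java
import java.util.TreeMap;

// Association: attribute values are tied to a tuple identifier through a
// search structure; a TreeMap plays the role of a per-column B+-tree here.
public class AssociationSketch {
    public static void main(String[] args) {
        TreeMap<Integer, Integer> quantity = new TreeMap<>();
        TreeMap<Integer, String> comment = new TreeMap<>();

        // Values may have arbitrary (variable) length; positions need not match.
        quantity.put(4711, 17);
        quantity.put(4712, 36);
        comment.put(4711, "deliver in person");
        comment.put(4712, "none");

        // Reconstructing tuple 4712 requires one lookup (tree descent) per
        // attribute -- this is the extra cost compared to Position.
        int rowId = 4712;
        System.out.println("(" + quantity.get(rowId) + ", " + comment.get(rowId) + ")");
    }
}
```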
4 The SPAX Way
As seen above, it is a non-trivial task to store data in columns without restrictions on updates or reassembly performance. Collocation is disadvantageous for cache utilization and compression, and Association requires slow search structures. As
proposed by [2], stitching-by-position is the fastest way to reconstruct tuples, but it comes with trade-offs concerning updates to the data when not applied carefully. In this section, we take a look at PAX, an alternative to column-orientation, and enhance it to SPAX, a more general framework of data placement strategies in which PAX is a special case.
4.1 PAX Revisited
As an intermediary between column-stores and row-stores, Ailamaki et al. introduced a data placement scheme called Partitioning Attributes Across (PAX) [3]. While column-stores split the attributes of a relation into different disk pages, PAX splits them only into different "minipages" inside the same page by grouping the contents of the page by attribute (see Figure 1). While the database system remains unchanged above the page level (including access paths like B+-trees), the organization inside pages is changed, resulting in better CPU cache utilization and compressibility comparable to that of column-stores. In terms of our taxonomy from Section 3, PAX replaces Collocation inside disk pages by Position. PAX has disk performance comparable to the n-ary storage model (NSM), at least when uncompressed, but it outperforms NSM in terms of CPU cache utilization. This releases the CPU for other tasks and increases the throughput of main-memory databases. In the following section we generalize this concept and enhance PAX to SPAX by applying reconstruction by Position across page boundaries.

Fig. 1. Structure of a PAX page [3]
4.2 Introducing SPAX
An implicit assumption made by all database systems as well as by PAX is the following: disk pages are an organizational unit (including, e.g., a header) and are
the granularity for reading as well as for writing. This is a heritage from transaction processing, because exactly this border between disk and RAM in the memory hierarchy must be crossed in order to guarantee durability (the "D" in ACID), which is usually accomplished by a write-ahead log that is referenced by the page headers of the data pages. No such mechanism applies to the boundary between CPU cache and RAM, where applications have only indirect control over I/O operations. Since we consider read-mostly environments here, we propose to apply the principles of PAX to other layers of the memory hierarchy as well: a PAX for super-pages, SPAX for short, which clusters several disk pages into a super-page that is the organizational unit and is written at once, but need never be read completely. All attributes of a number of tuples are contained inside the super-page and, as in PAX, they are grouped by attribute instead of being grouped into tuples. In contrast to other databases, however, only the part of a super-page which is actually required will be read into RAM.

A SPAX file is a file of consecutive super-pages, each consisting of an equal number of disk pages, which are the units of read operations. Figure 2 shows the layout of an example super-page consisting of 5 pages and containing all attributes of 17 tuples. Since updates and inserts are to be allowed and since we allow variable-sized attributes, we need one B+-tree per SPAX file, keyed by an integer surrogate (the line number), in order to find the corresponding super-page. Otherwise we would not be able to find a tuple by its row number, because every super-page can contain a different number of tuples and super-pages are not guaranteed to be in insert order. Inside the super-page, simple arithmetic can be used to find the required fixed-width attributes. A slot table at the end of the super-page is used to access the variable-length attributes, which we store behind all fixed-size attributes.

The biggest difference to PAX is that SPAX does not maintain an offset table in the super-page header, because this would require loading the first page in order to find data in other pages (we actually experimented with such a header but dismissed it due to the resulting lack of predictability). Instead, we store the number of tuples contained in a super-page in the corresponding B+-tree node. When traversing the tree with a search key k, the search ends up with the number of the containing super-page, the row number of its first tuple, and the total number of tuples contained in it. With these numbers, we can compute the offset of a tuple inside an attribute array as well as the size and position of these arrays, given the schema of the relation under concern (a sketch of this computation follows the list of advantages below).

Fig. 2. Layout of a SPAX super-page (example: superblock 4711 with 17 tuples starting at tuple 54321; 17 values each for attributes A and B, the variable-length attributes, and the slot table are spread over pages 1–5)

Some obvious advantages come to mind, some of which will be investigated in the following subsections:
– B+-trees become more shallow because there are fewer but larger leaf nodes (see Section 4.3).
– Only one tree per relation has to be traversed instead of one tree per attribute. This leads to faster tuple reconstruction compared to DSM.
– Main memory is used more efficiently than with PAX, because attributes which are not used do not waste space in the buffer pool (PAX targets main-memory databases and thus remains a legitimate approach).
– Compression might become more effective because longer runs of data can be compressed. Compression has not yet been considered at all, as discussed in Section 6.
– While SPAX resembles the reading techniques of C-Store, it might have better write performance, because a whole super-page can be written in one write operation, which is supposedly much faster than several smaller writes to different files, at least on mechanical hard disks (flash memory will be considered in future work).
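As promised above, the following sketch illustrates the offset arithmetic inside a super-page. It is our own simplification with made-up field widths, assumes only fixed-width attributes laid out one attribute array after another, and ignores the slot table for variable-length fields.

```java
// Locating a fixed-width attribute value inside a SPAX super-page, given the
// information returned by the B+-tree lookup (row number of the first tuple,
// number of tuples in the super-page) and the schema of the relation.
public class SpaxOffsetSketch {
    // Fixed widths (in bytes) of the attributes, in storage order (illustrative).
    static final int[] WIDTHS = {4, 4, 1, 8};

    // Byte offset of attribute `attr` of global row `rowNumber`, relative to
    // the start of the super-page. The super-page stores one contiguous array
    // per attribute, each holding `tuplesInPage` values.
    static long offsetInSuperPage(long rowNumber, long firstRowInPage,
                                  int tuplesInPage, int attr) {
        long position = rowNumber - firstRowInPage;        // 0-based index inside page
        long arrayStart = 0;
        for (int a = 0; a < attr; a++) {
            arrayStart += (long) WIDTHS[a] * tuplesInPage;  // skip preceding arrays
        }
        return arrayStart + position * WIDTHS[attr];
    }

    public static void main(String[] args) {
        // Example in the spirit of Fig. 2: a superblock holding 17 tuples
        // starting at tuple 54321. Arrays 0 and 1 occupy (4 + 4) * 17 = 136
        // bytes; position 9 * width 1 = 9, so the expected offset is 145.
        long offset = offsetInSuperPage(54330, 54321, 17, 2);
        System.out.println("attribute 2 of row 54330 starts at byte " + offset);
    }
}
```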
4.3 Influence on B+-Tree Height
As known from the literature [7], the height of a B+-tree grows logarithmically with the number of leaf nodes $L$, with the fan-out $F^*$ being the base of the logarithm:

$$h = \log_{F^*} L \qquad (1)$$

If we cluster $f$ leaf pages into one super-page, the height of the B+-tree decreases as follows:

$$h' = \log_{F^*}\!\left(\frac{L}{f}\right) = \log_{F^*}(L) - \log_{F^*}(f) = \frac{\ln(L)}{\ln(F^*)} - \frac{\ln(f)}{\ln(F^*)} \qquad (2)$$

To estimate the time saved when traversing the B+-tree, we take the quotient of the decrease in height and the original height:

$$q = \frac{\ln(f)/\ln(F^*)}{\ln(L)/\ln(F^*)} = \frac{\ln(f)}{\ln(L)} \qquad (3)$$
As an example, when the filesystem's page size is 4 KB, the size of our relation is 80 MB, and the size of a cluster is $f = 137$ pages, this leads to a 49% decrease in the time spent in the tree, as will be shown experimentally below. When 8 GB of data are used, the B+-tree height still decreases by 33%.
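A quick sanity check of equation (3) with these numbers, under our own assumption that the number of leaf nodes $L$ is approximated by the number of 4 KB data pages of the relation (the paper does not state $L$ explicitly):

```java
// Plugging the example numbers into equation (3): q = ln(f) / ln(L), where L
// is approximated by the number of 4 KB pages holding the relation.
public class TreeHeightCheck {
    static double q(long relationBytes, long pageBytes, long f) {
        double leaves = (double) relationBytes / pageBytes;
        return Math.log(f) / Math.log(leaves);
    }

    public static void main(String[] args) {
        long page = 4L * 1024;
        long f = 137;
        // 80 MB relation: q is roughly 0.5, close to the reported 49% decrease.
        System.out.printf("80 MB: q = %.2f%n", q(80L * 1024 * 1024, page, f));
        // 8 GB relation: q is roughly 0.34, close to the reported 33%.
        System.out.printf("8 GB:  q = %.2f%n", q(8L * 1024 * 1024 * 1024, page, f));
    }
}
```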
5 Experimental Analysis

5.1 Experimental Setup
We try to demonstrate the usefulness of our approach experimentally. We implemented SPAX in Java and used test data from the TPC-H benchmark to measure performance. We use the Linux operating system's filesystem and buffer manager instead of developing our own. This implies that we have a fixed page size of 4096 bytes (4 KB) and that the size of the buffer pool spans almost all of the available RAM (1 GB). The buffer is cleared before every test run, but pages once read remain in the buffer until the end of the run. Linux supports a filesystem read-ahead function which reads blocks ahead of the current read operation in the order of appearance in the filesystem. Disk drives tend to read blocks ahead in the order of the disk geometry. Both read-ahead features were turned off for our experiments.

In contrast to PAX, where the sizes of disk pages and cache lines are fixed, we have to choose how many filesystem pages make up one super-page (the block factor). Since it is our goal to divide attributes by page boundaries, we estimate the optimal block factor for the lineitem table as follows: the sum of the lengths of all fields is approximately 137 bytes (with only one variable-length attribute at the end, see Table 1) and the smallest attribute is 1 byte long. If we take clusters of 137 pages, most pages contain data for no more than one attribute (a small sketch of this estimate follows Table 1). The only uncertainty is that variable-length attributes might contain fewer than the maximum number of characters. We ignore this because employing statistics about the actual lengths would contradict our ad hoc approach, and smaller deviations should not have a strong impact on the overall result.

Table 1. Attributes of the lineitem table in SPAX

Attribute name     SQL type         bytes in SPAX
L_ORDERKEY         INTEGER           4
L_PARTKEY          INTEGER           4
L_SUPPKEY          INTEGER           4
L_LINENUMBER       INTEGER           4
L_QUANTITY         DECIMAL(15,2)     4
L_EXTENDEDPRICE    DECIMAL(15,2)     4
L_DISCOUNT         DECIMAL(15,2)     4
L_TAX              DECIMAL(15,2)     4
L_RETURNFLAG       CHAR(1)           1
L_LINESTATUS       CHAR(1)           1
L_SHIPDATE         DATE              8
L_COMMITDATE       DATE              8
L_RECEIPTDATE      DATE              8
L_SHIPINSTRUCT     CHAR(25)         25
L_SHIPMODE         CHAR(10)         10
L_COMMENT          VARCHAR(44)      44
sum                                 137
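The block-factor estimate referenced above can be written down in a few lines; this is our own sketch, hard-coding the widths from Table 1 rather than reading a real schema.

```java
import java.util.Arrays;

// Estimating the SPAX block factor for the lineitem table: choose enough
// filesystem pages per super-page that even the narrowest attribute array
// fills (roughly) at least one page of its own.
public class BlockFactorSketch {
    public static void main(String[] args) {
        // Fixed widths in bytes, taken from Table 1 (VARCHAR counted at maximum).
        int[] widths = {4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 8, 8, 8, 25, 10, 44};

        int rowWidth = Arrays.stream(widths).sum();            // ~137 bytes per tuple
        int smallest = Arrays.stream(widths).min().getAsInt(); // 1 byte

        // With b pages per super-page, a super-page holds roughly
        // b * pageSize / rowWidth tuples, so the 1-byte attribute occupies
        // about one page on its own when b equals rowWidth / smallest.
        int blockFactor = rowWidth / smallest;
        System.out.println("row width = " + rowWidth + " bytes, block factor = " + blockFactor);
    }
}
```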
The system model is as follows: contrary to common belief, we found that some queries of TPC-H involve only a very small percentage of a relation's tuples, which makes random accesses preferable over sequential scans. We simulate this by a pseudo-random sequence of line numbers which we translate into tuple IDs through a B+-tree index that is constructed by the import process. We performed tests with varying selectivities (growing exponentially from 0.001% to 50%) and projectivities (1–16 of 16 attributes). We compare SPAX with a block factor of 137 to SPAX with a block factor of 1 (which is actually PAX, because all attributes are in the same page). This raises the question whether SPAX is any better than ordinary PAX with huge pages (137 filesystem pages), so we tested this as well. Since this leads to two slightly different implementations, both were also compared with a page size of 4096 bytes (a SPAX block factor of 1), with the expectation that both behave identically.
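A minimal sketch of such a workload driver (our own construction; the selectivity values, seed, and stub lookup are illustrative and not the paper's actual harness) could look as follows:

```java
import java.util.Random;

// Generating a pseudo-random access workload: for a given selectivity, draw
// that fraction of line numbers at random and look each one up individually,
// simulating point accesses instead of a sequential scan.
public class WorkloadSketch {
    public static void main(String[] args) {
        long totalTuples = 600_000;                 // lineitem at scaling factor 0.1
        double[] selectivities = {0.00001, 0.001, 0.1, 0.5};
        Random rnd = new Random(42);                // fixed seed for repeatability

        for (double sel : selectivities) {
            long accesses = (long) (totalTuples * sel);
            for (long i = 0; i < accesses; i++) {
                long lineNumber = (long) (rnd.nextDouble() * totalTuples);
                // In the real system this line number would be translated into
                // a tuple ID via the B+-tree and the projected attributes read.
                fetch(lineNumber);
            }
            System.out.println("selectivity " + sel + ": " + accesses + " random accesses");
        }
    }

    // Placeholder for the actual lookup; intentionally a no-op in this sketch.
    static void fetch(long lineNumber) { }
}
```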
5.2 Time Measurement
We divide the process of accessing a tuple into the following steps, each of which is measured separately:
– B+-tree traversal is the time required to look up a given line number in the B+-tree, resulting in the (super-)page number, the number of tuples in this (super-)page, and the row number of the first tuple in this (super-)page.
– Seeking the page does nothing in SPAX (at least nothing that requires substantial amounts of time), but it reads the whole page in PAX.
– Reading the attributes is the time it takes to actually read the desired data from the current (super-)page. This is very fast for PAX because the whole page is already in memory, but it takes some time for SPAX because nothing has been read before.
The total time, which is approximately the sum of these, is also measured.
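A per-phase timer in the spirit of this breakdown might look like the following sketch (ours; the three phase methods are placeholders, not the paper's actual code):

```java
// Measuring the three phases of a tuple access separately, plus the total.
// The phase bodies are stubs; in the real system they would perform the
// B+-tree lookup, the page seek/read, and the attribute reads.
public class PhaseTimerSketch {
    static long treeNanos, seekNanos, readNanos;

    static void accessTuple(long lineNumber) {
        long t0 = System.nanoTime();
        lookupInTree(lineNumber);          // B+-tree traversal
        long t1 = System.nanoTime();
        seekSuperPage();                   // no-op for SPAX, full page read for PAX
        long t2 = System.nanoTime();
        readAttributes();                  // fetch the projected attributes
        long t3 = System.nanoTime();

        treeNanos += t1 - t0;
        seekNanos += t2 - t1;
        readNanos += t3 - t2;
    }

    static void lookupInTree(long lineNumber) { }
    static void seekSuperPage() { }
    static void readAttributes() { }

    public static void main(String[] args) {
        for (long i = 0; i < 1_000; i++) accessTuple(i);
        long total = treeNanos + seekNanos + readNanos;
        System.out.printf("tree=%dns seek=%dns read=%dns total=%dns%n",
                treeNanos, seekNanos, readNanos, total);
    }
}
```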
5.3 Results
Figures 3, 4, 5 and 6 all show 4 curves for the 4 variants that were measured (note the logarithmic scale):
– PAX with a block factor of 1 (PAX 4k)
– SPAX with a block factor of 1 (SPAX 4k), which corresponds to PAX 4k
– PAX with a block factor of 137 (PAX 548k)
– SPAX with a block factor of 137 (SPAX 548k)
The fact that the values for PAX and SPAX with 4k (super-)pages (4k being the page size of the underlying filesystem) are identical in Figure 3 shows that both implementations behave very similarly (as expected) and may thus be compared. Most tests were conducted with a TPC-H scaling factor of 0.1, which results in 80 MB of lineitem data. Larger amounts (scaling factor 1) were used as well, with
very similar results, but only SPAX 548k finished these tests within a moderate amount of time. Since 64 tests are necessary (16 possible projectivities and 4 possible selectivities) and one test with PAX 548k takes approximately one hour, the whole test run would have taken almost three days for a scaling factor of 1, so we decided to present the results from the smaller (0.1) datasets.

Figure 3 shows the total time required to retrieve a given percentage of tuples from the relation (averaged over all projectivities, which we discuss below). Both implementations show an almost constant behaviour with 4k pages, which can be explained as follows: we used 80 MB of data (a TPC-H scaling factor of 0.1) with 4k pages, i.e., approximately 20,000 pages. There is a very high likelihood that we have randomly read all of these pages into the cache after having read more than 3% of the 600,000 tuples (20,000 tuples) in the table. PAX 548k is not competitive with any of the other approaches and grows linearly. The SPAX 548k curve, on the other hand, increases only slightly, ending at about 1/5 of the time of (S)PAX 4k for reading 50% of the tuples.

Fig. 3. Total time by number of tuples (time in ms over selectivity, i.e., the fraction of tuples read)

Figure 4 shows the reason for the bad behaviour of PAX with large page sizes: with an increasing number of tuples to be read, the time for reading large pages becomes the dominating factor. This figure contains only zero values for the SPAX implementation because SPAX actually does nothing in the seek phase, as described above.

Figure 5 shows the time required for traversing the B+-tree. As estimated in Section 4.3, SPAX needs an order of magnitude less time than PAX. Even better: instead of the estimated 49% decrease, we save 80% of the time on the B+-tree traversal, which is probably due to caching. The reason that PAX 548k rises more steeply than SPAX 548k is probably that PAX wastes more memory and thus the B+-tree is less likely to be cached.
Fig. 4. Time for seeking a new super-page by number of tuples
Fig. 5. Time for B+-tree traversal by number of tuples
Figure 6 shows the time required to actually read the attribute data after the (super-)page has been opened. Here the PAX implementation obviously performs better, because it has already read all of the data before, whereas SPAX now has to read the requested data.
Fig. 6. Time for reading attribute data by number of tuples
5.4 The Impact of Projectivity
The times presented above are all averages over all projectivities. Figure 7 shows the impact of projectivity for the four variants: only in SPAX 548k does retrieving fewer attributes improve query performance. The PAX-based techniques require a constant amount of time, no matter how many attributes are actually retrieved.

Fig. 7. Average time by number of attributes
6 Future Research: Compression
One key aspect of column-oriented databases is compression [2]. Until now, however, we have not taken into account the effects of compression on SPAX, namely the problem of projecting values from compressed arrays. SPAX relies heavily on the fact that most columns have a fixed width and turns VARCHARs into fixed-width columns by maintaining a slot table. When compression is introduced, however, the simple pointer arithmetic used for array projection no longer works. The following compression types might be considered:
– RLE (run-length encoding) compresses runs of equal values into a pair (v, n) with the value v and the number n of times the value occurs [1].
– FOR (frame of reference) can be applied to numerical data stored on a paged medium by subtracting the smallest value in a page from all the other values in the same page, so that fewer bits are required for storing them. The smallest value is the frame of reference and is stored in a page header [8].
– DELTA compression works for numerical data and stores only a base value (perhaps one per page) and the differences between consecutive values.
– DICT (dictionary compression) replaces values by smaller codes which can be looked up in a dictionary [1].
SPAX depends on calculations with the widths of columns, as explained in Section 4. If compression is to be employed, it must still be possible to determine how many attribute values fit into a block of data. Header information (e.g., the base value of a DELTA compression) further complicates this process. At first sight, FOR, DELTA and DICT fulfill this requirement, because they store a value of fixed width per attribute value. This makes them eligible candidates for compression in SPAX, although DELTA prohibits projection, since every value depends on its predecessor. DICT can only be used with codes of constant width, which renders Huffman encoding useless for our purpose. RLE, however, can store an indefinite number of values in a given chunk of data and hence cannot be used off-hand. We will investigate this subject in our future work.
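As an illustration of why FOR keeps positional access intact (our own sketch, not tied to any particular system), the encoded values below remain fixed width, so the n-th value can still be located by plain offset arithmetic:

```java
// Frame-of-reference (FOR) compression of a page of integers: store the page
// minimum once and keep only fixed-width deltas, so array projection by
// position still works on the compressed representation.
public class ForCompressionSketch {
    public static void main(String[] args) {
        int[] page = {100023, 100017, 100042, 100030};

        // Encode: subtract the frame of reference (the page minimum).
        int frame = Integer.MAX_VALUE;
        for (int v : page) frame = Math.min(frame, v);
        short[] encoded = new short[page.length];      // fewer bits per value
        for (int i = 0; i < page.length; i++) {
            encoded[i] = (short) (page[i] - frame);
        }

        // Positional access on compressed data: the n-th value is simply
        // frame + encoded[n]; no scan of preceding values is needed.
        int n = 2;
        System.out.println("value at position " + n + ": " + (frame + encoded[n]));
    }
}
```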
7 Conclusion
We introduced SPAX, a novel kind of data placement which resembles PAX at a different layer of the memory hierarchy. SPAX aims at a compromise between Association and Position, two of the techniques we identified for expressing tuple coherence in databases. We showed experimentally that SPAX outperforms PAX for disk-based read-only workloads by an order of magnitude. There are, however, some more interesting points to be investigated:
– Write performance (bulk load/insert/update) has to be evaluated in comparison to other data placement strategies.
– Read performance must be compared to column-stores (e.g., by implementing DSM on (S)PAX).
– Compression has to be taken into account (see Section 6).
– Flash memory needs to be considered, since the SPAX model with large super-pages to be written and smaller blocks to be read fits neatly into the model of big erase units and smaller pages in flash memory, as described, e.g., in [15].
References
1. Abadi, D., Madden, S., Ferreira, M.: Integrating compression and execution in column-oriented database systems. In: SIGMOD 2006: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 671–682. ACM, New York (2006)
2. Abadi, D.J.: Query Execution in Column-Oriented Database Systems. PhD thesis, Massachusetts Institute of Technology (February 2008)
3. Ailamaki, A., DeWitt, D.J., Hill, M.D., Skounakis, M.: Weaving relations for cache performance. In: VLDB 2001: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 169–180. Morgan Kaufmann Publishers Inc., San Francisco (2001)
4. Boncz, P.A.: Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. PhD thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands (May 2002)
5. Boncz, P.A., Kersten, M.L.: Monet: An Impressionist Sketch of an Advanced Database System. In: Proceedings of the Basque International Workshop on Information Technology, San Sebastian, Spain (July 1995)
6. Copeland, G.P., Khoshafian, S.N.: A decomposition storage model. In: SIGMOD 1985: Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data, pp. 268–279. ACM, New York (1985)
7. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation. Prentice-Hall, Inc., Upper Saddle River (1999)
8. Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: ICDE 1998: Proceedings of the Fourteenth International Conference on Data Engineering, pp. 370–379. IEEE Computer Society, Washington, DC (1998)
9. Graefe, G.: Efficient columnar storage in B-trees. SIGMOD Record 36(1), 3–6 (2007)
10. Harizopoulos, S., Liang, V., Abadi, D.J., Madden, S.: Performance tradeoffs in read-optimized databases. In: VLDB 2006: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 487–498. VLDB Endowment (2006)
11. Heman, S., Zukowski, M., de Vries, A.P., Boncz, P.A.: Efficient and Flexible Information Retrieval Using MonetDB/X100. In: Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA (January 2007) (Demo Paper)
12. Holloway, A., DeWitt, D.: Read-optimized databases, in depth. In: VLDB 2008 (2008)
13. MacNicol, R., French, B.: Sybase IQ Multiplex - designed for analytics. In: VLDB 2004, pp. 1227–1230 (2004)
14. Ramamurthy, R., DeWitt, D.J., Su, Q.: A case for fractured mirrors. The VLDB Journal 12(2), 89–101 (2003)
15. Ross, K.A.: Modeling the performance of algorithms on flash memory devices. In: DaMoN 2008: Proceedings of the 4th International Workshop on Data Management on New Hardware, pp. 11–16. ACM, New York (2008)
16. Shah, M.A., Harizopoulos, S., Wiener, J.L., Graefe, G.: Fast scans and joins using flash drives. In: DaMoN 2008: Proceedings of the 4th International Workshop on Data Management on New Hardware, pp. 17–24. ACM, New York (2008)
17. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-Store: a column-oriented DBMS. In: VLDB 2005: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 553–564. VLDB Endowment (2005)
18. Stonebraker, M., Cetintemel, U.: "One size fits all": an idea whose time has come and gone. In: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, pp. 2–11 (2005)
19. Zhou, J., Ross, K.A.: A multi-resolution block storage model for database design. In: IDEAS 2003, pp. 22–33. IEEE Computer Society, Los Alamitos (2003)